Opened 2 years ago
Closed 2 years ago
#2674 closed help (answered)
Problem with re-submission
Reported by: | aschurer | Owned by: | um_support |
---|---|---|---|
Component: | UM Model | Keywords: | |
Cc: | | Platform: | Monsoon2 |
UM Version: | 11.0 | | |
Description
Hello,
I am running a UKESM experiment: u-ba056
It completes the first cycle OK but does not start the second.
I am currently trying to complete 1 model year.
I first set the cycling frequency to 3 months: it ran for 3 months, then failed before starting the fourth month.
I then set the cycling frequency to 1 month, and it ran for only 1 month before stopping. So the problem does not seem to be related to any particular month.
The error message from the second "coupled" process contained the following lines:
Rank 1248 [Tue Nov 13 15:13:16 2018] [c11-2c1s7n2] application called MPI_Abort(comm=0xC4000006, 1) - process 1248
Application 45579353 is crashing. ATP analysis proceeding…
Rank 1152 [Tue Nov 13 15:13:16 2018] [c11-2c1s6n3] application called MPI_Abort(comm=0xC4000009, 1) - process 1152
………
………
Rank 1221 [Tue Nov 13 15:13:19 2018] [c11-2c1s7n1] application called MPI_Abort(comm=0xC4000003, 1) - process 1221
Rank 1222 [Tue Nov 13 15:13:19 2018] [c11-2c1s7n1] application called MPI_Abort(comm=0xC4000003, 1) - process 1222
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 07582] [c11-2c1s7n2] [Tue Nov 13 15:17:18 2018] PE RANK 1244 exit signal Aborted
[NID 07582] 2018-11-13 15:17:18 Apid 45579353: initiated application termination
[FAIL] run_model # return-code=137
2018-11-13T15:17:28Z CRITICAL - failed/EXIT
I'm not sure what the problem is so any help will be gratefully received.
Thanks,
Andrew
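A side note on the `return-code=137` line in the log above: by common shell convention, an exit status above 128 means the process was ended by a signal whose number is the status minus 128. The Python sketch below decodes a status on that assumption. The helper name `describe_return_code` is hypothetical, and the result says nothing about why the job was killed.

```python
import signal


def describe_return_code(code: int) -> str:
    """Interpret a shell-style exit status (hypothetical helper).

    Assumes the usual convention that a status above 128 encodes
    "terminated by signal (status - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"unknown signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_return_code(137))
```

On Linux, 137 decodes to signal 9 (SIGKILL). That fits a launcher being killed after the ranks called MPI_Abort, but this is only a reading of the exit status, not a diagnosis.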
Change History (6)
comment:1 Changed 2 years ago by grenville
Andrew
The ocean drivers get confused if the work directory contains data from multiple different cycle lengths. Delete everything in /home/d05/aschurer/cylc-run/u-ba056/work and run again. It may still fail, but if so, diagnose the problem at that point without rerunning at a different cycle length.
Grenville
comment:2 Changed 2 years ago by aschurer
Hi Grenville,
Thanks for your suggestion.
I decided to delete the whole directory /home/d05/aschurer/cylc-run/u-ba056 to avoid any potential conflicts and start with a clean job.
I've resubmitted the experiment and now get an error in install_ancil which I have not seen before:
[FAIL] [Errno 13] Permission denied: '/home/d05/aschurer/cylc-run/u-ba056/share/data/etc/nemo_ancils_gl'
[FAIL] install: /home/d05/aschurer/cylc-run/u-ba056/share/data/etc/nemo_ancils_gl
[FAIL] source: /projects/ocean/hadgem3/ancil/ancil_versions/GC3.1/GC3.1_eORCA1v2.2x_nemo_ancils_vn10p9
2018-11-14T11:02:01Z CRITICAL - failed/EXIT
I've checked and I can definitely read the file /home/d05/aschurer/cylc-run/u-ba056/share/data/etc/nemo_ancils_gl from the command line on monsoon.
Can you advise as to what I've done wrong?
Many thanks,
Andrew
comment:3 Changed 2 years ago by grenville
Andrew
Please try to retrigger the failed task - install_ancil worked previously; I can't see why it would fail now
Grenville
comment:4 Changed 2 years ago by grenville
Andrew
Did re-triggering work?
comment:5 Changed 2 years ago by aschurer
Hi Grenville,
Sorry for not replying sooner.
I tried to re-trigger the task several times but always ran into the same problem.
I decided to copy the whole experiment to start afresh (now u-bd077).
Not sure what the problem was but this seemed to solve it.
Thanks,
Andrew
comment:6 Changed 2 years ago by grenville
- Resolution set to answered
- Status changed from new to closed
Thanks for letting us know.
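Grenville's suggestion to delete everything in the work directory can be sketched in Python as follows. This is a minimal illustration, not a UM or Rose/cylc tool: the helper name `clear_directory` is hypothetical, and it assumes the suite is not running while the directory is emptied. It removes the contents but keeps the directory itself.

```python
import shutil
from pathlib import Path


def clear_directory(work_dir: Path) -> int:
    """Delete every entry inside work_dir, keeping work_dir itself.

    Hypothetical helper; returns the number of top-level entries removed.
    """
    removed = 0
    for entry in work_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            # Real subdirectory: remove it and everything below it.
            shutil.rmtree(entry)
        else:
            # Regular file or symlink: remove the entry only.
            entry.unlink()
        removed += 1
    return removed


if __name__ == "__main__":
    # Path taken from the ticket; nothing happens if it does not exist.
    work = Path("/home/d05/aschurer/cylc-run/u-ba056/work")
    if work.is_dir():
        print(f"Removed {clear_directory(work)} entries from {work}")
```

Symlinks are unlinked rather than followed, so data that lives outside the work directory is left alone.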