Opened 2 years ago
Closed 22 months ago
#2767 closed help (fixed)
Problems with restarting a model run
| Reported by: | aschurer | Owned by: | um_support |
|---|---|---|---|
| Component: | UM Model | Keywords: | |
| Cc: | | Platform: | Monsoon2 |
| UM Version: | 11.0 | | |
Description
Hi, I am running a UKESM experiment: u-bf095
A couple of weeks ago it stopped (after 21 model years) with, as far as I can see, no obvious error message in the log files.
Consequently I tried to restart it from its current point using:
rose suite-run --restart
This started the simulation again, but very soon after submission coupled_rigorous failed with the following error message:
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 06490] [c5-2c2s6n2] [Tue Feb 12 15:33:47 2019] PE RANK 1256 exit signal Aborted
[NID 06490] 2019-02-12 15:33:47 Apid 54653819: initiated application termination
[FAIL] run_model # return-code=137
2019-02-12T15:33:54Z CRITICAL - failed/EXIT
Can you please advise what the problem could be and how I can restart the job?
Many thanks,
Andrew
Change History (4)
comment:1 Changed 2 years ago by grenville
Andrew
Which cycle?
Grenville
comment:2 Changed 2 years ago by aschurer
Hi Grenville,
Sorry, I think I can see the problem now.
The cycle that failed was 18500401T0000Z, which should not have been running anyway.
The suite is also running the correct cycle from 1871 (e.g. 18711001T0000Z), which has worked OK.
How do I fix this situation so it does not happen again? Do I need to remove stale directories from cylc-run/u-bf095/work/?
Thanks.
Andrew
comment:3 Changed 23 months ago by grenville
Andrew
Is this OK now?
Grenville
comment:4 Changed 22 months ago by willie
- Resolution set to fixed
- Status changed from new to closed