Opened 2 years ago

Closed 2 years ago

#2767 closed help (fixed)

Problems with restarting a model run

Reported by: aschurer
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: Monsoon2
UM Version: 11.0


Hi, I am running a UKESM experiment: u-bf095

A couple of weeks ago it stopped after 21 model years, with no obvious error message in the log files as far as I can see.

Consequently I tried to restart it from its current point using:
rose suite-run --restart

This started the simulation again, but very soon after submission coupled_rigorous failed with the following error message:

atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 06490] [c5-2c2s6n2] [Tue Feb 12 15:33:47 2019] PE RANK 1256 exit signal Aborted
[NID 06490] 2019-02-12 15:33:47 Apid 54653819: initiated application termination
[FAIL] run_model # return-code=137
2019-02-12T15:33:54Z CRITICAL - failed/EXIT
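As a rough aid to reading the log above: by common POSIX shell convention, a return code above 128 usually encodes 128 plus the number of the signal that ended the process, so 137 would correspond to signal 9. The sketch below only decodes that convention. The helper name `describe_return_code` is hypothetical and is not part of Rose, Cylc, or the UM, and whether `run_model` follows this convention is an assumption.

```python
import signal


def describe_return_code(code: int) -> str:
    """Interpret a shell-style return code (hypothetical helper).

    Assumes the usual convention that codes above 128 mean
    "terminated by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name of the signal on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {code}"


if __name__ == "__main__":
    # Return code taken from the "[FAIL] run_model" line in the log.
    print(describe_return_code(137))
```

A kill signal on the wrapper script does not by itself identify the root cause; the earlier "exit signal Aborted" line for PE RANK 1256 is the more likely place to start looking.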

Can you please advise what the problem could be and how I can restart the job?

Many thanks,

Change History (4)

comment:1 Changed 2 years ago by grenville


Which cycle?


comment:2 Changed 2 years ago by aschurer

Hi Grenville,

Sorry, I think I can see the problem now.

The cycle that failed was 18500401T0000Z, which should not have been running at all.

The suite is also running the correct cycle from 1871, e.g. 18711001T0000Z, which has worked OK.

How do I fix this situation so it does not happen again? Do I need to remove stale directories from cylc-run/u-bf095/work/?


comment:3 Changed 2 years ago by grenville


Is this OK now?


comment:4 Changed 2 years ago by willie

  • Resolution set to fixed
  • Status changed from new to closed