Opened 4 days ago

Last modified 4 days ago

#2767 new help

Problems with restarting a model run

Reported by: aschurer
Owned by: um_support
Priority: normal
Component: UM Model
Keywords:
Cc:
Platform: Monsoon2
UM Version: 11.0

Description

Hi, I am running a UKESM experiment: u-bf095

A couple of weeks ago it stopped (after 21 model years) with, as far as I can see, no obvious error message in the log files.

Consequently I tried to restart it from its current point using:
rose suite-run --restart

This started the simulation again, but very soon after submission coupled_rigorous failed with the following error message:

atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 06490] [c5-2c2s6n2] [Tue Feb 12 15:33:47 2019] PE RANK 1256 exit signal Aborted
[NID 06490] 2019-02-12 15:33:47 Apid 54653819: initiated application termination
[FAIL] run_model # return-code=137
2019-02-12T15:33:54Z CRITICAL - failed/EXIT

Can you please advise what the problem could be and how I can restart the job?

Many thanks,
Andrew

Change History (2)

comment:1 Changed 4 days ago by grenville

Andrew

Which cycle?

Grenville

comment:2 Changed 4 days ago by aschurer

Hi Grenville,

Sorry, I think I can see the problem now.

The cycle that failed was 18500401T0000Z, which it should not have been running anyway.

It is also running the correct cycles from 1871, e.g. 18711001T0000Z, which has worked OK.

How do I fix this situation so it does not happen again? Do I need to remove stale directories from cylc-run/u-bf095/work/?

Thanks.
Andrew
