Opened 2 months ago

Closed 4 weeks ago

#2767 closed help (fixed)

Problems with restarting a model run

Reported by: aschurer
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: Monsoon2
UM Version: 11.0

Description

Hi, I am running a UKESM experiment: u-bf095

A couple of weeks ago it stopped (after 21 model years) with, as far as I can see, no obvious error message in the log files.

Consequently I tried to restart it from its current point using:
rose suite-run --restart

This started the simulation again, but very soon after submission the coupled_rigorous task failed with the following error message:

atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 06490] [c5-2c2s6n2] [Tue Feb 12 15:33:47 2019] PE RANK 1256 exit signal Aborted
[NID 06490] 2019-02-12 15:33:47 Apid 54653819: initiated application termination
[FAIL] run_model # return-code=137
2019-02-12T15:33:54Z CRITICAL - failed/EXIT

Can you please advise what the problem could be and how I can restart the job?

Many thanks,
Andrew
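A note on reading the log above: by shell convention, a return code greater than 128 usually means the process was terminated by signal number (code - 128), so 137 would correspond to signal 9 (SIGKILL). The sketch below decodes a return code on that assumption. The helper name `describe_return_code` is hypothetical and not part of Rose or the UM, and whether the `run_model` wrapper follows this convention is not confirmed in this ticket.

```python
import signal


def describe_return_code(code: int) -> str:
    """Interpret a shell-style return code (hypothetical helper).

    Assumes the common convention that codes above 128 mean
    "terminated by signal (code - 128)".
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"unknown signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


# Return code taken from the "[FAIL] run_model" line in the log.
print(describe_return_code(137))
```

A SIGKILL on a batch system can have several causes (for example the scheduler killing the job), so this only narrows down how the process ended, not why.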

Change History (4)

comment:1 Changed 2 months ago by grenville

Andrew

Which cycle?

Grenville

comment:2 Changed 2 months ago by aschurer

Hi Grenville,

Sorry, I think I can see the problem now.

The cycle that failed was 18500401T0000Z, which should not have been running at all.

The suite is also running the correct cycle from 1871 (e.g. 18711001T0000Z), which has worked OK.

How do I fix this so it does not happen again? Do I need to remove stale directories from cylc-run/u-bf095/work/?

Thanks.
Andrew
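Before deciding whether anything under the work directory needs removing, it may help to see which cycle-point directories are there. The sketch below only lists directories whose names look like cycle points (such as 18500401T0000Z) and sort earlier than a chosen cycle; it deletes nothing. It assumes the work directory holds one subdirectory per cycle point, and it compares names as strings rather than dates so that non-Gregorian model calendars do not matter. The function name `stale_cycle_dirs` is hypothetical, and the ticket does not confirm that removing such directories is the right fix.

```python
import re
from pathlib import Path

# Cycle-point names as they appear in this ticket, e.g. 18500401T0000Z.
CYCLE_RE = re.compile(r"^\d{8}T\d{4}Z$")


def stale_cycle_dirs(work_dir, current_cycle):
    """List cycle-point directories under work_dir earlier than current_cycle.

    Names are fixed-width, so plain string comparison orders them
    chronologically without parsing them as calendar dates.
    """
    stale = []
    for entry in sorted(Path(work_dir).iterdir()):
        if (
            entry.is_dir()
            and CYCLE_RE.match(entry.name)
            and entry.name < current_cycle
        ):
            stale.append(entry)
    return stale


if __name__ == "__main__":
    # Path and cycle taken from the ticket; adjust for your own suite.
    work = Path.home() / "cylc-run" / "u-bf095" / "work"
    if work.is_dir():
        for path in stale_cycle_dirs(work, "18711001T0000Z"):
            print(path)
```

Listing first and acting only after checking the output keeps the step reversible.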

comment:3 Changed 7 weeks ago by grenville

Andrew

Is this OK now?

Grenville

comment:4 Changed 4 weeks ago by willie

  • Resolution set to fixed
  • Status changed from new to closed