Opened 6 months ago

Closed 6 months ago

#3053 closed help (fixed)

Suite failing after decades

Reported by: ChrisWells Owned by: um_support
Component: UM Model Keywords:
Cc: Platform: Monsoon2
UM Version:

Description

Hi,

A suite, u-bm505, has failed mysteriously after running for >100 years and I'm unsure why; this is the error at the end of job.err:

Rank 602 [Thu Oct 24 17:06:34 2019] [c5-1c1s10n2] application called MPI_Abort(comm=0xC4000003, 1) - process 602
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 03754] [c5-1c1s10n2] [Thu Oct 24 17:10:32 2019] PE RANK 589 exit signal Aborted
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
[NID 03754] 2019-10-24 17:10:32 Apid 84031174: initiated application termination
[FAIL] run_model # return-code=137

I have similar runs which haven't failed, and restarting the suite gave the same error.

Sorry that I haven't got very far in figuring this out - do you know what I should do?

Cheers,
Chris

Change History (6)

comment:1 Changed 6 months ago by willie

  • Platform set to Monsoon2

Hi Chris,

It failed in the coupled task at cycle 22900101T0000Z with an MPI_Abort on processor 599. In /home/d00/chwel/cylc-run/u-bm505/work/22900101T0000Z/coupled, NaNs have been detected in ocean.output. The last file written is debug.notroot.02, and it says OASIS aborted with the message "NEMO initiated abort".

There are a couple of files called core* in that directory. Could you give me read permission on those, please:

chmod g+r core*

These seem old but might give a further clue. Also, can you say roughly what time you did the restart?

Willie

comment:2 Changed 6 months ago by ChrisWells

Hi Willie,

I've done the permissions just now. I did the restart (I just stopped the suite in the GUI after coupled had failed and ran rose suite-run --restart) when I noticed it had failed, mid-afternoon yesterday.
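For the record, the restart was roughly this (a sketch; I stopped the suite from the cylc GUI rather than the command line):

# run from the suite directory containing rose-suite.conf
rose suite-run --restart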

Cheers,
Chris

comment:3 Changed 6 months ago by willie

Hi Chris,

Thanks. The core file is from ocean.exe at 4 am on the 24th. The next thing that happens occurs at 17:33 on the 24th, as a result of the restart.

I think the ocean model has blown up. In cycle 22900101 the ocean.output file says,

NaN check
  
 q3 =  NaN
 NaN detected

and this extends back several cycles to 22890401T0000Z.

So I think you need to go back to three months before that and set up the model to run from there. You can find advice on how to do this at https://code.metoffice.gov.uk/trac/moci/wiki/tips_CRgeneral (Restarting if the model blows up).

As soon as the first cycle completes, check the ocean.output file to see if NaNs have been detected - I am not sure if it always says NaNs detected.
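A quick way to look is something like this (a sketch, assuming the same work-directory layout as above; <cycle> is the cycle that has just completed):

# search ocean.output for any NaN reports, with line numbers
grep -in NaN /home/d00/chwel/cylc-run/u-bm505/work/<cycle>/coupled/ocean.output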

As a general rule, when the model fails, try to find the underlying cause rather than simply restarting. A restart should only be used when the model has completed successfully and you just want to continue.

Willie

comment:4 Changed 6 months ago by ChrisWells

Hi Willie,

Many thanks for that. That page says to find the corresponding atmos restart file for the time I want to restart from, but I need 228901, and in

/home/d00/chwel/cylc-run/u-bm505/share/data/History_Data

the files only go back to 228904. If I use that one, won't the error just repeat? What should I use instead?

Cheers,
Chris

comment:5 Changed 6 months ago by willie

Hi Chris,

It should be in your archive under moose:/crum/u-bm505/ada.file/bm505a.da22890101_00.
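If you need to pull it back from the archive, something like this should do it (a sketch; check the moo syntax and the destination you actually want before running):

# retrieve the 22890101 atmosphere restart dump into the suite's History_Data directory
moo get moose:/crum/u-bm505/ada.file/bm505a.da22890101_00 /home/d00/chwel/cylc-run/u-bm505/share/data/History_Data/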

Willie

comment:6 Changed 6 months ago by ChrisWells

  • Resolution set to fixed
  • Status changed from new to closed

Hi Willie,

Of course! Thanks, I've done that. I'll close this now, in the hope that this has fixed it.

Cheers,
Chris
