Opened 5 years ago

Closed 4 years ago

#1311 closed help (completed)

Error in coupled HadGEM3-AO simulation: The latest NEMO restart dump does not seem to be consistent with the UM .phist file

Reported by: luke
Owned by: um_support
Component: UM Model
Keywords: NEMO/CICE, OASIS
Cc: pjn
Platform: MONSooN
UM Version: 7.3

Description

Hello,

Both Peer (pjn) and I are having problems getting a HadGEM3-AO N48L60 simulation to run for more than a few years. The final error causing the job to crash is:

ERROR: The latest NEMO restart dump does not seem to be
       consistent with the UM .phist file
       This suggests an untidy model failure but you
       may be able to retrieve the run by modifying UM
       history files or deleting the latest NEMO dumps
       as appropriate
---------------------------------------------------------
qsexecute: problem executing NEMO setup script

An example job is xjmkk.

While this appears to be a similar problem to ticket #1212, I'm not sure it is the same, as the error still occurs when there appears to be enough disk space remaining (either that, or space gets used and then quickly freed up again).

Looking through the .leave files, I think the error may in fact originate in the previous job-step. The file sizes of the final few files are:

-rw-r--r--. 1 nlabra users 9358264 Jun 11 06:05 xjmkk030.xjmkk.d14162.t045155.leave
-rw-r--r--. 1 nlabra users 9348837 Jun 11 06:27 xjmkk031.xjmkk.d14162.t060522.leave
-rw-r--r--. 1 nlabra users 9347946 Jun 11 06:51 xjmkk032.xjmkk.d14162.t062731.leave
-rw-r--r--. 1 nlabra users  522953 Jun 11 07:01 xjmkk033.xjmkk.d14162.t065135.leave
-rw-r--r--. 1 nlabra users  208142 Jun 11 07:28 xjmkk034.xjmkk.d14162.t070140.leave

and looking at the elapsed times gives

xjmkk030.xjmkk.d14162.t045155.leave:Elapsed Time         : 0:25:07 (1507 seconds, 14% of limit)
xjmkk031.xjmkk.d14162.t060522.leave:Elapsed Time         : 0:22:10 (1330 seconds, 12% of limit)
xjmkk032.xjmkk.d14162.t062731.leave:Elapsed Time         : 0:24:02 (1442 seconds, 13% of limit)
xjmkk033.xjmkk.d14162.t065135.leave:Elapsed Time         : 0:10:04 (604 seconds, 6% of limit)
xjmkk034.xjmkk.d14162.t070140.leave:Elapsed Time         : 0:00:05 (5 seconds, 0% of limit)

so it appears that something went wrong in step 33, but the model thinks it exited cleanly and can begin the next CRUN step.
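For reference, the kind of quick check that shows this up (a rough Python sketch only; it assumes the .leave files are in the current directory, matches the "Elapsed Time" line shown above, and the "less than half the longest step" cut-off is an arbitrary choice of mine):

# Sketch: flag suspiciously short job-steps by scanning the xjmkk*.leave files.
import glob
import os
import re

results = []
for leave in sorted(glob.glob("xjmkk*.leave")):
    seconds = None
    with open(leave, errors="replace") as fh:
        for line in fh:
            m = re.search(r"Elapsed Time\s*:\s*\S+\s*\((\d+) seconds", line)
            if m:
                seconds = int(m.group(1))
    results.append((leave, os.path.getsize(leave), seconds))

longest = max((s for _, _, s in results if s is not None), default=0)
for leave, size, seconds in results:
    # flag steps that ran for less than half the longest step -- an arbitrary cut-off
    short = seconds is None or seconds < 0.5 * longest
    flag = "  <-- suspiciously short" if short else ""
    print(f"{leave}: {size} bytes, elapsed {seconds} s{flag}")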

The only dumps existing are:

xjmkka.das3910
xjmkka.das39b0

while the last ocean restart files are

xjmkko_28030830_restart_0002.nc
xjmkko_28030830_restart_0001.nc
xjmkko_28030830_restart_0000.nc

and the last ice restart file is

xjmkki.restart.2803-09-01-00000

so it looks like the model should be able to run from the 1st September dump, but the xjmkk.phist file has the following

 &NLCHISTG
 END_DUMPIM='xjmkka.das39b0', '              ', '              ', '              ',
 RESTARTIM='                                                                                ', '                                                                                ', '                                                                                ', '                                                                                ',
 SAFEDMPIM='xjmkka.das3810', '              ', '              ', '              ',
 NEWSAFEIM='xjmkka.das3910', '              ', '              ', '              ',
 LASTATMIM='              ', '              ', '              ', '              ',
 CURRATMIM='              ', '              ', '              ', '              ',
 LASTDMPIM='xjmkka.das38l0', '              ', '              ', '              '
 /

i.e. SAFEDMPIM has not been updated to xjmkka.das3910 as the last safe dump; it still points to a dump file (xjmkka.das3810) that no longer exists and wouldn't match the ocean/ice restart files.
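For reference, a rough standard-library Python sketch of the consistency check I mean (the namelist keys are those shown above; the simple regex parsing, the DATA_DIR placeholder and the restart filename pattern are my own shortcuts, and a proper namelist parser would be more robust):

# Sketch: report which dumps the .phist file points at and whether they still exist on disk.
import glob
import os
import re

PHIST = "xjmkk.phist"   # UM partial history file
DATA_DIR = "."          # placeholder: the job's dump/restart directory

text = open(PHIST).read()
for key in ("END_DUMPIM", "SAFEDMPIM", "NEWSAFEIM", "LASTDMPIM"):
    m = re.search(key + r"='([^']*)'", text)
    name = m.group(1).strip() if m else ""
    exists = os.path.exists(os.path.join(DATA_DIR, name)) if name else False
    print(f"{key:10s} -> {name or '(blank)':16s} exists on disk: {exists}")

print("latest NEMO restarts:",
      sorted(glob.glob(os.path.join(DATA_DIR, "xjmkko_*_restart_*.nc")))[-3:])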

Doing an xxdiff between the 32nd and 33rd job-steps shows that the output to the fort6 files is missing from job-step 33, and this step is also missing the line STOP END OF OASIS SIMULATION near the top of the .leave file, which is present in all the other job-steps. Could this be significant?
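If it is significant, the affected steps are at least easy to flag before the next CRUN is allowed to start (a rough sketch, assuming a healthy .leave file always contains the OASIS stop message, as the other job-steps here do):

# Sketch: list .leave files that lack the OASIS "end of simulation" message.
import glob

for leave in sorted(glob.glob("xjmkk*.leave")):
    with open(leave, errors="replace") as fh:
        if not any("END OF OASIS SIMULATION" in line for line in fh):
            print("missing OASIS stop line:", leave)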

I'm concerned that the model appeared to finish without error at the end of the 33rd job-step, and I'm not sure how to stop it from doing this again. It seems to be a general error that has developed recently with this configuration, as previous jobs ran for 75+ years without issue on the phase2 machine. Could it be compiler related? (I'm not sure when the last compiler change was made on MONSooN.)

I'm not sure if the component model causing the problems is the UM, NEMO/CICE, or OASIS itself.

Any advice on how to get this job, and others similar to it, to run for more than a few years would be greatly appreciated.

Many thanks,
Luke

Change History (1)

comment:1 Changed 4 years ago by annette

  • Resolution set to completed
  • Status changed from new to closed

Peer and Luke found that the model failure was due to a NEMO crash at the previous job-step. This points to a bug in the coupling infrastructure that meant the model did not cease execution at that point. (Simple tests could not recreate this behaviour for a generic NEMO error.)

To paraphrase Peer and Luke:

Strongly increased salinity appeared in some grid cells in the Red Sea, which caused flow velocities there to exceed a reasonable range. Combined with the CO2 forcing, this led to an instability and a NEMO crash.

After investigation, a solution was found by resetting the salinity values (sn and sb) in the Red Sea region to 35 psu in the NEMO restart dumps. This doesn’t seem to affect the model evolution greatly, if at all.
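For anyone hitting the same problem, a rough sketch of how such a reset might be scripted with the Python netCDF4 module (the variable names sn and sb come from the fix described above; the restart filename, the assumed (t, z, y, x) dimension order and the i/j index box standing in for the Red Sea are placeholders that must be checked against the actual grid and domain decomposition before use):

# Sketch: reset Red Sea salinity (sn, sb) to 35 psu in one NEMO restart dump.
import netCDF4

RESTART = "xjmkko_28030830_restart_0000.nc"   # one per-PE restart file
J_SLICE = slice(100, 140)   # placeholder j-range for the Red Sea: check against the real grid
I_SLICE = slice(200, 230)   # placeholder i-range for the Red Sea: check against the real grid

with netCDF4.Dataset(RESTART, "r+") as nc:
    for var in ("sn", "sb"):              # "now" and "before" salinity fields named in the fix
        sal = nc.variables[var][:]        # assumed dimension order (t, z, y, x)
        box = sal[..., J_SLICE, I_SLICE]
        box[box > 0.0] = 35.0             # reset ocean points only; leave land/masked zeros alone
        sal[..., J_SLICE, I_SLICE] = box
        nc.variables[var][:] = sal

print("Red Sea salinity reset to 35 psu in", RESTART)

Each per-processor restart file (_0000, _0001, _0002) would need the same treatment.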
