Opened 7 years ago
Closed 6 years ago
#1311 closed help (completed)
Error in coupled HadGEM3-AO simulation: The latest NEMO restart dump does not seem to be consistent with the UM .phist file
Reported by: | luke | Owned by: | um_support |
---|---|---|---|
Component: | UM Model | Keywords: | NEMO/CICE, OASIS |
Cc: | pjn | Platform: | MONSooN |
UM Version: | 7.3 |
Description
Hello,
Both Peer (pjn) and I are having problems getting a HadGEM3-AO N48L60 simulation to run for more than a few years. The final error causing the job to crash is
ERROR: The latest NEMO restart dump does not seem to be consistent with the UM .phist file
This suggests an untidy model failure but you may be able to retrieve the run
by modifying UM history files or deleting the latest NEMO dumps as appropriate
---------------------------------------------------------
qsexecute: problem executing NEMO setup script
An example job is xjmkk.
While this appears to be a similar problem to ticket #1212, I'm not sure it is the same, as the error still occurs when there appears to be enough disk space remaining (either that, or space is used up and then quickly freed again).
Looking through the .leave files, I think the error may in fact be coming from the previous job-step. The file sizes of the final few files are:
-rw-r--r--. 1 nlabra users 9358264 Jun 11 06:05 xjmkk030.xjmkk.d14162.t045155.leave
-rw-r--r--. 1 nlabra users 9348837 Jun 11 06:27 xjmkk031.xjmkk.d14162.t060522.leave
-rw-r--r--. 1 nlabra users 9347946 Jun 11 06:51 xjmkk032.xjmkk.d14162.t062731.leave
-rw-r--r--. 1 nlabra users 522953 Jun 11 07:01 xjmkk033.xjmkk.d14162.t065135.leave
-rw-r--r--. 1 nlabra users 208142 Jun 11 07:28 xjmkk034.xjmkk.d14162.t070140.leave
and the elapsed times are:
xjmkk030.xjmkk.d14162.t045155.leave:Elapsed Time : 0:25:07 (1507 seconds, 14% of limit)
xjmkk031.xjmkk.d14162.t060522.leave:Elapsed Time : 0:22:10 (1330 seconds, 12% of limit)
xjmkk032.xjmkk.d14162.t062731.leave:Elapsed Time : 0:24:02 (1442 seconds, 13% of limit)
xjmkk033.xjmkk.d14162.t065135.leave:Elapsed Time : 0:10:04 (604 seconds, 6% of limit)
xjmkk034.xjmkk.d14162.t070140.leave:Elapsed Time : 0:00:05 (5 seconds, 0% of limit)
so it appears that something went wrong in job-step 33, but the model thinks it exited cleanly and can begin the next CRUN step.
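(For reference, a quick way to tabulate these is a short script along the following lines. This is only a sketch, assuming the .leave files sit in the current directory and each contain a single "Elapsed Time" line as in the output above.)

import glob
import re

# Print the elapsed time reported in each .leave file, so an unusually
# short job-step (like step 33/34 above) stands out at a glance.
for leave in sorted(glob.glob("xjmkk0*.xjmkk.*.leave")):
    with open(leave) as f:
        for line in f:
            m = re.search(r"Elapsed Time\s*:\s*(\S+)\s*\((\d+) seconds", line)
            if m:
                print(leave, m.group(1), m.group(2) + " s")
                break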
The only dumps present are:
xjmkka.das3910
xjmkka.das39b0
while the last ocean restart files are
xjmkko_28030830_restart_0002.nc
xjmkko_28030830_restart_0001.nc
xjmkko_28030830_restart_0000.nc
and the last ice restart file is
xjmkki.restart.2803-09-01-00000
so it looks like the model should be able to run from the 1st September dump, but the xjmkk.phist file contains the following:
&NLCHISTG
 END_DUMPIM='xjmkka.das39b0', ' ', ' ', ' ',
 RESTARTIM=' ', ' ', ' ', ' ',
 SAFEDMPIM='xjmkka.das3810', ' ', ' ', ' ',
 NEWSAFEIM='xjmkka.das3910', ' ', ' ', ' ',
 LASTATMIM=' ', ' ', ' ', ' ',
 CURRATMIM=' ', ' ', ' ', ' ',
 LASTDMPIM='xjmkka.das38l0', ' ', ' ', ' '
 /
i.e. it has not been updated to use xjmkka.das3910 as the last safe dump; instead it points to a dump file that no longer exists and would not match the ocean/ice restart files.
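(As an aside, one way to cross-check the .phist entries against what is actually on disk is a short script like the one below. This is only a sketch, assuming the third-party f90nml package is available and the script is run from the directory holding the dumps; the group and variable names are taken from the namelist above.)

import os
import f90nml  # third-party Fortran-namelist parser

# Read the NLCHISTG group from the UM partial history file and report
# which of the recorded dump names actually exist on disk.
hist = f90nml.read("xjmkk.phist")["nlchistg"]
for key in ("end_dumpim", "safedmpim", "newsafeim", "lastdmpim"):
    for dump in hist[key]:
        dump = dump.strip()
        if dump:
            status = "present" if os.path.exists(dump) else "MISSING"
            print(key.upper(), dump, status)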
An xxdiff between the 32nd and 33rd job-steps shows that the fort6 output is missing from job-step 33, and this step is also missing the line STOP END OF OASIS SIMULATION near the top of the .leave file, which is present in all the other job-steps. Could this be significant?
I'm concerned that the model appeared to finish without error at the end of the 33rd job-step, and I'm not sure how to stop it doing this again. It seems to be a general problem that has developed recently with this configuration, as previous jobs ran for 75+ years without issue on the phase2 machine. Could it be compiler-related (I'm not sure when the last compiler change was made on MONSooN)?
I'm not sure if the component model causing the problems is the UM, NEMO/CICE, or OASIS itself.
Any advice on how to get this job, and others like it, to run for more than a few years would be greatly appreciated.
Many thanks,
Luke
Change History (1)
comment:1 Changed 6 years ago by annette
- Resolution set to completed
- Status changed from new to closed
Peer and Luke found that the model failure was due to a NEMO crash at the previous job-step. This points to a bug in the coupling infrastructure, which meant that the model did not cease execution at that point. (Simple tests could not recreate this behaviour for a generic NEMO error.)
To paraphrase Peer and Luke:
Strongly increased salinity appeared in some grid cells in the Red Sea, which caused flow velocities there to exceed a reasonable range. Combined with the CO2 forcing, this led to an instability and a NEMO crash.
After investigation, a workaround was found: resetting the salinity values (sn and sb) in the Red Sea region to 35 psu in the NEMO restart dumps. This does not seem to affect the model evolution greatly, if at all.
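(For reference, a minimal sketch of this kind of restart edit using the third-party netCDF4 Python module is given below. The Red Sea index ranges are placeholders that depend on the ocean grid in use, the single-file restart name is illustrative, and the restart is assumed to have been recombined into one file with sn/sb dimensioned (t, z, y, x).)

from netCDF4 import Dataset  # third-party netCDF library

# Placeholder (j, i) index ranges for the Red Sea box -- these must be
# replaced with the correct ranges for the ocean grid being used.
RED_SEA_J = slice(60, 75)
RED_SEA_I = slice(100, 115)

# Illustrative name for a recombined NEMO restart file.
with Dataset("xjmkko_28030830_restart.nc", "r+") as ds:
    for name in ("sn", "sb"):  # "now" and "before" salinity fields
        sal = ds.variables[name]
        # Assumes dimensions (t, z, y, x); land points inside the box are
        # assumed to be handled by the model's own land-sea mask.
        sal[:, :, RED_SEA_J, RED_SEA_I] = 35.0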