Opened 6 years ago

Closed 6 years ago

#1212 closed help (fixed)

Coupled model resubmission errors

Reported by: laurahb Owned by: annette
Component: UM Model Keywords: coupled model, hadgem3, resubmission
Cc: Platform: MONSooN
UM Version: 8.2

Description (last modified by annette)

Reported by user:

On MONSooN, model runs have failed due to time-outs or lack of disk space. Resubmitting the CRUN sometimes fails with this error:

---------------------------------------------------------

ERROR: The latest NEMO restart dump does not seem to be
       consistent with the UM .xhist file
       This suggests an untidy model failure but you
       may be able to retrieve the run by copying the
       backup dump and xhist files to original location
       and restarting with the appropriate NEMO dump
---------------------------------------------------------

This is complicated by the fact that archiving deletes intermediate dumps, so starting a new NRUN means retrieving dumps from up to 6 months ago.

Change History (4)

comment:1 Changed 6 years ago by annette

When restarting from a failure, the coupled model components first work out how long they have run for and so which dump to restart from:

  • the atmosphere goes by what is written in the xhist file
  • the ocean model uses the last dump written
  • the sea-ice model uses the filename written in ice.restart_file

Assume a model set up to run in monthly chunks with 10-day dumps (as in xixaq). Say the model fails after writing the dumps at day 30 of the month but before updating the xhist file. In this case it will not restart correctly: the ocean and sea-ice want to use the day 30 dumps, but the atmosphere wants to use the day 21 dumps.
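To make the mismatch concrete, here is a minimal Python sketch of the comparison. It is only an illustration: the function name and the dictionary of day labels are hypothetical, and in a real job the three values would have to be read from the xhist file, the ocean dump directory and ice.restart_file respectively.

```python
def find_restart_mismatch(restart_days):
    """Return the components that would restart later than the earliest one.

    restart_days maps a component name to the dump day it would restart
    from (hypothetical labels, following the day 21 / day 30 example).
    An empty result means all components agree.
    """
    earliest = min(restart_days.values())
    return {name: day for name, day in restart_days.items() if day != earliest}


# The failure case described above: xhist was not updated after the
# day 30 dumps were written, so the atmosphere lags behind.
restart_days = {"atmos": 21, "ocean": 30, "seaice": 30}
print(find_restart_mismatch(restart_days))
```

For the example values this reports the ocean and sea-ice as being ahead of the atmosphere, which is the situation that triggers the error message.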

I think that in this case it is probably safest to go back to the day 21 dumps, by deleting the ocean day 30 dumps and editing ice.restart_file to point to the day 21 sea-ice dump. Archiving shouldn't delete the atmos day 21 dumps before xhist is updated (I hope), and going back to day 21 ensures that none of the day 30 atmos diagnostic output is missed.
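As a rough sketch of that clean-up, assuming it is done by hand with a small Python script: the `ocean_restart_<YYYYMMDD>.nc` naming pattern, the function name and all of its arguments are hypothetical and need adapting to the actual job, and what exactly should be written into ice.restart_file depends on the set-up. Note that the sketch moves the later ocean dumps aside rather than deleting them, so the step can be undone.

```python
import shutil
from pathlib import Path


def roll_back_to(target_label, ocean_dir, ice_pointer, ice_restart_name, aside_dir):
    """Move ocean dumps later than target_label aside and repoint the ice restart.

    Assumes (hypothetically) that ocean dumps are named
    ocean_restart_<YYYYMMDD>.nc, so labels compare correctly as strings.
    Returns the names of the dumps that were moved.
    """
    ocean_dir, aside_dir = Path(ocean_dir), Path(aside_dir)
    aside_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for dump in sorted(ocean_dir.glob("ocean_restart_*.nc")):
        label = dump.stem.rsplit("_", 1)[-1]  # trailing YYYYMMDD part
        if label > target_label:
            shutil.move(str(dump), str(aside_dir / dump.name))
            moved.append(dump.name)
    # Overwrite the pointer file so the sea-ice restarts from the earlier dump.
    Path(ice_pointer).write_text(ice_restart_name + "\n")
    return moved
```

It is worth copying ice.restart_file somewhere safe before running anything like this, and checking the list of moved files before resubmitting the CRUN.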

Does this make sense? Hopefully it tallies with your experiences.

It is also possible to edit the xhist file to restart from the atmos dump you want, but that is more complicated… I am having a look at what happens when you do this to make sure it doesn't mess anything up.

Annette

comment:2 Changed 6 years ago by annette

  • Description modified (diff)

comment:3 Changed 6 years ago by annette

Hi Laura,

I've got into a bit of a muddle editing the xhist file - it works sometimes but not others… so I wouldn't recommend it.

As the error message says, if you do have xhist files archived alongside the dumps, then restoring those could work. I don't have that in my test job, but it doesn't use MOOSE archiving.

I think, though, that you should normally be in the situation described earlier, so you shouldn't need to worry about the xhist file.

Annette

comment:4 Changed 6 years ago by annette

  • Resolution set to fixed
  • Status changed from new to closed

Hi Laura,

I'm closing this ticket now, but if you still have issues you can reopen it or create a new ticket.

Annette
