Opened 6 months ago

Closed 6 months ago

#3020 closed help (fixed)

Restart failed GC3.1 with UM-NEMO restart mismatch

Reported by: m.couldrey
Owned by: um_support
Component: Rose/Cylc
Keywords: restart
Cc:
Platform: NEXCS
UM Version:

Description

My suite (u-bi909) running HadGEM3 GC3.1 stopped unexpectedly yesterday, I think because I ran out of quota in my NEXCS working directory (/work/projects/nexcs-n02/macou). That directory now has a little over 500GB free, but attempting to restart the suite makes the 'coupled' task fail with the following in job.err:
. Cycle time is 18881001

UM restart time is 18881021

[FAIL] top_controller: Mismatch in TOP restart file date 18881101 and NEMO restart file date 18881001
[FAIL] run_model # return-code=188
2019-09-24T12:18:44Z CRITICAL - failed/EXIT

I have tried stopping the suite (after active tasks have finished) and then running rose suite-run --restart in the roses/u-bi909 directory, but that just brings up the cylc gui showing that the suite has failed on this coupled task, and it doesn't seem to make the suite retry the coupled job.

I guess that the model was not properly able to write all the necessary restart files when it crashed because the quota was full. I suppose I need to try and restart the run using previous restart files. Looking in
/home/d00/macou/cylc-run/u-bi909/share/data/History_Data
I see that there are a few collections of restart files for a couple of different time points. For example, it looks like there are NEMO and CICE restart files for the time 1888/10/01. It also looks like there are UM restart files too, but I'm not familiar with how they are organised or which files are needed to restart at a particular point.

Would it be possible to please get any guidance on how to get my suite running again? Is my interpretation correct that a restart from a previous cycle point is necessary? If so, how do I go about doing that?

Many thanks for any help with this
Matt

Change History (3)

comment:1 Changed 6 months ago by ros

Hi Matt,

Here are some instructions (https://code.metoffice.gov.uk/trac/moci/wiki/tips_CRgeneral#RestartingFailingSuites) for how to restart failed coupled suites when start file dates get out of whack.

Cheers,
Ros.

comment:2 Changed 6 months ago by m.couldrey

Hi Ros

Thanks for that, I've bookmarked that page for future reference! I've followed the steps to remove any restarts later than 18881001 and retriggered the job. It's sat in the queue now, so hopefully it'll crack on again as expected and I'll update when it does.

Cheers!
Matt
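
For anyone repeating the "remove any restarts later than 18881001" step, a rough sketch of how the candidate files could be listed before deleting anything by hand. This is not taken from the linked wiki page: the function name restarts_after is made up for illustration, and it assumes every restart or dump file carries an 8-digit YYYYMMDD stamp somewhere in its name, which should be checked against the actual contents of History_Data. It only lists files and never removes them.

```python
import re
from pathlib import Path

# Assumption: a restart/dump file name contains an 8-digit YYYYMMDD stamp
# that is not part of a longer run of digits.
DATE_RE = re.compile(r"(?<!\d)(\d{8})(?!\d)")


def restarts_after(directory, cutoff):
    """List files under `directory` stamped with a date later than `cutoff`.

    `cutoff` is a YYYYMMDD string such as "18881001". Fixed-width date
    strings sort chronologically, so a plain string comparison is enough.
    Files without a recognisable date stamp are ignored.
    """
    late = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        match = DATE_RE.search(path.name)
        if match and match.group(1) > cutoff:
            late.append(path)
    return late


# Example use (prints candidates only; review the list before removing anything):
# for path in restarts_after(
#         "/home/d00/macou/cylc-run/u-bi909/share/data/History_Data", "18881001"):
#     print(path)
```

Listing first and deleting manually keeps the step reversible in case the naming assumption turns out not to hold for some of the files.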

comment:3 Changed 6 months ago by m.couldrey

  • Resolution set to fixed
  • Status changed from new to closed

Hi Ros
Looks like things have picked up nicely and the suite is rolling along again so I'll close the ticket. Thanks for your help!
Matt
