Opened 10 days ago

Closed 8 days ago

#2907 closed help (fixed)

Restarting failed GC3.1

Reported by: m.couldrey
Owned by: um_support
Component: Rose/Cylc
Keywords: rose, HadGEM3, restarting
Cc:
Platform: NEXCS
UM Version: 10.7

Description

Hi CMS

I recently noticed that my suite (u-bh150) stopped running on 8th May, and the cylc gui no longer shows any info (I guess because whatever happened has timed out by now). I want to restart the suite from where it left off, but I've never done this successfully and want to be sure I'm doing it correctly.

I'm not totally sure why it stopped running, and I'm struggling to find the fault. I've looked in my cylc-run directory for clues without reaching a firm conclusion. This is as far as I've got from the log/job/ directories. My suite is set up to cycle every 3 months:
18720101T0000Z: coupled succeeded, pptransfer successfully sent all files to jasmin

18720401T0000Z: coupled succeeded, pptransfer sent averaged output but not any restarts and the job.err shows:
2019-05-08T16:22:17Z WARNING - Message send failed, try 7 of 7: Cannot connect: https://xcslc1:43126/put_messages: <urlopen error [Errno 113] No route to host>
The pptransfer job.out doesn't say "fail" anywhere, it just seems to suggest everything went fine.

18720701T0000Z: coupled succeeded, postproc_atmos, _nemo, and _cice all seemed to succeed. But I suppose no pptransfer was attempted since the previous pptransfer did not succeed?

18721001T0000Z: coupled didn't succeed, probably timed out?

I guess pptransfer doesn't send restart files every cycle, only every year (since only the 18720101 cycle seems to have restart files on jasmin).

It seems like something timed out, possibly the pptransfer job for 18720401, since its job.err shows connection problems. I'm not sure whether that task finished as succeeded or timed out. Tasks for subsequent cycles then got held up.
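
For anyone checking their own job.err files for the same symptom, here is a minimal sketch in Python. The regular expression is modelled only on the single WARNING line quoted above, so it is an assumption that other cylc versions format the message the same way; the function names are made up for illustration. A failure on the final attempt ("try 7 of 7") suggests the job could not report back to the suite, which would be consistent with job.out looking fine while the suite never hears about it.

```python
import re

# Pattern modelled on the one WARNING line quoted in this ticket (assumption:
# other cylc versions may word or format the message differently).
SEND_FAILED = re.compile(
    r"^(?P<time>\S+) WARNING - Message send failed, "
    r"try (?P<attempt>\d+) of (?P<total>\d+): (?P<reason>.*)$"
)


def find_send_failures(lines):
    """Return (timestamp, attempt, total, reason) for each failed send."""
    failures = []
    for line in lines:
        match = SEND_FAILED.match(line.strip())
        if match:
            failures.append((
                match.group("time"),
                int(match.group("attempt")),
                int(match.group("total")),
                match.group("reason"),
            ))
    return failures


def exhausted_retries(failures):
    """True if any failure was the last permitted attempt (e.g. try 7 of 7)."""
    return any(attempt == total for _, attempt, total, _ in failures)


# The line from the 18720401T0000Z pptransfer job.err, split for readability.
sample = [
    "2019-05-08T16:22:17Z WARNING - Message send failed, try 7 of 7: "
    "Cannot connect: https://xcslc1:43126/put_messages: "
    "<urlopen error [Errno 113] No route to host>",
]
failures = find_send_failures(sample)
print(failures[0][0], exhausted_retries(failures))
```

To scan a real file, pass an open file handle in place of `sample`, e.g. `find_send_failures(open("job.err"))`.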

I came across these tips:
https://code.metoffice.gov.uk/trac/moci/wiki/tips_CRgeneral

Based on those tips, and the fact that some of my cycles got further than others (e.g. coupled jobs ran for more cycles than pptransfer jobs), it seems like my best bet is to follow the "Restarting from Archived Restarts" tips and restart the run using the last restarts that got sent to jasmin: the 18720101 restarts. Most of those tips make sense but I had a couple of questions:

I'm a little confused about which restarts I should set with NEMO_START, NEMO_ICEBERGS_START and CICE_INIT. I see in my jasmin directory for the 18720101T0000Z cycle I have bh150o_18720101_restart.nc as well as bh150o_18711201_restart.nc, and similar pairs for the cice, passive tracer, and iceberg restarts. Does nemo-cice need sets of restart files (by a 'set' I mean a restart.nc, a restart_trc.nc, icebergs_restart.nc and bh150i.restart) for two different time points? Or does it only need the restart file with the latest date in the name?
I also see the only restart I have for the UM is bh150a.da18720101_00, is this the only one I need for the UM?
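
As an aid to taking stock of what has been archived, the sketch below (Python) groups restart file names by the 8-digit date embedded in them. It only uses the three file names mentioned above; the helper name is hypothetical, and the grouping does not by itself say which set the model needs.

```python
import re
from collections import defaultdict

# First run of exactly-eight-or-more digits is taken as the YYYYMMDD date.
DATE = re.compile(r"(\d{8})")


def group_by_date(filenames):
    """Map each embedded YYYYMMDD date to the file names that carry it."""
    groups = defaultdict(list)
    for name in filenames:
        match = DATE.search(name)
        if match:
            groups[match.group(1)].append(name)
    return dict(groups)


# File names quoted in this ticket.
names = [
    "bh150o_18720101_restart.nc",
    "bh150o_18711201_restart.nc",
    "bh150a.da18720101_00",
]
groups = group_by_date(names)
for date in sorted(groups):
    print(date, sorted(groups[date]))
```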

Are there any particular gotchas I should look out for? The tips mention that BITCOMP_NRUN should be set to TRUE (although it already was in my suite). The tips also mention that seasonal mean output might be affected when starting from January. I'm not sure whether in my case I should start from 18720101 or 18710101.

Many thanks for your help!
Matt

Change History (6)

comment:1 Changed 10 days ago by dcase

If nothing has really failed, could you not just restart the suite, inspect the GUI, and retrigger from an appropriate point?

I may be being naive and misreading your question, but you can restart from the roses/suite-ID directory with rose suite-run --restart. I'm sure you know this, but I thought that I'd mention it in case stating the obvious turns out to be useful.

Dave

comment:2 Changed 10 days ago by dcase

Following a private conversation, the cheat-sheet is here:

http://metomi.github.io/rose/doc/html/cheat-sheet.html

and it suggests that the better command is rose suite-restart, which restarts from the run directory.

comment:3 Changed 10 days ago by m.couldrey

Hi Dave

Running rose suite-restart reveals that my suite appears to be running?
[FAIL] Suite "u-bh150" appears to be running:
[FAIL] Contact info from: "/home/d00/macou/cylc-run/u-bh150/.service/contact"
[FAIL] CYLC_SUITE_HOST=xcslc1
[FAIL] CYLC_SUITE_OWNER=macou
[FAIL] CYLC_SUITE_PORT=43126
[FAIL] CYLC_SUITE_PROCESS=57411 /usr/bin/python2 /common/fcm/cylc-7.8.1/bin/cylc-run u-bh150 --host=localhost
[FAIL] Try "cylc stop 'u-bh150'" first?

Opening the cylc gui shows me no tasks, and "stopped with 2 failed tasks".
Not sure what I should do here…
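
One way to check whether the contact information above is stale is to see if anything is still listening on the recorded host and port. The following is a minimal sketch in Python, assuming the contact details are KEY=VALUE lines as in the [FAIL] output; the function names are hypothetical and this is not part of cylc or rose. An open port does not prove it is the suite that is listening, so treat the result only as a hint.

```python
import socket


def parse_contact(text):
    """Parse CYLC_SUITE_* KEY=VALUE lines, ignoring any leading "[FAIL]" tag."""
    info = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("[FAIL]"):
            line = line[len("[FAIL]"):].strip()
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = line.partition("=")
        if sep and key.startswith("CYLC_SUITE_"):
            info[key] = value
    return info


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable host names.
        return False


# Lines copied from the error output in this comment.
sample = """\
[FAIL] Suite "u-bh150" appears to be running:
[FAIL] CYLC_SUITE_HOST=xcslc1
[FAIL] CYLC_SUITE_OWNER=macou
[FAIL] CYLC_SUITE_PORT=43126
"""
info = parse_contact(sample)
print(info["CYLC_SUITE_HOST"], info["CYLC_SUITE_PORT"])
```

On the machine where the suite was launched, `port_is_open(info["CYLC_SUITE_HOST"], int(info["CYLC_SUITE_PORT"]))` returning False would suggest the suite process has gone and the contact file was left behind.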

comment:4 Changed 9 days ago by dcase

Commands to stop and restart are here (under Restarting Rose suites):

https://collab.metoffice.gov.uk/twiki/bin/view/Support/MONSooNRose

comment:5 Changed 8 days ago by m.couldrey

Hey Dave

Thanks for helping me out with this. I'm really glad I've followed your suggestions rather than trying to start from archived restarts!

I decided to try and reproduce the error I posted in comment 3 by hitting rose suite-run --restart in the suite directory, but instead of failing as before, it started up again and has been running fine since. ¯\_(ツ)_/¯

Looks like it's all working now, cheers!

comment:6 Changed 8 days ago by dcase

  • Resolution set to fixed
  • Status changed from new to closed

Ok. Well I'll close this ticket, but feel free to reopen it if problems pop up further down the line.
