Opened 2 months ago

Closed 4 weeks ago

#3238 closed help (fixed)

Restarting from existing run

Reported by: charlie
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: NEXCS
UM Version: 10.7

Description

Hi,

Sorry to bother you at this difficult time, but I have just encountered an error (with u-br871) that I haven't seen before:

[FAIL] Unable to find top restart files for this cycle. Must either have one, or as many as there are nemo processors (73)
[FAIL] Found 72 iceberg restart files
[FAIL] run_model # return-code=144
2020-03-30T13:39:42Z CRITICAL - failed/EXIT

This came after a simple continuation from the end of the previous run, i.e. not a proper restart (involving all the usual restart files) but simply extending an existing run that had succeeded all the way to its natural end. I thought, perhaps incorrectly, that all I had to do here was extend the runtime within the GUI (Run initialisation), i.e. if it had already done 100 years and I wanted another 100 years, I would simply change this to 200 years, then restart using rose suite-run --restart. Is that not correct? Whenever I have done this before, it has worked, so why is it now not finding the restart files? I certainly haven't moved them.

Thanks,

Charlie

Change History (7)

comment:1 Changed 2 months ago by grenville

Charlie

Ticket #3001 is a similar problem — any help?

Grenville

comment:2 Changed 2 months ago by charlie

Thanks Grenville. That other ticket seems to imply that there are too many restart files, and the question was asked whether that means some should be deleted so that it finds the exact number. That question wasn't answered, however.

In my case, therefore, should I delete some of the restart files, and if so, which ones? It says it has found 72 iceberg restart files, which is indeed correct, but the first line says that there are 73. So should I delete one, and if so, should this be just an individual PE file (so that there are 73), and if so which one (the first, the last, etc.), or the rebuilt version?

Charlie

comment:3 Changed 2 months ago by grenville

Charlie

Try moving the rebuilt file (don't delete it)

Grenville

comment:4 Changed 2 months ago by charlie

Hi Grenville,

Okay, I have now tried that: I moved ~/cylc-run/u-br871/share/data/History_Data/NEMOhist/br871o_icebergs_20350101_restart.nc (i.e. the rebuilt version, leaving the other 72 where they are) into my home directory, then ran rose suite-run --restart and triggered the failed coupled task, but I get the same error.

I wondered if the problem was not the above, but rather br871i.restart.2035-01-01-00000.nc in CICEhist instead, but this only has a rebuilt version i.e. doesn't have 72 individual PE versions.

Have I misunderstood something?

Charlie

comment:5 Changed 2 months ago by grenville

Please try moving br871o_20350101_restart_trc.nc out of /home/d05/cwilliams/cylc-run/u-br871/share/data/History_Data/NEMOhist too
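
The workaround in comments 3 and 5 (move the rebuilt restart files aside, don't delete them) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: it assumes the per-processor restart files end in a four-digit suffix such as _0000.nc, which this ticket does not confirm, and the function names rebuilt_alongside_pieces and move_aside are hypothetical, not part of any UM, NEMO or Rose tooling.

```python
import re
import shutil
from pathlib import Path

# ASSUMPTION: per-processor files are named <stem>_NNNN.nc (four digits).
# The ticket does not show the per-processor names, so adjust as needed.
PER_PE = re.compile(r"^(?P<stem>.+)_\d{4}\.nc$")


def rebuilt_alongside_pieces(names):
    """Return rebuilt files (<stem>.nc) that coexist with per-processor pieces."""
    stems = set()
    for name in names:
        match = PER_PE.match(name)
        if match:
            stems.add(match.group("stem"))
    return sorted(
        name
        for name in names
        if name.endswith(".nc")
        and not PER_PE.match(name)
        and name[:-3] in stems
    )


def move_aside(hist_dir, backup_dir, dry_run=True):
    """List the rebuilt files and, unless dry_run, move (not delete) them."""
    hist_dir = Path(hist_dir)
    names = [p.name for p in hist_dir.iterdir() if p.is_file()]
    targets = rebuilt_alongside_pieces(names)
    if not dry_run:
        backup = Path(backup_dir)
        backup.mkdir(parents=True, exist_ok=True)
        for name in targets:
            shutil.move(str(hist_dir / name), str(backup / name))
    return targets
```

Calling move_aside(hist_dir, backup_dir) with the default dry_run=True only reports which files would be moved, so the list can be checked before anything is touched. A rebuilt file with no per-processor pieces next to it (as described for CICEhist in comment 4) is left alone.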

comment:6 Changed 2 months ago by charlie

Yes, many thanks, that seems to have worked. So is this a systemic problem, or just a glitch that happens sometimes? As I said, I'm sure I have restarted an existing suite before, without having to remove/move certain files.

Charlie

comment:7 Changed 4 weeks ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed