Opened 5 months ago

Closed 3 months ago

#3238 closed help (fixed)

Restarting from existing run

Reported by: charlie
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: NEXCS
UM Version: 10.7



Sorry to bother you at this difficult time, but I have just encountered an error (with u-br871) that I haven't seen before:

[FAIL] Unable to find top restart files for this cycle. Must either have one, or as many as there are nemo processors (73)
[FAIL] Found 72 iceberg restart files
[FAIL] run_model # return-code=144
2020-03-30T13:39:42Z CRITICAL - failed/EXIT

This came after a simple continuation from the end of the previous run, i.e. not a proper restart (involving all the usual restart files) but simply extending an existing run that had succeeded all the way to its natural end. I thought, perhaps incorrectly, that all I had to do here was extend the runtime within the GUI (Run initialisation), i.e. if it had already done 100 years and I wanted another 100 years, I would simply change this to 200 years, then restart with rose suite-run --restart. Is that not correct? Whenever I have done this before, it has worked, so why is it now not finding the restart files? I certainly haven't moved them.



Change History (7)

comment:1 Changed 5 months ago by grenville


Ticket #3001 is a similar problem — any help?


comment:2 Changed 4 months ago by charlie

Thanks Grenville. That other ticket seems to imply that there are too many restart files, and the question was asked whether that means some should be deleted so that it finds the exact number. That question wasn't answered, however.

In my case, therefore, should I delete some of the restart files, and if so which ones? It is saying it has found 72 iceberg restart files, which indeed is correct, but the first line says that there are 73. So should I delete one and if so should this be just the individual PE file (so that there are 73), and if so which one (the first, last etc) or the rebuilt version?
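For readers following the thread, the rule quoted in the [FAIL] message (exactly one restart file, or one per NEMO processor) can be sketched in Python. This is an illustration only, not the actual driver code used by the suite; the function names are hypothetical, and the glob pattern is left as a parameter because the thread does not give the restart file naming convention.

```python
from pathlib import Path


def count_is_acceptable(n_found: int, n_procs: int) -> bool:
    """Rule quoted in the error message: one file, or one per NEMO processor."""
    return n_found == 1 or n_found == n_procs


def count_matching(directory: str, pattern: str) -> int:
    """Count regular files in `directory` whose names match the glob `pattern`.

    `pattern` is an assumption supplied by the caller; the real naming
    convention for restart files is not stated in this ticket.
    """
    return sum(1 for p in Path(directory).glob(pattern) if p.is_file())
```

With the numbers from the log, count_is_acceptable(72, 73) is False, which is consistent with the failure reported.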


comment:3 Changed 4 months ago by grenville


Try moving the rebuilt file (don't delete it)
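Moving rather than deleting keeps the file recoverable if it turns out to be needed. A minimal Python sketch of setting files aside, assuming a hypothetical helper name (set_aside) and a backup directory chosen by the user; nothing here is part of Rose, Cylc or the UM.

```python
import shutil
from pathlib import Path


def set_aside(files, backup_dir):
    """Move each file into backup_dir (created if needed); return the new paths.

    Refuses to overwrite anything already present in backup_dir, so an
    earlier backup is never silently replaced.
    """
    backup = Path(backup_dir)
    backup.mkdir(parents=True, exist_ok=True)
    moved = []
    for f in files:
        src = Path(f)
        dest = backup / src.name
        if dest.exists():
            raise FileExistsError(f"{dest} already exists; not overwriting")
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

Restoring a file is then just a move in the opposite direction.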


comment:4 Changed 4 months ago by charlie

Hi Grenville,

Okay, I have now tried that: I moved ~/cylc-run/u-br871/share/data/History_Data/NEMOhist/ (i.e. the rebuilt version, leaving the other 72 where they are) into my home directory, ran rose suite-run --restart and then triggered the failed coupled task, but I get the same error.

I wondered if the problem was not the above, but rather in CICEhist instead, but this only has a rebuilt version i.e. doesn't have 72 individual PE versions.

Have I misunderstood something?


comment:5 Changed 4 months ago by grenville

Please try moving out of /home/d05/cwilliams/cylc-run/u-br871/share/data/History_Data/NEMOhist too

comment:6 Changed 4 months ago by charlie

Yes, many thanks, that seems to have worked. So is this a systemic problem, or just a glitch that happens sometimes? As I said, I'm sure I have restarted an existing suite before, without having to remove/move certain files.


comment:7 Changed 3 months ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed