Opened 14 months ago

Closed 14 months ago

Last modified 14 months ago

#2597 closed help (fixed)

Mid Holocene suite "retrying"

Reported by: charlie Owned by: um_support
Component: UM Model Keywords:
Cc: Platform: ARCHER
UM Version:

Description

Hi,

My CMIP6 mid-Holocene production suite (u-ba469) is currently running well, but since 10 PM last night the postproc_nemo stage has been stuck on "retrying". Whenever this has happened before, it has meant either that there is a glitch at the RDF end, that I am out of budget, or that I am out of space. I have checked my budget and it is fine. As for space, although I have quite a lot (~33T), I know I have had more in the past, so that can't be the problem. Can you possibly advise? Shall I just leave it retrying for another 24 hours and see whether it succeeds or fails?

Charlie

Change History (7)

comment:1 Changed 14 months ago by ros

Hi Charlie,

No ticket is too small for the helpdesk. Have you looked in the job.err files for error messages for the failing postproc_nemo task on ARCHER/PUMA?

Cheers,
Ros.
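
A minimal sketch of the check suggested above: scan the job.err files of one task for lines starting with [FAIL]. The directory layout (cycle/task/submit number/job.err) is taken from the log path quoted later in this ticket; the function name find_fail_lines is hypothetical and not part of Rose or cylc.

```python
from pathlib import Path


def find_fail_lines(job_log_root, task_name):
    """Return (path, line) pairs for every [FAIL] line in a task's job.err files.

    job_log_root is assumed to be the suite's log/job directory, laid out as
    <cycle>/<task_name>/<submit number>/job.err.
    """
    hits = []
    pattern = f"*/{task_name}/*/job.err"
    for err_file in sorted(Path(job_log_root).glob(pattern)):
        # errors="replace" keeps the scan going if the log has odd bytes.
        for line in err_file.read_text(errors="replace").splitlines():
            if line.startswith("[FAIL]"):
                hits.append((err_file, line))
    return hits


if __name__ == "__main__":
    # Assumed location, based on the paths quoted in this ticket.
    root = Path.home() / "cylc-run" / "u-ba469" / "log" / "job"
    for path, line in find_fail_lines(root, "postproc_nemo"):
        print(f"{path}: {line}")
```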

comment:2 Changed 14 months ago by charlie

Hi Ros,

Okay, it has now failed. I have found the problem, but have no idea why it occurred. The error is as follows:

[FAIL]  grid-T Seasonal mean for Jun-Jul-Aug 2279 not possible as only got 2 file(s): 
	nemo_ba469o_1m_22790701-22790801_grid-T.nc, nemo_ba469o_1m_22790801-22790901_grid-T.nc
[FAIL] Terminating PostProc...
[FAIL] main_pp.py nemo # return-code=1
Received signal ERR
cylc (scheduler - 2018-09-02T22:20:15Z): CRITICAL Task job script received signal ERR at 2018-09-02T22:20:15Z
cylc (scheduler - 2018-09-02T22:20:15Z): CRITICAL failed at 2018-09-02T22:20:15Z

I don't understand this, however, because the files appear to be all present and correct. If I look on the RDF, at /nerc/n02/n02/cjrw09/pacmedy.d/gc31n96orca1_mh.d/u-ba469/22790701T0000Z, then unsurprisingly there are no nemo_ba469o_1m_2279*T.nc files, even though it seems to have carried on and created files for 2280.

If I look under /home/n02/n02/cjrw09/cylc-run/u-ba469/share/data/History_Data/NEMOhist however, all the relevant files appear to be there:

eslogin004:cjrw09$ ls nemo_ba469o_1m_2279*T.nc
nemo_ba469o_1m_22790701-22790801_grid-T.nc
nemo_ba469o_1m_22790801-22790901_grid-T.nc
nemo_ba469o_1m_22790901-22791001_grid-T.nc
nemo_ba469o_1m_22791001-22791101_grid-T.nc
nemo_ba469o_1m_22791101-22791201_grid-T.nc
nemo_ba469o_1m_22791201-22800101_grid-T.nc

As you can see, the 2 files required to do a seasonal mean are present and correct, namely 22790701-22790801 and 22790801-22790901. The month for June is not in this directory, but you wouldn't expect it to be because that should be in the previous cycle. And indeed it is - it has already been archived to the RDF, where it is /nerc/n02/n02/cjrw09/pacmedy.d/gc31n96orca1_mh.d/u-ba469/22790101T0000Z/nemo_ba469o_1m_22790501-22790601_grid-T.nc.
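
The file check described above can be sketched as a small script that builds the three monthly file names a seasonal mean would need and reports which are absent from a directory. The naming pattern is inferred from the listing above; the helper names are hypothetical, and the sketch only handles seasons that fall within one calendar year (so not Dec-Jan-Feb).

```python
from pathlib import Path


def monthly_file_name(prefix, year, month, grid):
    """Build a name such as nemo_ba469o_1m_22790601-22790701_grid-T.nc."""
    # The end date is the first day of the following month.
    if month == 12:
        next_year, next_month = year + 1, 1
    else:
        next_year, next_month = year, month + 1
    start = f"{year:04d}{month:02d}01"
    end = f"{next_year:04d}{next_month:02d}01"
    return f"{prefix}_1m_{start}-{end}_{grid}.nc"


def expected_season_files(prefix, year, months, grid):
    """Monthly files needed for a season lying within a single year."""
    return [monthly_file_name(prefix, year, m, grid) for m in months]


def missing_season_files(directory, prefix, year, months=(6, 7, 8), grid="grid-T"):
    """Return the expected monthly files that are not present in directory."""
    directory = Path(directory)
    names = expected_season_files(prefix, year, months, grid)
    return [name for name in names if not (directory / name).is_file()]


if __name__ == "__main__":
    # Assumed location, taken from the path quoted in this ticket.
    hist = Path.home() / "cylc-run/u-ba469/share/data/History_Data/NEMOhist"
    for name in missing_season_files(hist, "nemo_ba469o", 2279):
        print("missing:", name)
```

With only the July and August files in place, the sketch would report the June file (22790601-22790701) as missing, which matches the "only got 2 file(s)" message.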

So what's gone wrong here? It's almost like the archiving stage of the previous cycle has completed before the postprocessing stage of the current cycle has finished, meaning it moved June over to the archive before creating the JJA seasonal mean. But why has this happened, given that it has worked perfectly well every other time so far? Most importantly, how can I resolve the problem and restart?

Thanks,

Charlie

comment:3 Changed 14 months ago by grenville

Charlie

I think the trouble reported in /home/n02/n02/cjrw09/cylc-run/u-ba469/log/job/22790701T0000Z/postproc_nemo/01/job.err has created the subsequent problem. The solution is to put nemo_ba469o_1m_22790601-22790701_grid-T.nc back into NEMOhist:

cp /home/n02/n02/cjrw09/cylc-run/u-ba469/share/data/History_Data/NEMOhist/archive_ready/nemo_ba469o_1m_22790601-22790701_grid-T.nc /home/n02/n02/cjrw09/cylc-run/u-ba469/share/data/History_Data/NEMOhist/nemo_ba469o_1m_22790601-22790701_grid-T.nc

Then restart the suite.
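
For anyone who wants a guarded version of the copy above, here is a minimal sketch that does the same thing but refuses to run if the source is missing and leaves an existing target untouched. The function name restore_from_archive_ready is hypothetical; the archive_ready subdirectory is taken from the cp command above.

```python
import shutil
from pathlib import Path


def restore_from_archive_ready(hist_dir, file_name):
    """Copy file_name from hist_dir/archive_ready back into hist_dir.

    Returns the target path. The source copy is left in place, as with cp.
    """
    hist_dir = Path(hist_dir)
    source = hist_dir / "archive_ready" / file_name
    target = hist_dir / file_name
    if not source.is_file():
        raise FileNotFoundError(f"{source} not found")
    if target.exists():
        # Already present: do not overwrite.
        return target
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    return target


if __name__ == "__main__":
    hist = Path("/home/n02/n02/cjrw09/cylc-run/u-ba469/share/data/History_Data/NEMOhist")
    print(restore_from_archive_ready(hist, "nemo_ba469o_1m_22790601-22790701_grid-T.nc"))
```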

You have been very unlucky here — I have a query with ARCHER to see why it forgot your directory.

Grenville

comment:4 Changed 14 months ago by charlie

Okay, very many thanks. Why am I always the unlucky one?!

I have found that file - once I have put it back in, at what point do I restart from? Do I just do rose-suite --restart and then trigger the postproc_nemo stage (where it failed)? If I do this, though, what happens to the stages afterwards? As I said in my previous email, the next cycle (2280) has already begun, and its atmos_main succeeded and produced output. Will this be overwritten, or will it just return to the 2279 postproc_nemo stage, then jump over the 2280 atmos_main stage and go straight to the 2280 postproc_nemo stage?

comment:5 Changed 14 months ago by grenville

rose-suite --restart and retrigger the postproc_nemo task. The suite should take care of itself.

comment:6 Changed 14 months ago by grenville

rose suite-run --restart, that should be.

comment:7 Changed 14 months ago by charlie

  • Resolution set to fixed
  • Status changed from new to closed

Many thanks, it seems to have worked and is now carrying on with the next cycle.

Thanks again,

Charlie

Last edited 14 months ago by charlie