Opened 13 months ago

Closed 13 months ago

Last modified 12 months ago

#2601 closed help (fixed)

Model stuck on "retrying" again

Reported by: charlie Owned by: um_support
Component: UM Model Keywords:
Cc: Platform: NEXCS
UM Version: 10.7

Description

Hi,

Following on from ticket #2597, I think I might have encountered another example of the model tripping over itself: it appears to run ahead of itself and then generates an error saying it can't find the right directory, simply because that directory hasn't been created yet, and isn't meant to have been.

This time, the suite in question is u-ay314, which is running with 3 year cycling. The postprocessing stage of 20330901T0000Z is currently running, as is the atmos_main stage of the next cycle 20360901T0000Z. This is all well and good. But there is now a middle cycle, 20350901T0000Z, in which the RUN_MAIN is waiting but the postproc is stuck on retrying. I have looked at the error log, and it is giving me:

[FAIL]  check_directory: Exiting - Directory does not exist: /home/d05/cwilliams/cylc-run/u-ay314/work/20350901T0000Z/atmos_main
[FAIL] Terminating PostProc...
[FAIL] main_pp.py atmos # return-code=1
2018-09-06T10:32:49Z CRITICAL - failed/EXIT

This is absolutely correct: the directory doesn't exist. But that's because it shouldn't exist - given that I am using 3 year cycling, the suite should go straight from 2033 to 2036 (which indeed it does). So why has 2035 suddenly appeared, and why is the postproc trying to look for a directory which shouldn't be there?

Should I just ignore this and let it carry on with 2036 (which is correct), or do I need to do something about this?

Many thanks,

Charlie

Change History (7)

comment:1 Changed 13 months ago by charlie

Hi again,

Further to the above, I think my suite has worked and has successfully run to completion, despite the above problem and despite that 2035 postproc stage still being stuck on "retrying". In other words, the 2033 cycle appears to have run successfully, and has archived all of its 3 years onto JASMIN. Likewise the following cycle, 2036, again appears to have run successfully and has also archived everything to JASMIN. Given that the suite was always meant to end in August 2038, this is correct - I have checked everything on JASMIN, and all output up to August 2038 is present and correct. The actual suite, however, hasn't naturally stopped, because it's still trying to retry this weird 2035 postproc stage.

I'm guessing I can just shut down and kill the suite, given that everything I need appears to have been archived? But that still doesn't explain why the problem (which turned out not to be a problem) happened in the first place?

Charlie

comment:2 Changed 13 months ago by grenville

Hi Charlie

We were waiting a little while to hear the outcome, which is as expected - the valid parts of the suite ran OK and the odd cycle is left in limbo. You can shut down the suite.

As to why the 2035 postproc task appeared - regrettably, we don't know.

Grenville

comment:3 Changed 13 months ago by charlie

Okay, thanks. Just another unsolved mystery with the UM?

comment:4 Changed 13 months ago by charlie

  • Resolution set to fixed
  • Status changed from new to closed

comment:5 Changed 12 months ago by charlie

Hi guys,

I appreciate I have already closed this ticket, but I thought you might like to know that ALL of my other suites are also displaying the same error as above. In other words, as I said before, I was running 4 suites at once: u-ay314, u-ba408, u-ba436 and u-ba437 - the first has finished, and the others are very nearly finished. However, the last 3 are all displaying a weird 2035 cycle, which shouldn't be there in the first place and is again simply stuck on "retrying". Just like last time, the 2036 cycle (which is the last one) is running and is looking like it will complete perfectly. But the 2035 cycle is still there, so this issue is clearly not a one-off!

Charlie

comment:6 Changed 12 months ago by ros

Hi Charlie,

I took another look at this and realised what's going on. There is a specification in the suite.rc file for the suite to run a specific task, rose_arch_wallclock, only in the final cycle. Unfortunately, the calculation to determine the final cycle time only works if the run length is a multiple of the cycle length. In your setup this was not the case (a 50 year run with 3 year cycling), so it incorrectly identifies the final cycle as 2035, which is not a real cycle point.

start year (1988) + run length (50 years) - cycle length (3) ⇒ 2035

Mystery solved!
Cheers,
Ros.
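
For anyone wanting to check their own run length against their cycle length, the arithmetic above can be sketched in Python. This is only an illustration working in whole years: it is not the suite's actual suite.rc logic, and the function names (cycle_years, naive_final_cycle, last_real_cycle) are hypothetical.

```python
def cycle_years(start_year, run_length, cycle_length):
    """Years at which a cycle actually starts (whole-year simplification)."""
    return list(range(start_year, start_year + run_length, cycle_length))


def naive_final_cycle(start_year, run_length, cycle_length):
    """The calculation described above: start + run length - cycle length."""
    return start_year + run_length - cycle_length


def last_real_cycle(start_year, run_length, cycle_length):
    """Last cycle point that really exists, whatever the run length."""
    return cycle_years(start_year, run_length, cycle_length)[-1]


if __name__ == "__main__":
    start, run, cycle = 1988, 50, 3  # values quoted in this ticket

    years = cycle_years(start, run, cycle)
    naive = naive_final_cycle(start, run, cycle)

    print("last two real cycles:", years[-2:])
    print("naive final cycle:", naive)
    print("naive cycle is a real cycle point:", naive in years)
    print("run length is a multiple of cycle length:", run % cycle == 0)
```

With the values from this ticket, the real cycles end at 2033 and 2036, while the naive calculation gives 2035, which is not a cycle point. With a 48 year run (a multiple of 3) the two calculations agree on 2033.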

comment:7 Changed 12 months ago by charlie

Okay, understood, many thanks.
