#2088 closed help (completed)

suite restarts from wrong month after shutdown

Reported by: marcus
Owned by: ros
Priority: normal
Component: UM Model
Keywords:
Cc:
Platform: ARCHER
UM Version: 10.4

Description

Hi,

I had to stop my model suite u-ag542 on ARCHER and when restarting it seems to want to start from the date one cycle prior to where it had finished. The problem is that I have already deleted the start dump from that date, and I'd like it to continue from where it had stopped.

More details:

In u-ag542 I set the cycling period to 6 months, which gives me the best queue/run time ratio. 11 years into this run the model stopped at 19990301. When I routinely checked up on the run by opening gcylc, the GUI was frozen and indicated that the previous cycle (starting in 19980901) was still queueing.

When looking at the log files I found nothing that would indicate a problem and the model output in History_data seemed complete, all indicating a normal stop of the run after the cycle ended in 19990301.

Therefore I killed the frozen Python processes on Puma and resubmitted with rose suite-run --restart. However, it seemed to want to start again from 19980301 (six months earlier). Now, after the weekend, gcylc no longer shows this suite as running, there is no subdirectory 02 in the log/atmos_main directory for a second attempt at this cycle, and cylc scan does not show anything running.

Why has this suite crashed in the first place and how can I get it to restart from 19990301?

Many thanks,
Marcus

Change History (9)

comment:1 Changed 15 months ago by marcus

Typo in second last para: it starts from 19980901, not 19980301.
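
For reference, the six-month cycle arithmetic used in this ticket can be checked with a small shell sketch. The helper name next_cycle is hypothetical (it is not part of rose or cylc), and it assumes cycle points are written as YYYYMMDD and always fall on the same day of the month:

```shell
# next_cycle is a hypothetical helper, not a rose or cylc command.
# $1 = cycle point as YYYYMMDD, $2 = cycling period in months.
next_cycle() {
    nc_year=${1%????}        # first four characters: the year
    nc_rest=${1#????}        # remaining four characters: MMDD
    nc_month=${nc_rest%??}   # month, possibly with a leading zero
    nc_day=${nc_rest#??}     # day, passed through unchanged
    nc_month=${nc_month#0}   # drop the leading zero so 08/09 are not read as octal
    nc_total=$(( nc_year * 12 + nc_month - 1 + $2 ))
    printf '%04d%02d%s\n' $(( nc_total / 12 )) $(( nc_total % 12 + 1 )) "$nc_day"
}

next_cycle 19980901 6   # the cycle after the one starting 19980901
```

With a 6-month period, the cycle starting at 19980901 ends at 19990301, which is the point the suite should continue from.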

comment:2 Changed 15 months ago by ros

Hi Marcus,

The gcylc GUI can sometimes stop updating the task statuses (freeze); this does not necessarily mean the suite has crashed. The tasks may still be running on ARCHER, and it might just need a shutdown and relaunch of gcylc. From the output files I can find, it's impossible to diagnose exactly what happened.

It does indeed look like the 19980901 cycle completed successfully so you could try doing a warm start from 19990301 by running:

rose suite-run --restart -- --warm 19990301T0000Z

If that doesn't work then you will have to re-run from 19980901.

Regards,
Ros.

comment:3 Changed 15 months ago by marcus

Hi Ros,

Thank you for helping me. I have tried this now but I'm getting an error message related to the --warm option as far as I can discern. Did I misunderstand something?

Many thanks,
Marcus

marcus@puma:/home/marcus/roses/u-ag542> rose suite-run --restart -- --warm 19990301T0000Z
[INFO] delete: /home/marcus/.cylc/ports/u-ag542
[INFO] delete: log/rose-suite-run.conf
[INFO] symlink: rose-conf/20170220T133138-restart.conf <= log/rose-suite-run.conf
[INFO] delete: log/rose-suite-run.version
[INFO] symlink: rose-conf/20170220T133138-restart.version <= log/rose-suite-run.version
[INFO] export CYLC_VERSION=6.11.2
[INFO] export ROSE_ORIG_HOST=puma
[INFO] export ROSE_VERSION=2016.11.1
[INFO] WARNING: deprecated items were automatically upgraded in 'suite definition':
[INFO]  * (6.11.0) [cylc][event hooks][timeout handler] -> [cylc][events][timeout handler] - value unchanged
[INFO]  * (6.11.0) [cylc][event hooks][shutdown handler] -> [cylc][events][shutdown handler] - value unchanged
[INFO]  * (6.11.0) [cylc][event hooks][timeout] -> [cylc][events][timeout] - value unchanged
[INFO]  * (6.11.0) [cylc][event hooks] - value unchanged
[INFO]  * (6.11.0) [runtime][NCAS_NOT_SUPPORTED][job submission][method] -> [runtime][NCAS_NOT_SUPPORTED][job][batch system] - value unchanged
[INFO]  * (6.11.0) [runtime][LINUX][job submission][method] -> [runtime][LINUX][job][batch system] - value unchanged
[INFO]  * (6.11.0) [runtime][HPC][job submission][method] -> [runtime][HPC][job][batch system] - value unchanged
[INFO]  * (6.11.0) [runtime][SUBMIT_RETRIES][job submission][retry delays] -> [runtime][SUBMIT_RETRIES][job][submission retry delays] - value unchanged
[INFO]  * (6.11.0) [runtime][RETRIES][retry delays] -> [runtime][RETRIES][job][execution retry delays] - value unchanged
[INFO] 2017-02-20T13:31:41Z WARNING - task "NCAS_NOT_SUPPORTED" not used in the graph.
[INFO] 2017-02-20T13:31:41Z WARNING - task "RETRIES" not used in the graph.
[INFO] chdir: log/
[INFO] u-ag542: will restart on localhost
[FAIL] cylc restart u-ag542 --warm 19990301T0000Z # return-code=2, stderr=
[FAIL] Usage: cylc [control] restart [OPTIONS] REG 
[FAIL] 
[FAIL] Start a suite run from the previous state. To start from scratch (cold or warm
[FAIL] start) see the 'cylc run' command.
[FAIL] 
[FAIL] The scheduler runs in daemon mode unless you specify n/--no-detach or --debug.
[FAIL] 
[FAIL] Tasks recorded as submitted or running are polled at start-up to determine what
[FAIL] happened to them while the suite was down.
[FAIL] 
[FAIL] Arguments:
[FAIL]    REG               Suite name
[FAIL] 
[FAIL] cylc-restart: error: no such option: --warm
marcus@puma:/home/marcus/roses/u-ag542>

comment:4 Changed 15 months ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Marcus,

Sorry, my fault. That should have said:

rose suite-run -- --warm 19990301T0000Z

Cheers,
Ros.
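
The failure in comment:3 follows from how the options are passed on: the log shows that with --restart the arguments after "--" were handed to cylc restart, which has no --warm option, and the usage text points to cylc run for warm starts. The dry-run sketch below only prints the cylc command that would roughly result. cylc_command_for is a hypothetical helper, and the assumption that rose suite-run without --restart calls cylc run with the same arguments is a simplification (the real command may carry further options):

```shell
# cylc_command_for is a hypothetical helper; it prints a command and runs nothing.
# $1 = suite name, remaining arguments = what would follow "rose suite-run".
cylc_command_for() {
    ccf_suite=$1
    shift
    ccf_sub=run                      # assumed default: rose suite-run -> cylc run
    if [ "${1:-}" = "--restart" ]; then
        ccf_sub=restart              # --restart -> cylc restart (as seen in the log)
        shift
    fi
    if [ "${1:-}" = "--" ]; then
        shift                        # everything after "--" is passed on to cylc
    fi
    ccf_cmd="cylc $ccf_sub $ccf_suite"
    if [ $# -gt 0 ]; then
        ccf_cmd="$ccf_cmd $*"
    fi
    echo "$ccf_cmd"
}

cylc_command_for u-ag542 --restart -- --warm 19990301T0000Z   # the failing form
cylc_command_for u-ag542 -- --warm 19990301T0000Z             # the corrected form
```

The first call reproduces the command reported in the [FAIL] line of the log; the second yields a cylc run command, which is where --warm is accepted.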

comment:5 Changed 15 months ago by marcus

Thank you, Ros, this has worked, i.e. I could submit the job successfully so far.

Best regards,
Marcus

comment:6 Changed 15 months ago by marcus

Hi Ros,

It's working fine now, thank you.

To better understand what we did, can I just ask what the difference is between
rose suite-run --restart and rose suite-run -- --warm <date>?

I have been looking for details on this in http://collab.metoffice.gov.uk/twiki/bin/view/Support/HowToRunTheUMInRose but couldn't find anything. Are there any other online tutorials that would cover this?

Many thanks,
Marcus

comment:7 Changed 15 months ago by ros

Hi Marcus,

rose suite-run --restart

This restarts a suite from a previous state. This allows restarting a suite that was shut down or killed, without rerunning tasks that were already completed, or which were already submitted or running when the suite went down. It allows restarting of a suite from where it left off and is the best way to restart a suite.

rose suite-run -- --warm <cycle point>

This is called a warm start and runs a suite from scratch from a given cycle point that is later than the suite’s initial cycle point. All tasks from the given cycle point will run; it doesn't do any checking on the state of previously run tasks, so it may result in some tasks rerunning. A warm start is only required if a restart is not possible.

You can find more information on restarts, warm starts, etc in the cylc documentation: https://cylc.github.io/cylc/documentation.html

Hope that helps.

Regards,
Ros.
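
As a summary of the two options, here is a minimal dry-run sketch. start_cmd is a hypothetical helper that only prints the command to type; it assumes the date stamp is given as YYYYMMDD and that cycles start at 00:00Z, as in this ticket:

```shell
# start_cmd is a hypothetical helper; it prints a command and runs nothing.
# No argument: prefer a plain restart. One argument: warm start from that date.
start_cmd() {
    if [ $# -eq 0 ]; then
        echo "rose suite-run --restart"
        return 0
    fi
    case $1 in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])
            # T0000Z is an assumption matching the cycle point used in this ticket
            echo "rose suite-run -- --warm ${1}T0000Z"
            ;;
        *)
            echo "expected a YYYYMMDD date stamp, got: $1" >&2
            return 1
            ;;
    esac
}

start_cmd            # restart from the previous state (preferred)
start_cmd 19990301   # warm start, only if a restart is not possible
```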

comment:8 Changed 15 months ago by marcus

This is great, thank you very much!

Best regards,
Marcus

comment:9 Changed 14 months ago by ros

  • Resolution set to completed
  • Status changed from accepted to closed