Opened 4 years ago

Closed 4 years ago

#2088 closed help (completed)

suite restarts from wrong month after shutdown

Reported by: marcus Owned by: ros
Component: UM Model Keywords:
Cc: Platform: ARCHER
UM Version: 10.4



I had to stop my model suite u-ag542 on ARCHER and when restarting it seems to want to start from the date one cycle prior to where it had finished. The problem is that I have already deleted the start dump from that date, and I'd like it to continue from where it had stopped.

More details:

In u-ag542 I set the cycling period to 6 months, which gives me the best queue/run time ratio. Eleven years into this run the model stopped at 19990301. When I routinely checked on the run by opening gcylc, the GUI was frozen and indicated that the previous cycle (starting at 19980901) was still queueing.

When looking at the log files I found nothing that would indicate a problem and the model output in History_data seemed complete, all indicating a normal stop of the run after the cycle ended in 19990301.

Therefore I killed the frozen Python processes on Puma and resubmitted with rose suite-run --restart. However, it seemed to want to start again from 19980301 (six months earlier). Now, after the weekend, gcylc no longer shows the suite as running, the log/atmos_main directory has no subdirectory 02 for a second attempt at running this cycle, and cylc scan does not show anything running.

Why has this suite crashed in the first place and how can I get it to restart from 19990301?

Many thanks,

Change History (9)

comment:1 Changed 4 years ago by marcus

Typo in second last para: it starts from 19980901, not 19980301.

comment:2 Changed 4 years ago by ros

Hi Marcus,

The gcylc GUI can sometimes stop updating the task statuses (freeze); this does not necessarily mean the suite has crashed. The tasks may still be running on ARCHER, and it might just need a shutdown and relaunch of gcylc. From the output files I can find, it's impossible to diagnose exactly what happened.

It does indeed look like the 19980901 cycle completed successfully so you could try doing a warm start from 19990301 by running:

rose suite-run --restart -- --warm 19990301T0000Z

If that doesn't work then you will have to re-run from 19980901.


comment:3 Changed 4 years ago by marcus

Hi Ros,

Thank you for helping me. I have tried this now but I'm getting an error message related to the --warm option as far as I can discern. Did I misunderstand something?

Many thanks,

marcus@puma:/home/marcus/roses/u-ag542> rose suite-run --restart -- --warm 19990301T0000Z
[INFO] delete: /home/marcus/.cylc/ports/u-ag542
[INFO] delete: log/rose-suite-run.conf
[INFO] symlink: rose-conf/20170220T133138-restart.conf <= log/rose-suite-run.conf
[INFO] delete: log/rose-suite-run.version
[INFO] symlink: rose-conf/20170220T133138-restart.version <= log/rose-suite-run.version
[INFO] export CYLC_VERSION=6.11.2
[INFO] export ROSE_ORIG_HOST=puma
[INFO] export ROSE_VERSION=2016.11.1
[INFO] WARNING: deprecated items were automatically upgraded in 'suite definition':
[INFO]  * (6.11.0) [cylc][event hooks][timeout handler] -> [cylc][events][timeout handler] - value unchanged
[INFO]  * (6.11.0) [cylc][event hooks][shutdown handler] -> [cylc][events][shutdown handler] - value unchanged
[INFO]  * (6.11.0) [cylc][event hooks][timeout] -> [cylc][events][timeout] - value unchanged
[INFO]  * (6.11.0) [cylc][event hooks] - value unchanged
[INFO]  * (6.11.0) [runtime][NCAS_NOT_SUPPORTED][job submission][method] -> [runtime][NCAS_NOT_SUPPORTED][job][batch system] - value unchanged
[INFO]  * (6.11.0) [runtime][LINUX][job submission][method] -> [runtime][LINUX][job][batch system] - value unchanged
[INFO]  * (6.11.0) [runtime][HPC][job submission][method] -> [runtime][HPC][job][batch system] - value unchanged
[INFO]  * (6.11.0) [runtime][SUBMIT_RETRIES][job submission][retry delays] -> [runtime][SUBMIT_RETRIES][job][submission retry delays] - value unchanged
[INFO]  * (6.11.0) [runtime][RETRIES][retry delays] -> [runtime][RETRIES][job][execution retry delays] - value unchanged
[INFO] 2017-02-20T13:31:41Z WARNING - task "NCAS_NOT_SUPPORTED" not used in the graph.
[INFO] 2017-02-20T13:31:41Z WARNING - task "RETRIES" not used in the graph.
[INFO] chdir: log/
[INFO] u-ag542: will restart on localhost
[FAIL] cylc restart u-ag542 --warm 19990301T0000Z # return-code=2, stderr=
[FAIL] Usage: cylc [control] restart [OPTIONS] REG 
[FAIL] Start a suite run from the previous state. To start from scratch (cold or warm
[FAIL] start) see the 'cylc run' command.
[FAIL] The scheduler runs in daemon mode unless you specify n/--no-detach or --debug.
[FAIL] Tasks recorded as submitted or running are polled at start-up to determine what
[FAIL] happened to them while the suite was down.
[FAIL] Arguments:
[FAIL]    REG               Suite name
[FAIL] cylc-restart: error: no such option: --warm

comment:4 Changed 4 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Marcus,

Sorry, my fault. That should have said:

rose suite-run -- --warm 19990301T0000Z
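The cycle point in this command is the dump date written as YYYYMMDD followed by T0000Z. A minimal shell sketch of that conversion follows; the helper name to_cycle_point is hypothetical (it is not part of rose or cylc), and it assumes cycles that start at midnight UTC, as in this suite.

```shell
# Hypothetical helper, not part of rose or cylc.
# Turns a date such as 19990301 into the cycle point 19990301T0000Z.
# Assumption: cycles start at 00:00 UTC, as in the command in this ticket.
to_cycle_point() {
    case "$1" in
        [0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])
            # Exactly eight digits: append the midnight UTC suffix.
            echo "${1}T0000Z"
            ;;
        *)
            # Anything else is rejected so that a typo is caught early.
            echo "expected a date as YYYYMMDD, got: $1" >&2
            return 1
            ;;
    esac
}

to_cycle_point 19990301
```

The printed value can then be placed after --warm in the command above.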


comment:5 Changed 4 years ago by marcus

Thank you, Ros, this has worked i.e. I could submit the job successfully so far.

Best regards,

comment:6 Changed 4 years ago by marcus

Hi Ros,

It's working fine now, thank you.

Can I just ask, to better understand what we did: what is the difference between
rose suite-run --restart and rose suite-run -- --warm <date>?

I have been looking for details on this in but couldn't find anything. Are there any other online tutorials that would cover this?

Many thanks,

comment:7 Changed 4 years ago by ros

Hi Marcus,

rose suite-run --restart

This restarts a suite from a previous state. This allows restarting a suite that was shut down or killed, without rerunning tasks that were already completed, or which were already submitted or running when the suite went down. It allows restarting of a suite from where it left off and is the best way to restart a suite.

rose suite-run -- --warm <cycle point>

This is called a warm start and runs a suite from scratch from a given cycle point that is later than the suite’s initial cycle point. All tasks from the given cycle point will run. It doesn't do any checking on the state of previously run tasks, so it may result in some tasks rerunning. A warm start is only required if a restart is not possible.
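As a rough illustration of the choice between the two, the shell sketch below prints (rather than runs) the matching command, so it can be checked before use. The function name suite_start_cmd is hypothetical and not part of rose or cylc; only the two command lines come from this ticket.

```shell
# Hypothetical helper, not part of rose or cylc.
# Prints the command to use instead of running it.
# Usage: suite_start_cmd [CYCLE_POINT]
suite_start_cmd() {
    if [ -z "${1:-}" ]; then
        # No cycle point given: restart from the previous suite state.
        echo "rose suite-run --restart"
    else
        # Cycle point given: warm start from that point.
        # --warm must not be combined with --restart; as the log in
        # comment:3 shows, cylc restart has no --warm option.
        echo "rose suite-run -- --warm $1"
    fi
}

suite_start_cmd
suite_start_cmd 19990301T0000Z
```

Called without an argument it prints the restart form; called with a cycle point it prints the warm start form.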

You can find more information on restarts, warm starts, etc in the cylc documentation:

Hope that helps.


comment:8 Changed 4 years ago by marcus

This is great, thank you very much!

Best regards,

comment:9 Changed 4 years ago by ros

  • Resolution set to completed
  • Status changed from accepted to closed