Opened 6 months ago

Closed 5 months ago

#2357 closed help (answered)

Restarting a suite

Reported by: s.varma13
Owned by: ros
Priority: normal
Component: UM Model
Keywords: start dump
Cc:
Platform: Monsoon2
UM Version: 10.8

Description

Hi, my suite u-as691 stopped. The final output file for the suite is meant to be January 2015 but it only reached June 2005 according to /projects/ukca-imp/suvar/cyclc-run/u-as691/share/data/History_Data. The last start dump file was 20050601T0000Z in /projects/ukca-imp/suvar/cyclc-run/u-as691/share/cycle so I used that as my Model Basis Time to restart my run. However the suite failed saying the following:

"????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 10
? Error from routine: INITTIME
? Error message:
? Mismatch between model_basis_time read from namelist and validity time read
? from dump fixed header.
?
Rank 321 [Wed Jan 10 12:34:11 2018] [c8-1c1s0n2] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 321
? model_basis_time = 2005 6 1 0 0 0
? fixhd validity time = 2005 9 1 0 0 0
?
? If this is intentional disable this check by setting all elements of
? namelist:nlstcall=model_basis_time to zero. Otherwise make adjustments to
? either the namelist or dump to ensure that these two values match."

I do not want to use 01092005 as this will miss out 2 months of my run. Also, where is this start dump file? Does it exist, given that my run stopped in June 2005? Do I always need to start from the September of a year, so in this case 01092004? However, my cycle folder only contains files going back to 20050301T0000Z.

Could you please let me know how to proceed and what I should be using as my model basis time if this happens in the future?

Many thanks

Sunil

Change History (11)

comment:1 Changed 6 months ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Sunil,

The suite originally stopped in June 2005 because it ran out of wallclock. See log.20171226T132034Z/job/20050601T0000Z/atmos_main/01/job.err (You'll need to untar/zip the log directory first if you want to take a look)

So you just needed to change the wallclock time and then do a rose suite-run --restart for it to pick up where it left off.

Now that you've tried to restart it as a new run you'll have to continue down that path.
You've changed the "Model Basis Time" but you haven't explicitly told it which start dump to use - I'd guess $ASTARTDIR/${RUNID}a.da${DATEC_DUMP} is not referencing the correct start dump. In um → namelist → model input & output → dumping and meaning set astart to be the dump you wish to start from.

Make sure you switch off the build and reconfiguration before starting the run with rose suite-run.
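
As a rough cross-check before resubmitting, the date embedded in a dump file name can be compared with the Model Basis Time. The sketch below is only an illustration: it assumes dump names follow the <RUNID>a.da<YYYYMMDD>_<HH> pattern seen in this ticket (for example as691a.da20010601_00) and that the basis time is written as a cycle point such as 20010601T0000Z. The function names are hypothetical and not part of Rose or the UM. It reads only the file name, not the validity time stored in the dump's fixed header, which is what the model itself checks.

```python
import re
from datetime import datetime

# Assumed naming pattern, inferred from names such as as691a.da20010601_00.
DUMP_NAME = re.compile(r"^(?P<runid>\w+?)a\.da(?P<date>\d{8})_(?P<hour>\d{2})$")


def dump_time_from_name(path):
    """Return the datetime encoded in a dump file name (hypothetical helper)."""
    name = path.rsplit("/", 1)[-1]
    match = DUMP_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised dump name: {name}")
    return datetime.strptime(match["date"] + match["hour"], "%Y%m%d%H")


def basis_time_from_cycle_point(cycle_point):
    """Parse a cycle point such as 20010601T0000Z."""
    return datetime.strptime(cycle_point, "%Y%m%dT%H%MZ")


def name_matches_basis_time(astart, cycle_point):
    """True when the date in the dump name equals the Model Basis Time."""
    return dump_time_from_name(astart) == basis_time_from_cycle_point(cycle_point)


if __name__ == "__main__":
    # Illustrative values only.
    print(name_matches_basis_time("as691a.da20010601_00", "20010601T0000Z"))
```

A False result suggests astart and the Model Basis Time disagree and that one of them should be revisited before running.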

In future, if your suite stops, look in the log files to see what's happened, make any changes if required, and then restart the run with rose suite-run --restart. The suite will then pick up from where it left off.

Cheers,
Ros.

comment:2 Changed 6 months ago by s.varma13

Hi Ros, thank you.

I thought it had stopped again because of UM maintenance - the communications with the postproc had timed out after several tries and this has stopped the suite. That has happened twice now. The first time I just retriggered it. The second time the suite had stopped running, like this time, and I reset the model basis time to the last file it outputted. It failed and suggested an earlier start dump time - I changed the model basis time to that, pressed the triangle run button and it worked. This was in accordance with ticket #2331 and I did not switch off build and recon either time.

I hope I did the correct thing.

This time I already have the wallclock time as T3H. I thought that was the maximum.

Do I now just change the astart to /projects/ukca-imp/suvar/cyclc-run/u-as691/share/cycle/20050601T0000Z and switch off build UM and Run reconfiguration?

Could you also please let me know what rose suite-run --restart and rose suite-run correspond to in the rosie GUI? In the previous ticket, I was advised that if the suite is not running then I should set the model basis time to 19910601T0000Z and just run the suite again. I just pressed the triangle in the GUI. Was that right?

Thank you.

Cheers

Sunil

comment:3 Changed 6 months ago by s.varma13

PS where are the archived dumps produced by the simulation sent to?

comment:4 Changed 6 months ago by ros

Hi Sunil,

I've taken a look at all your log files and can see that you started the run in 19880901 and it ran until 20011201 where it stopped. In this time you might have had to restart the suite, but you didn't start the suite anew from a new Model Basis Time so all is well for this section.

When you started the model again after the stop in 20011201 you did a new run from 20010901, but unfortunately having not switched off reconfiguration, it took the initial aj670a.da20080901_00 dump file, reconfigured it into as691.da20010901_00 and used that to run from (See log.20171226T132034Z/job/20010901T0000Z/recon/01/job.out). Which obviously isn't what you wanted. This didn't get flagged as a problem because you were starting from September which happens to be the same month as the initial dump so the validity date check didn't error.

This time you've tried to start in June and again run the reconfiguration but as the months now mismatch it's thrown an error.
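
The comparison described in the error message can be sketched as follows. This is a Python illustration of the behaviour the message itself states (six-element times compared, with an all-zero model_basis_time disabling the check); it is not the UM's INITTIME routine, and the function name is hypothetical.

```python
def basis_time_matches(model_basis_time, fixhd_validity_time):
    """Mimic the comparison described in the INITTIME error message.

    Both arguments hold six integers: year, month, day, hour, minute, second.
    Returns True when the two times are compatible.
    """
    if len(model_basis_time) != 6 or len(fixhd_validity_time) != 6:
        raise ValueError("expected six elements in each time")
    # Per the error text, an all-zero model_basis_time disables the check.
    if all(element == 0 for element in model_basis_time):
        return True
    return tuple(model_basis_time) == tuple(fixhd_validity_time)


if __name__ == "__main__":
    # Values reported in the job output for this ticket.
    print(basis_time_matches((2005, 6, 1, 0, 0, 0), (2005, 9, 1, 0, 0, 0)))
```

With the values from the failed job the month differs (6 against 9), so the comparison fails.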

I think you will now want to back up and rerun from the 20010601 dump, which you will need to extract from the MASS archive.

To get files out of MASS you'll need to use moo get, something like:
moo get moose:crum/u-as691/ada.file/<dumpname>
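
For illustration, the retrieval command can be assembled from the suite ID and dump name as below. The moose:crum/<suite>/ada.file/<dumpname> layout is taken from the line above; the trailing destination argument and the helper name are assumptions, so check them against the MOOSE documentation before use. The sketch only builds and prints the command and does not contact MASS.

```python
import shlex


def moo_get_command(suite_id, dump_name, dest_dir):
    """Build a 'moo get' argument list (hypothetical helper, nothing is run)."""
    # URI layout as quoted in this ticket.
    source = f"moose:crum/{suite_id}/ada.file/{dump_name}"
    # Assumption: the local destination follows the source.
    return ["moo", "get", source, dest_dir]


if __name__ == "__main__":
    command = moo_get_command("u-as691", "as691a.da20010601_00", ".")
    # Print a copy-and-paste friendly form of the command.
    print(shlex.join(command))
```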

In general once a suite is running you only need to reset the Model Basis Time and do a new run (making sure reconfiguration is turned off) if things have got mucked up and you need to go back to a previous cycle or rose suite-run --restart doesn't work.

If you click the play button (triangle) Rose will do a rose suite-run and start a new run (equivalent to NRUN in UMUI terms).

To do a rose suite-run --restart from the Rose GUI click the little down arrow to the right of the play button (triangle) and select "Run suite". A dialog box pops up for you to enter further options. Enter "--restart" and click "OK".
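
The choice between the two commands can be summarised in a small sketch. It simply encodes the guidance given in this ticket; the function and argument names are hypothetical and it does not call Rose.

```python
def plan_resubmission(go_back_to_earlier_cycle=False, restart_failed=False):
    """Return (command, checklist) following the guidance in this ticket."""
    if go_back_to_earlier_cycle or restart_failed:
        # New run: equivalent to pressing the play button (triangle).
        checklist = [
            "set Model Basis Time to the cycle to run from",
            "point astart at the matching start dump",
            "switch off build and reconfiguration",
        ]
        return ["rose", "suite-run"], checklist
    # Normal case: pick up where the suite left off.
    return ["rose", "suite-run", "--restart"], []


if __name__ == "__main__":
    command, checklist = plan_resubmission(go_back_to_earlier_cycle=True)
    print(" ".join(command))
    for item in checklist:
        print("-", item)
```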

Regards,
Ros.


comment:5 Changed 6 months ago by s.varma13

Hi Ros,

thank you for this.

So just to confirm, I have moved
as691a.da20010601_00 to /projects/ukca-imp/suvar/as619.

I have set Model Base Time to 20010601T0000Z
I have set astart source to /projects/ukca-imp/suvar/as619/as691a.da20010601_00
I have switched off build and reconfigure.
I can now do rose suite-run --restart.

Is that correct?

Do all the new files (data and dump) generated override the ones which are currently in the moose archive?

Am I able to delete files on the Moose archive?

Many thanks again

Sunil

comment:6 Changed 6 months ago by s.varma13

Sorry Ros, I just tried to log into exvmsrose to open Rosie and got the following error:

Could not chdir to home directory /home/suvar: No such file or directory
/usr/bin/xauth: error in locking authority file /home/suvar/.Xauthority

Could you please let me know how I should resolve this?

Many thanks

Sunil

comment:7 Changed 6 months ago by ros

Hi Sunil,

I have set Model Base Time to 20010601T0000Z
I have set astart source to /projects/ukca-imp/suvar/as619/as691a.da20010601_00
I have switched off build and reconfigure.

Yes, this is correct, but you then need to do rose suite-run (not rose suite-run --restart) to start the run from this new date.

The new data that is generated will overwrite what is currently in the MASS archive. Only Met Office people can delete files from MASS.

Regarding login to exvmsrose, the Met Office are currently doing maintenance on Monsoon. The patching that they were doing this morning has taken longer than planned - I can't even log in to exvmsrose at the moment. If it still doesn't work once they have announced the completion of the work, let us know.

Cheers,
Ros.

comment:8 Changed 6 months ago by s.varma13

Hi Ros

Thank you - that all worked and the suite is now running.

Just one question - should I pause my suite when there is maintenance being done to PUMA or Monsoon? And is the Monsoon maintenance every second Tuesday of the month?

Best wishes

Sunil

comment:9 Changed 6 months ago by s.varma13

PS is it ever a problem pausing the suite and then restarting it?

comment:10 Changed 6 months ago by ros

Hi Sunil,

You can stop the suite if you want, but it's not necessary to do so. If the maintenance is not affecting running suites then you will potentially be wasting time. If the cylc server is rebooted and the suite is stopped as a result, you just need to restart it in the same way as if you had stopped it yourself.

Every Tuesday 9-11am is marked as a possible maintenance window on Monsoon. It does not necessarily mean there will be any maintenance done that will interrupt users or their running suites. Announcements that affect users will be made on the Monsoon Twiki and Yammer group.

Holding or stopping a suite and then releasing or restarting the suite should not cause any problems.

Cheers,
Ros.

comment:11 Changed 5 months ago by ros

  • Resolution set to answered
  • Status changed from accepted to closed