Opened 5 years ago

Closed 5 years ago

#1473 closed help (fixed)

Job not running as long as expected

Reported by: csteadman Owned by: luke
Component: UKCA Keywords: ukca
Cc: luke Platform: MONSooN
UM Version: 7.3

Description

Hello,

I'm trying to understand why I'm not getting as many months of output as I would expect. I've run copies of nitrate-extended GLOMAP jobs on MONSooN and ARCHER, and even though the run length is 1 year, 4 months, and 2 days, I only get one pm (monthly-mean) file, for September 1999. I'd like to get ps and py files as well.

I've tried copying my jobs and using a new job ID on MONSooN, but I'm still getting the same output.

On MONSooN, my job ID is xkyte (past runs are xkytc, and the original job from Graham Mann is xiupm).

When I look at the output in /nerc/ukca/clstea/xkyte/ and /projects/ukca/clstea/xkyte/ I see the following files (I think these are the only ones that can be opened in xconv):
ls xkytea.*
xkytea.daj99b0 xkytea.daj9a10 xkytea.pej99l0 xkytea.pmj9sep
xkytea.daj99l0 xkytea.pej99b0 xkytea.pej9a10

I'd expect to see more monthly mean (pm) files. I've tried adjusting the resubmission settings, dumping frequency, and archive location, adding additional diagnostics, and changing the queue (prime, night, default) and the job time limit (the wall clock limit doesn't change; it stays at 3 hours).

However, I don't think the job time limit is the issue, as the job stops well before the limit. This is an excerpt from the most recent run's output file xkyte000.xkyte.d15041.t151559.archive

End of job report
Run at 2015-02-10 19:59:21 for jobstep mon001.172226.0
Submitted : 2015-02-10 15:37:15
Queued : 2015-02-10 15:37:15
Dispatched : 2015-02-10 17:45:17
Queued Time : 2:08:02 (7682 seconds)
Elapsed Time : 2:14:04 (8044 seconds, 74% of limit)
Wall Clock Limit : 3:00:00 (10800 seconds)

I've also run the job that was ported to ARCHER. My job is xkzpb (original ported job is xjnjn). The start date is 1 September 1999, and the run length is 4 months, 2 days. I get a similar set of output files:
claudia@eslogin005:/work/n02/n02/claudia/um/xkzpb> ls -ltr xkzpba.*
-rw-r--r-- 1 claudia n02 4609523712 Feb 10 04:08 xkzpba.daj99b0
-rw-r--r-- 1 claudia n02 4407296 Feb 10 04:08 xkzpba.pej99b0
-rw-r--r-- 1 claudia n02 4609523712 Feb 10 04:43 xkzpba.daj99l0
-rw-r--r-- 1 claudia n02 4407296 Feb 10 04:44 xkzpba.pej99l0
-rw-r--r-- 1 claudia n02 4609523712 Feb 10 05:19 xkzpba.daj9a10
-rw-r--r-- 1 claudia n02 2095579136 Feb 10 05:19 xkzpba.pmj9sep
-rw-r--r-- 1 claudia n02 4407296 Feb 10 05:19 xkzpba.pej9a10

I've been able to adjust the job limit time in ARCHER, but I don't think that's the issue. Below is an excerpt from my .leave file xkzpb000.xkzpb.d15040.t095949.leave:
Resources requested: ncpus=192,place=free,walltime=12:00:00
Resources allocated: cpupercent=23,cput=00:30:00,mem=16244kb,ncpus=192,vmem=384996kb,walltime=02:08:02

In summary, for both jobs, I'd like to know why the only monthly-mean file is for September. I would like to run the jobs for over a year to get ps and py files.

Thanks for your help,
Claudia

Change History (5)

comment:1 Changed 5 years ago by luke

  • Owner changed from um_support to luke
  • Status changed from new to accepted

Hi Claudia,

Are you turning on the CRUN step? As far as I can see, you have it set up to run the NRUN step only - the resubmission is turned off.

The maximum queue length on MONSooN is 3 hours, so the model is set up to run in chunks. You do the NRUN step first and then turn the CRUN step on, which allows the model to run until completion. Once the NRUN step has finished, turn off compilation and turn on the CRUN hand-edit, which is near the end of your hand-edits list:

~ros/HadGEM3-A/vn7.3/HGPKG1/crun.ed

Currently you have disabled the resubmission, though; this needs to be turned back on. Looking at one of your jobs, you can't run 30 days in 3 hours. The original job ran 20 days per jobstep. You need to stick with this, or perhaps play around with the number of nodes to see if you can fit 30 days in the 3-hour queue length (10800 s).
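As a rough guide to how many resubmitted jobsteps the full run needs, here is a minimal sketch. It assumes the model uses a 360-day calendar (30-day months), which is not stated in this ticket, and the function names are made up for illustration.

```python
import math

# Assumption: 360-day model calendar (12 months of 30 days).
DAYS_PER_MONTH = 30
DAYS_PER_YEAR = 360


def run_length_days(years, months, days):
    """Total run length in model days under the assumed calendar."""
    return years * DAYS_PER_YEAR + months * DAYS_PER_MONTH + days


def jobsteps_needed(total_days, days_per_step):
    """Number of jobsteps (NRUN plus CRUNs) needed to cover the run."""
    return math.ceil(total_days / days_per_step)


# Run length from the ticket: 1 year, 4 months, 2 days.
total = run_length_days(1, 4, 2)
for step in (20, 30):
    print(f"{step} days per jobstep: {jobsteps_needed(total, step)} jobsteps")
```

Under that assumption, 20 days per jobstep means noticeably more resubmissions than 30 days per jobstep, but each one has to fit inside the 3-hour limit.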

Graham's original job was set to run like this, although he didn't have archiving on. I would advise, on MONSooN, archiving the data to MOOSE as the job runs as this puts the least amount of pressure on the /projects and /nerc disks, which are currently around 90+% full all the time.

To archive to MOOSE you should turn off the

fcm:um-br/pkg/Share/VN7.3_HadGEM3-A_r2.0/src

branch and replace it with

fcm:um_br/dev/jeff/VN7.3_HadGEM3-A_r2.0_hector_monsoon_archiving/src

This will still give you the HadGEM3-A r2.0 configuration, but includes archiving on top of it. You also need to turn on the hand-edit

~jeff/umui_jobs/hand_edits/archiving_7.3

which you currently have turned off.

Then, in the

Post-processing
 -> Main Switch + General Questions

panel you need to change it so that it goes to

The New System (MOOSE)

with the Monsoon project group name set to ukca. You should also ensure that the superseded files are deleted (the 3 radio buttons at the top).

I would suggest taking a copy of Graham's job again and making the changes, as that has the CRUN step set up correctly.

Also, be careful of which output streams are set to archive. If you are sending everything to UPMEAN then you should be OK, but if you send things to UPA, UPB etc. then you need to make sure that these are set to archive. This is done in the final column of the

Post-processing
 -> Initialisation and processing of mean and standard PP files

panel. Everything set to N is not archived. This has caught me out before when a stream I was interested in was deleted!

Have a go with this and come back if you have any problems.

Thanks,
Luke

comment:2 Changed 5 years ago by grenville

Claudia

You only have 3 dumps in the output directory. With 10-day dumping, that's 30 days' worth, which matches the monthly mean output.
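The dump dates can be read off the file names. The sketch below assumes the date stamp after the two-letter stream code is five characters (decade since 1800, year in decade, month, day, hour), each encoded as 0-9 then a-z. That convention is an assumption here rather than something confirmed in this ticket, and `decode_dump_stamp` is a made-up helper name.

```python
import string

# Assumed encoding: characters 0-9 then a-z stand for the values 0..35.
DIGITS = string.digits + string.ascii_lowercase


def decode_dump_stamp(filename):
    """Decode a name like 'xkytea.daj99b0' into (year, month, day, hour)."""
    # Take the part after the dot and drop the two-letter stream code ("da").
    stamp = filename.split(".", 1)[1][2:]
    decade, year, month, day, hour = (DIGITS.index(c) for c in stamp)
    # Assumption: the decade character counts decades since 1800.
    return (1800 + 10 * decade + year, month, day, hour)


for name in ("xkytea.daj99b0", "xkytea.daj99l0", "xkytea.daj9a10"):
    print(name, decode_dump_stamp(name))
```

If the assumed convention holds, the three dumps fall on 11 September, 21 September, and 1 October 1999, which is consistent with 10-day dumping from a 1 September start.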

You can look in the leave file (/projects/ukca/clstea/xkyte/xkyte.fort6.pe1 for example) and see that the model only ran for 2160 timesteps.
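The two observations can be cross-checked against each other. This is a small sketch that assumes the 3 dumps and the 2160 timesteps describe the same run; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86400


def implied_timestep_seconds(n_dumps, dump_interval_days, n_timesteps):
    """Timestep length implied by the dump count and the timestep count."""
    simulated_days = n_dumps * dump_interval_days
    return simulated_days * SECONDS_PER_DAY / n_timesteps


# 3 dumps at 10-day intervals, 2160 timesteps reported in the fort6 file.
dt = implied_timestep_seconds(3, 10, 2160)
print(f"implied timestep: {dt:.0f} s ({dt / 60:.0f} minutes)")
```

The numbers agree if the model timestep is 20 minutes, that is, 72 timesteps per model day.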

Grenville

comment:3 Changed 5 years ago by csteadman

Hi Luke,

Thanks. I took a copy of Graham's job xiupn and made the changes you suggested. My new job ID is xkytf. I turned the CRUN hand edit on, and turned resubmission back on again. (I had suspected the resubmission wasn't working, so I had turned it off — I thought I might be able to get at least a few months out at one time in one run, if the run could be longer than 3 hours.) I'd compared my job setup with some successful CRUN jobs, but they used UM vn 8.2 or 8.4, where there are CRUN options in the UMUI, and I didn't see those options in 7.3. I didn't realise there was a hand edit I needed to change.

Thank you also for your suggestions regarding archiving with MOOSE. I made those changes as well. One question about that: when I turn on this branch is there a revision number I should specify?

 fcm:um_br/dev/jeff/VN7.3_HadGEM3-A_r2.0_hector_monsoon_archiving/src

After making these changes, I tried submitting the job but received the following message:

You have selected a compilation step and a continuation run CRUN.
This is not allowed. Please modify your UMUI settings.
For quick fix set RCF_NEW_EXEC to false in SUBMIT file

In Compilation and Modifications → Compile options for the UM Model, I changed the radio button to "Compile and build the executable, then stop". (It was set to "Compile and build the executable, then run"). I tried saving, processing, and submitting again, but received the same message.

Thanks,
Claudia

comment:4 Changed 5 years ago by luke

Hi Claudia,

You'll need to do the NRUN step first. Turn off the CRUN hand-edit and then send for compilation and run. When the NRUN step finishes you can then send off the CRUN (turn off compilation, turn on hand-edit).

From what Grenville says above you could try for 1 month per step.

You don't need to specify a revision number; you can leave it blank and it will pick up the most recent revision. If you would rather have it in then you can take a look at the branch revision log here:

https://puma.nerc.ac.uk/trac/UM/log/UM/branches/dev/jeff/VN7.3_HadGEM3-A_r2.0_hector_monsoon_archiving

which gives the last revision as r14917 10 months ago.

Thanks,
Luke

comment:5 Changed 5 years ago by annette

  • Resolution set to fixed
  • Status changed from accepted to closed

Claudia,

I assume this solved your problem so am closing the ticket.

Best regards,
Annette
