Opened 4 years ago

Closed 4 years ago

#1766 closed error (fixed)

UM-UKCA job that used to work now fails

Reported by: mk1812
Owned by: ros
Component: UM Model
Keywords: UKCA
Cc:
Platform: MONSooN
UM Version: 8.4

Description

Hi,

I have a UM-UKCA job (xlyta) which I created in late October and which at the time ran successfully. The job is just a copy of the UKCA Cray release job xlsjc with minor changes (fossil-fuel BC + OC emissions turned off, and some changes to STASH diagnostics). I ran this job for 1 model year and it was fine, taking 2hr 15min per month. (For instance the .leave file for the first (NRUN) month can be found at /home/mkasoa/output/xlyta000.xlyta.d15297.t154221.leave, and the other output files in /projects/ukca-imp/mkasoa/xlyta/).

As a result of some errors colleagues of mine were experiencing with UKCA jobs they are trying to set up (ticket #1765), I decided to check that my job still worked. I took an exact copy of my xlyta job (including the same start dump and date, and compiling a new executable), named xlytb. As far as I understand, this job should therefore bit-compare with the original. Instead, it crashed after 11 minutes (2 model days into the first month), apparently with an error in one of the convection routines. See the .leave file at /home/mkasoa/output/xlytb000.xlytb.d15341.t120121.leave, and the processor output files in /projects/ukca-imp/mkasoa/xlytb_20min_tstep/.

Just to see what would happen (and to replicate the steps taken by my colleague in ticket #1765) I tried reducing the model timestep from 20 minutes to 15 minutes, and resubmitted the job. This time, instead of crashing straight away, the job apparently ran for 18 model days, but then hit the 3-hour wallclock time limit and was killed. See the .leave file at /home/mkasoa/output/xlytb000.xlytb.d15341.t135638.leave, and the processor output files in /projects/ukca-imp/mkasoa/xlytb_15min_tstep/. Given that the original job (xlyta) ran 30 days in 2.25 hours, it clearly should not now be hitting the 3-hour limit after only 18 days when the timestep has been shortened by just 25%, i.e. a 33% increase in the number of timesteps (quite aside from the fact that it used to work fine with a 20 min timestep, and now doesn't…)
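As a rough check of the timing argument above, the figures quoted in the ticket can be worked through in a few lines of Python. This is only a sketch: it assumes the run time scales linearly with the number of timesteps (ignoring I/O, start-up and other fixed costs), and the function and variable names are made up for illustration.

```python
# Back-of-envelope timing check using the figures quoted in the ticket.
# Assumption: wallclock time is proportional to the number of timesteps.

MINUTES_PER_DAY = 24 * 60


def steps_per_day(timestep_minutes):
    """Number of model timesteps in one model day."""
    return MINUTES_PER_DAY / timestep_minutes


def expected_hours(days, timestep_minutes,
                   ref_hours=2.25, ref_days=30, ref_timestep_minutes=20):
    """Expected wallclock hours, scaled from the original xlyta run
    (30 model days in 2.25 hours with a 20-minute timestep)."""
    ref_steps = ref_days * steps_per_day(ref_timestep_minutes)
    hours_per_step = ref_hours / ref_steps
    return days * steps_per_day(timestep_minutes) * hours_per_step


# Going from a 20-minute to a 15-minute timestep gives 4/3 as many steps.
step_ratio = steps_per_day(15) / steps_per_day(20)

# Expected cost of 18 model days at a 15-minute timestep.
expected_18_days = expected_hours(18, 15)

# The job actually used the full 3-hour limit to get that far.
slowdown = 3.0 / expected_18_days

print(f"steps ratio (15 min vs 20 min): {step_ratio:.2f}")
print(f"expected hours for 18 days:     {expected_18_days:.2f}")
print(f"observed slowdown factor:       {slowdown:.2f}")
```

Under that linear-scaling assumption, 18 model days at a 15-minute timestep should take well under 2 hours, so reaching the 3-hour limit points to the job running substantially slower per timestep than the original, not just to the extra timesteps.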

We're completely stumped by this, but all the UKCA jobs that my colleagues are running in the group here seem to be hitting similar errors, either failing very quickly, or running much slower than they should do and hitting the wallclock limit. In this case I know that xlyta worked fine as of 25th October, and xlytb is an exact copy of it. There also don't appear to have been any updates to the UKCA release job xlsjc since I created my original xlyta from it.

Any ideas as to what's going on would be greatly appreciated!

Best,
Matt

Change History (4)

comment:1 Changed 4 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Matt,

We have put a fix into the UMUI which should solve the slow running of your jobs. Please try running your job again and let us know how you get on.

If you are not recompiling your model executable you will need to go to the UMUI window Compilation and Run options → UM Scripts Build and switch on "Enable build of UM scripts". Save, Process and Submit as usual. This only needs to be done once and can be switched off for subsequent submissions. Please also make sure that you are not specifying a revision number for the branch fcm:um-br/pkg/Config/vn8.4_ncas, so that you pick up the new changes.

Regards,
Ros.

Note for helpdesk: The fix adds an export of OMP_NUM_THREADS, calculates NTASKS_PER_NODE and specifies the -N option to the aprun command.
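To illustrate the helpdesk note, the sketch below shows one way the three pieces could fit together. It is not the actual UMUI change: the real fix lives in the UM scripts, and the formula for NTASKS_PER_NODE, the cores-per-node value, the task count and the executable name used here are all assumptions made for the example. Only the OMP_NUM_THREADS export, the NTASKS_PER_NODE name and the aprun -N option come from the note; -n and -N are the standard aprun options for the total number of processing elements and the number per node.

```python
# Illustrative sketch only, not the UMUI fix itself.
# Assumed: tasks per node = cores per node // OpenMP threads per task.

def build_aprun_command(total_tasks, omp_num_threads, cores_per_node, executable):
    """Return (env, cmd): the environment variables to export and the
    aprun command line with an explicit tasks-per-node (-N) setting."""
    if omp_num_threads < 1 or total_tasks < 1:
        raise ValueError("task and thread counts must be positive")

    # Hypothetical calculation of NTASKS_PER_NODE: leave room on each
    # node for every task's OpenMP threads, and never ask for more
    # tasks per node than there are tasks in total.
    ntasks_per_node = min(total_tasks, cores_per_node // omp_num_threads)
    if ntasks_per_node < 1:
        raise ValueError("more threads per task than cores per node")

    # OMP_NUM_THREADS must be exported so the executable sees it.
    env = {"OMP_NUM_THREADS": str(omp_num_threads)}

    cmd = [
        "aprun",
        "-n", str(total_tasks),      # total number of MPI tasks
        "-N", str(ntasks_per_node),  # MPI tasks placed on each node
        executable,
    ]
    return env, cmd


# Example values are hypothetical, not taken from the ticket.
env, cmd = build_aprun_command(
    total_tasks=64, omp_num_threads=2, cores_per_node=32,
    executable="./um_model.exe",
)
print(env)
print(" ".join(cmd))
```

Setting the thread count and tasks per node explicitly, rather than relying on defaults, keeps the placement of tasks on nodes consistent with the number of threads each task will start.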

comment:2 Changed 4 years ago by ros

  • Status changed from accepted to pending

comment:3 Changed 4 years ago by mk1812

Hi Ros,

Many thanks. I have re-tested my job and this fix has indeed resolved the problem. The job now runs successfully exactly as before, and produces identical output with a 20 min timestep.

Thank you for looking into this for us.

Best,
Matt

comment:4 Changed 4 years ago by ros

  • Resolution set to fixed
  • Status changed from pending to closed

That's great. Thanks for letting us know.

Cheers,
Ros.
