Opened 2 years ago

Closed 22 months ago

#2204 closed help (fixed)

JULES run failure on Lotus

Reported by: charlie
Owned by: um_support
Component: JULES
Keywords: Lotus
Cc:
Platform: Other
UM Version:

Description (last modified by willie)

Dear Ros,

Sorry to bother you yet again, but we are going back to my JULES suite now. The good news is that it is now running, or at least it was. It ran for about half an hour and generated its first spin-up dump, then got as far as timestep 302 before failing with the error below. This looks to me like a machine-specific failure rather than anything going wrong with my run, but what does it mean?

Thanks,

Charlie

----

Environment variables set for netCDF Fortran bindings in
  /apps/libs/netCDF/intel14/fortran/4.2/
You will also need to link your code to a compatible netCDF C library in
  /apps/libs/netCDF/intel14/4.3.2/
[WARN] file:imogen.nml: skip missing optional source: namelist:imogen_anlg_vals_list
[WARN] file:urban.nml: skip missing optional source: namelist:jules_urban_switches
[WARN] file:prescribed_data.nml: skip missing optional source: namelist:jules_prescribed_dataset(:)
[WARN] file:urban.nml: skip missing optional source: namelist:jules_urban2t_param
[WARN] file:ancillaries.nml: skip missing optional source: namelist:jules_crop_props
[WARN] file:ancillaries.nml: skip missing optional source: namelist:jules_irrig
[WARN] file:crop_params.nml: skip missing optional source: namelist:jules_cropparm
[WARN] file:urban.nml: skip missing optional source: namelist:urban_properties
[WARN] file:imogen.nml: skip missing optional source: namelist:imogen_run_list
[WARNING] required_vars_for_configuration: RFM river prognostics will be initialised to zero.
[WARNING] init_ic: Provided variable 'rgrain' is not required, so will be ignored
mpirun: propagating signal 12
User defined signal 2
MPI Application rank 0 killed before MPI_Finalize() with signal 12
Received signal ERR
cylc (scheduler - 2017-06-13T10:48:13Z): CRITICAL Task job script received signal ERR at 2017-06-13T10:48:13Z
cylc (scheduler - 2017-06-13T10:48:13Z): CRITICAL failed at 2017-06-13T10:48:13Z

(This is suite u-am232 running on Lotus)

Change History (8)

comment:1 Changed 2 years ago by charlie

Dear all,

Further to this (which I originally sent to Ros several days ago, but I gather she is away this week): I have already contacted people at CEH, and they don't know the answer. They think it is a machine-specific error, rather than anything to do with my suite. I have also contacted JASMIN support. The trouble is, they may not necessarily have much knowledge about JULES, so might say it's a JULES error!

Please can someone help?

As noted at the bottom of my original message, my suite is u-am232 and I am submitting it to JASMIN/Lotus from PUMA.

Charlie

comment:2 Changed 2 years ago by simon

Hi Charlie,

It appears that your job was killed by the queuing system after 15 minutes. Try editing /home/charlie/roses/u-am232/suite.rc to increase the -W value towards the end of the file from 00:15

Simon.

comment:3 Changed 2 years ago by charlie

Thanks very much Simon. What should I increase this to? And am I indeed using the right queue? At present, I'm using the par-multi queue which, according to the documentation at http://help.ceda.ac.uk/article/274-lotus-queues, is a medium priority queue with a maximum runtime of 48 hours. Is this the right one to use?

My JULES suite is currently set to run for 10 years which, when I was running it on our own machines here, used to take about 2-3 days of real time. So what should I change in my suite.rc to enable it to run for this long?

Charlie


comment:4 Changed 2 years ago by simon

A quick back-of-the-envelope calculation gives a 3-day run time for a 10-year job at a rate of 300 timesteps per 15 minutes. This obviously won't fit into your current queue. As you appear to be running the serial version of JULES, I'd recommend the long-serial queue, which is the only one that allows more than 48 hours. I'd ask for 80 hours to be safe.
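For anyone wanting to redo the estimate above for a different run length, here is a minimal Python sketch. It assumes a one-hour model timestep and 365-day years, neither of which is stated in this ticket; with those assumptions the figures quoted (about 300 timesteps in 15 minutes, a 10-year run) come out at roughly 3 days. The function names are purely illustrative and are not part of JULES, Rose, or cylc.

```python
import math


def estimate_wallclock_hours(run_years, model_timestep_s, steps_done, elapsed_min):
    """Extrapolate total wall-clock hours from an observed timestep rate."""
    seconds_per_year = 365 * 24 * 3600  # assumption: 365-day years, no leap days
    total_steps = run_years * seconds_per_year / model_timestep_s
    return total_steps / steps_done * elapsed_min / 60.0


def walltime_string(hours):
    """Round up to whole hours and format as an HH:MM wall-clock limit."""
    return f"{math.ceil(hours):02d}:00"


# Assumption: a 3600 s model timestep (not stated in the ticket).
hours = estimate_wallclock_hours(
    run_years=10, model_timestep_s=3600, steps_done=300, elapsed_min=15
)
print(f"{hours:.0f} h, about {hours / 24:.1f} days")
print("suggested limit with some headroom:", walltime_string(80))
```

The estimate scales linearly with both run length and the inverse of the timestep, so a half-hourly timestep would double the wall-clock requirement; it is worth checking the real timestep in the suite before choosing a limit.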

Change the -q value to long-serial and the -W to 80:00 at the end of your suite.rc

Btw, it's always a good idea to quit completely from rosie / rose edit and restart after making any changes to your suite.rc

This is the lowest priority queue, however, so you may have to wait a bit longer for the job to be submitted.

Simon.

comment:5 Changed 2 years ago by charlie

Thanks Simon, that's excellent.

Do you know roughly what the average waiting time is for this queue, just so I have a rough idea?

Also, whilst I remember - do you know what happens about storage on JASMIN? At the moment, I have my JULES output going to my home directory on JASMIN, so is this the correct location and how much space do I have here? If it's clearly not going to be enough, what should I do? I already have applied for access to the JULES workspace, so should my output be going here instead?

Charlie

comment:6 Changed 2 years ago by simon

Hi,

I'm afraid I don't know anything about JASMIN beyond what I read on their help pages. I think your questions may be better answered by JASMIN support. But, yes, I suspect your home space will be limited, and that you should use the JULES workspace for your output.

Simon

comment:7 Changed 2 years ago by charlie

Thanks Simon, will do. Thanks for all your help.

comment:8 Changed 22 months ago by willie

  • Description modified (diff)
  • Resolution set to fixed
  • Status changed from new to closed