Opened 4 years ago
Closed 4 years ago
#2204 closed help (fixed)
JULES run failure on Lotus
Reported by: | charlie | Owned by: | um_support |
---|---|---|---|
Component: | JULES | Keywords: | Lotus |
Cc: | Platform: | Other | |
UM Version: |
Description (last modified by willie)
Dear Ros,
Sorry to bother you yet again, but we are going back to my JULES suite now - the good news is that it is now running, or at least it was. It ran for about half an hour and generated its first spin-up dump, but then failed. It got as far as timestep 302, but then failed with the error below. This looks to me like a machine-specific failure rather than anything going wrong with my run, but what does it mean?
Thanks,
Charlie
—-
Environment variables set for netCDF Fortran bindings in /apps/libs/netCDF/intel14/fortran/4.2/
You will also need to link your code to a compatible netCDF C library in /apps/libs/netCDF/intel14/4.3.2/
[WARN] file:imogen.nml: skip missing optional source: namelist:imogen_anlg_vals_list
[WARN] file:urban.nml: skip missing optional source: namelist:jules_urban_switches
[WARN] file:prescribed_data.nml: skip missing optional source: namelist:jules_prescribed_dataset(:)
[WARN] file:urban.nml: skip missing optional source: namelist:jules_urban2t_param
[WARN] file:ancillaries.nml: skip missing optional source: namelist:jules_crop_props
[WARN] file:ancillaries.nml: skip missing optional source: namelist:jules_irrig
[WARN] file:crop_params.nml: skip missing optional source: namelist:jules_cropparm
[WARN] file:urban.nml: skip missing optional source: namelist:urban_properties
[WARN] file:imogen.nml: skip missing optional source: namelist:imogen_run_list
[WARNING] required_vars_for_configuration: RFM river prognostics will be initialised to zero.
[WARNING] init_ic: Provided variable 'rgrain' is not required, so will be ignored
mpirun: propagating signal 12
User defined signal 2
MPI Application rank 0 killed before MPI_Finalize() with signal 12
Received signal ERR
cylc (scheduler - 2017-06-13T10:48:13Z): CRITICAL Task job script received signal ERR at 2017-06-13T10:48:13Z
cylc (scheduler - 2017-06-13T10:48:13Z): CRITICAL failed at 2017-06-13T10:48:13Z
(This is suite u-am232 running on Lotus)
Change History (8)
comment:1 Changed 4 years ago by charlie
comment:2 Changed 4 years ago by simon
Hi Charlie,
It appears that your job was killed by the queuing system after 15 minutes. Try editing /home/charlie/roses/u-am232/suite.rc to increase the -W value towards the end of the file from 00:15
Simon.
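For anyone reading a similar log, the useful line is the one reporting which signal ended the job. On Linux, signal 12 is normally SIGUSR2 (the log itself prints "User defined signal 2"), which fits the job being stopped from outside rather than crashing. The sketch below is only an illustration: the helper names `killed_by_signal` and `signal_name` are made up for this example, and the signal number to name mapping depends on the platform where the script runs.

```python
import re
import signal


def killed_by_signal(log_text):
    """Return the signal number from an 'MPI Application ... killed' line, or None."""
    match = re.search(r"killed before MPI_Finalize\(\) with signal (\d+)", log_text)
    return int(match.group(1)) if match else None


def signal_name(number):
    """Look up the local name of a signal number (platform dependent)."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return "unknown"


# Line copied from the job output in this ticket.
log = "MPI Application rank 0 killed before MPI_Finalize() with signal 12"
number = killed_by_signal(log)
print(number, signal_name(number))
```

If the name comes back as SIGUSR2 and the job stopped close to the requested -W limit, the wall-clock limit is the first thing to check.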
comment:3 Changed 4 years ago by charlie
Thanks very much Simon. What should I increase this to? And am I indeed using the right queue? At present, I'm using the par-multi queue which, according to the documentation at http://help.ceda.ac.uk/article/274-lotus-queues, is a medium priority queue with a maximum runtime of 48 hours. Is this the right one to use?
My JULES suite is currently set to run for 10 years which, when I was doing it on our own machines here, used to take about 2-3 days of real-time. So what should I change in my suite.rc to enable it to run for this long?
Charlie
comment:4 Changed 4 years ago by simon
A quick back-of-the-envelope calculation gives a 3 day run time for a 10 year job at the rate of 300 timesteps per 15 minutes. This obviously won't fit into your current queue. As you appear to be running the serial version of JULES, I'd recommend the long-serial queue, which is the only one to allow for >48 hours. I'd ask for 80 hours to be safe.
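The estimate above can be reproduced with a few lines of Python. This is a rough sketch, not a statement about the suite's actual configuration: the one-hour model timestep is an assumption (it is the value that makes 300 timesteps per 15 minutes come out at about 3 days), the function name is hypothetical, and spin-up cycles and leap days are ignored.

```python
def estimate_wallclock_hours(sim_years, timestep_seconds, steps_done, minutes_elapsed):
    """Estimate wall-clock hours for a run from an observed timestep rate."""
    # Total model timesteps, using 365-day years (leap days ignored).
    total_steps = sim_years * 365 * 24 * 3600 / timestep_seconds
    # Observed throughput in timesteps per wall-clock minute.
    steps_per_minute = steps_done / minutes_elapsed
    return total_steps / steps_per_minute / 60


# Assumed: 10 model years, 3600 s timestep, 300 timesteps in 15 minutes.
hours = estimate_wallclock_hours(10, 3600, 300, 15)
print(f"{hours:.0f} hours, about {hours / 24:.1f} days")
```

With these assumed inputs the result is 73 hours, so an 80 hour request leaves only a modest margin if the timestep rate drops.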
Change the -q value to long-serial and the -W to 80:00 at the end of your suite.rc
Btw, it's always a good idea to quit completely from rosie/rose edit and restart after making any changes to your suite.rc
This is the lowest priority queue, however, so you may have to wait a bit longer for the job to be submitted.
Simon.
comment:5 Changed 4 years ago by charlie
Thanks Simon, that's excellent.
Do you know roughly what the average waiting time is for this queue, just so I have a rough idea?
Also, whilst I remember - do you know what happens about storage on JASMIN? At the moment, I have my JULES output going to my home directory on JASMIN, so is this the correct location and how much space do I have here? If it's clearly not going to be enough, what should I do? I already have applied for access to the JULES workspace, so should my output be going here instead?
Charlie
comment:6 Changed 4 years ago by simon
Hi,
I'm afraid I don't know anything about JASMIN beyond what I read on their help pages. I think your questions may be better answered by JASMIN support. But, yes, I suspect your home space will be limited, and that you should use the JULES workspace for your output.
Simon
comment:7 Changed 4 years ago by charlie
Thanks Simon, will do. Thanks for all your help.
comment:8 Changed 4 years ago by willie
- Description modified (diff)
- Resolution set to fixed
- Status changed from new to closed
Dear all,
Further to this (which I originally sent to Ros several days ago, but I gather she is away this week): I have already contacted people at CEH, and they don't know the answer. They think it is a machine-specific error, rather than anything to do with my suite. I have also contacted JASMIN support. The trouble is, they may not necessarily have much knowledge about JULES, so might say it's a JULES error!
Please can someone help?
As Ros says at the bottom of her message, my suite is u-am232 and I am submitting it to JASMIN/Lotus from PUMA.
Charlie