Opened 2 months ago

Last modified 2 months ago

#2083 new help

series of simulations differing by emissions but only some exceeding walltime limit

Reported by: s1374103 Owned by: um_support
Priority: high Component: UKCA
Keywords: walltime Cc:
Platform: MONSooN UM Version: 8.4

Description

Hi Helpdesk,

I am using vn8.4 HadGEM3 GA4.0 UKCA CheST+GLOMAP-mode RJ4.0 on monsoon (xlsjc).

I have added some new tracers, chemical reactions and emissions, and nudged the model towards ERA-Interim.

At this point, the walltime limit was set to the maximum (3 hours) and the model was exceeding this, so I changed the processors. Since then I have been carrying out loads of simulations and the wallclock has not been an issue.

A few weeks ago I decided to do a series of simulations (2 years each) where I change just the emissions. For the series of simulations I am going to describe, note that the code/diagnostics/UMUI setup is identical; the jobs just point to different ancillary files (with differences in emissions).

The first job ran for 2 years perfectly fine - xncae

However, xncak and xncal are failing because they exceed the wallclock limit.

The job is set to run in 1-month chunks, and each chunk has usually taken around 2 hours 20 minutes. So I'm wondering why it is now only getting to day 20 in 3 hours.

At first I thought it could be random, which is why I've submitted multiple jobs but they are all failing at the same time (around 20 days).

In addition to this, xncah failed with a different error which I cannot understand:

sys-108 : FATAL error closing unit 6 during program termination 
  

sys-108 : FATAL error closing unit 6 during program termination 
  

sys-5 : UNRECOVERABLE error on system request 
  Input/output error

Encountered during an I/O operation on unit 6
Fortran unit 6 is connected to a sequential formatted text file:
  "/projects/ukca-ed/jakel/xncah/pe_output/xncah.fort6.pe74"
basename: missing operand
Try `basename --help' for more information.
basename: missing operand
Try `basename --help' for more information.

ATP Stack walkback for Rank 94 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  main@flumeMain.f90:48
  um_shell_@um_shell.f90:1865
  u_model_@u_model.f90:2051
  initial_@initial.f90:2610
  initdump_@initdump.f90:3622
  set_atm_pointers_@set_atm_pointers.f90:3538
  ukca_init_@ukca_init.f90:656
  ukca_scav_init$ukca_scavenging_mod_@ukca_scavenging_mod.f90:343
  um_fort_flush_@um_fort_flush.f90:48
  __flush_8@0x1d5d87a
  _FLUSH@0x1d5d6fa
  _ferr@0x1d5d2b6
  _fcleanup@0x1d8970c
  abort@abort.c:92
  raise@pt-raise.c:42

Are there perhaps some differences between my jobs that I haven't noticed?

Should I change the processors again? I don't want to do this as this would mean having to repeat xncae for consistency among these emissions experiments.

Please find the .leave files in

/projects/ukca-ed/jakel/output/stored_output

Any ideas would be greatly appreciated.

Regards,

Jamie

Change History (1)

comment:1 Changed 2 months ago by luke

Hi Jamie,

I'm not sure about the xncah error. If it isn't repeatable, it could be caused by a number of things. What seems to have happened is that processor number 74 couldn't write to the standard output stream, which is Fortran unit 6 and is sent to the jobid.fort6.peNN files. The pe00 file is what becomes the output in the .leave file, but this is from just one of the many processors used in the simulation. If this happens again, please raise a ticket.

Are the jobs that used to take 2 hours 20 minutes nudged or free-running? Nudging generally adds time, so if this is new then that could explain things. Also, increasing processors can make things faster, but not always massively faster, see e.g.

http://www.ukca.ac.uk/wiki/index.php/Vn8.4_GA4.0_Release_Candidate:_RC6.0#Scaling_.28MONSooN.29

Sometimes increasing the number of processors will make things run slower. In the first instance I would suggest changing the run length from 1 month to something like 15-20 days (making sure it's a multiple of the dump period). However, this will change the results because of the way the UKCA solver is initialised (i.e. two otherwise identical jobs, one run in 10-day steps and one in 30-day steps, would give different results, although scientifically they should be the same).
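As a rough guide to picking the shorter run length, here is a minimal Python sketch of the arithmetic. It assumes the throughput stays at the roughly 20 model days per 3 hours seen in the failing jobs. The 5-day dump period and the 10% safety margin are illustrative assumptions, not values taken from these jobs, and `max_chunk_days` is a hypothetical helper name.

```python
def max_chunk_days(days_done, hours_used, walltime_hours,
                   dump_period_days, usable_fraction=0.9):
    """Longest run length (in model days) that is a multiple of the dump
    period and should fit inside the walltime limit.

    usable_fraction leaves headroom below the limit (assumed 90% here).
    """
    hours_per_day = hours_used / days_done            # observed cost of one model day
    budget_days = (walltime_hours * usable_fraction) / hours_per_day
    # Round down to a whole number of dump periods.
    return int(budget_days // dump_period_days) * dump_period_days


# Figures from the ticket: about 20 model days completed in the 3-hour limit.
# The 5-day dump period is an assumption for illustration only.
print(max_chunk_days(days_done=20, hours_used=3.0,
                     walltime_hours=3.0, dump_period_days=5))  # prints 15
```

With these numbers a 20-day chunk would use the whole 3 hours, so 15 days is the longest length that leaves any headroom. Substitute the real dump period before relying on the result.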

Would you be able to explain a bit more about the differences between the jobs?
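One way to check for unnoticed differences is to compare the processed job directories of a working job and a failing one. The following is a minimal sketch using Python's standard `filecmp` module. The two directory paths are given on the command line and are hypothetical, since where the job files live depends on your setup.

```python
import filecmp
import sys


def job_differences(dir_a, dir_b):
    """Return (only_in_a, only_in_b, differing) lists of relative paths,
    recursing into subdirectories common to both trees."""
    only_a, only_b, differing = [], [], []

    def walk(cmp, prefix=""):
        only_a.extend(prefix + name for name in cmp.left_only)
        only_b.extend(prefix + name for name in cmp.right_only)
        # diff_files uses filecmp's default shallow comparison: files whose
        # type, size and modification time all match are assumed identical.
        differing.extend(prefix + name for name in cmp.diff_files)
        for name, sub in cmp.subdirs.items():
            walk(sub, prefix + name + "/")

    walk(filecmp.dircmp(dir_a, dir_b))
    return sorted(only_a), sorted(only_b), sorted(differing)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical paths): python compare_jobs.py <dir_xncae> <dir_xncak>
    only_a, only_b, differing = job_differences(sys.argv[1], sys.argv[2])
    print("Only in first: ", only_a)
    print("Only in second:", only_b)
    print("Differing:     ", differing)
```

If the only differing entries are the ones naming the emissions ancillary files, the setups match as intended and the slowdown is more likely to come from the emissions themselves.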

Many thanks,
Luke
