Opened 6 days ago

Last modified 5 days ago

#2148 new help

UM v7.3 ARCHER job giving non-specific failure in qsexecute after 50 days (xncwd: Nitrate-extended UM-UKCA job)

Reported by: gmann
Owned by: um_support
Priority: normal
Component: UM Model
Keywords: ukca
Cc: ee10hp, mdalvi, nbellouin, grenville
Platform: ARCHER
UM Version: 7.3


Dear NCAS-CMS helpdesk,

I am running v7.3 UM simulations with an enhanced UKCA setup (includes
"nitrate-extended" version of GLOMAP with extra transported aerosol tracers).

I have been evolving the setup of this v7.3 "nitrate-extended UKCA" in
preparation for several "aerosol hindcast" model experiments that PhD
student Hana Pearce (Leeds) has designed to understand how representing
semi-volatile aerosol (which partitions into and out of the particle
phase) affects how composition-climate models predict the way aerosol
radiative forcings have played out over recent decades.

In particular, I have now got the "RADAER-coupled" configuration of UM-UKCA
working on ARCHER (with considerable help from Grenville Lister and Mohit

That seemed to be working fine, as it ran a month with this setup and the
model-derived aerosol extinction profiles etc. that Hana wants to investigate
in relation to this issue are clearly appearing correctly within the first
monthly mean of the simulation.

However, the model is crashing part-way through the 2nd month (just after
day 50, around timestep 3610) with no specific error message; it just
gives the non-specific message "qsmaster: Failed in qsexecute in model xncwd".
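
For reference, the timestep count can be converted to elapsed model time with a short calculation. The sketch below is Python and assumes 72 timesteps per model day (a 20-minute timestep). That value is not stated in this ticket; it is inferred only from timestep 3610 falling just after day 50, so it should be checked against the job's actual timestep setting. The function name is hypothetical.

```python
def timestep_to_elapsed(timestep, steps_per_day=72):
    """Convert a timestep count to (completed days, hours into the next day).

    steps_per_day=72 is an ASSUMPTION (20-minute timestep), inferred from
    the numbers quoted in the ticket rather than read from the job setup.
    """
    days, remainder = divmod(timestep, steps_per_day)
    hours = remainder * 24.0 / steps_per_day
    return days, hours


if __name__ == "__main__":
    days, hours = timestep_to_elapsed(3610)
    # Under the 72-steps-per-day assumption this gives 50 completed days
    # plus roughly 3.3 hours into day 51.
    print(f"{days} completed days + {hours:.1f} h")
```

If the job uses a different timestep, pass the matching `steps_per_day` value instead of relying on the default.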

Initially when I got this crash I was assuming this was something to do with
the automatic post-processing.

The simulations are running with nudging (to ERA-interim winds & temperatures)
and as such they output a dump each day — which creates rather a lot of data.

The initial simulations of xncwd I ran had automatic post-processing switched
on with "delete superseded dumps" selected.

We've had a lot of problems with seemingly random "qsserver" crashes
occurring for UKCA simulations that were set to archive to /nerc on ARCHER
and despite raising with the NCAS-CMS helpdesk this has proven to be a
difficult problem that still has not been fixed (to my knowledge).

As a consequence, we have been running all our simulations with automatic
post-processing switched off (as advised by NCAS-CMS) and we use our own
scripts to move the data over to the /nerc archive as the runs go along,
and usually have to keep re-submitting CRUN chunks as we move the data.

Anyway — the reason I explain that is because I'm pretty sure that this
"qsexecute" error is indicating some similar "system issue" (rather than
a problem with the UKCA module's code/settings) and my initial thought
was that the issue was potentially something to do with this automatic
post-processing of the dumps.

I tried (earlier today) re-running the xncwd job with the automatic
post-processing switched off but that proceeded to crash with the
same error (no change in behaviour as far as I could tell).

I moved the original run with the auto post-processing switch ON to:


with it running as specified in:


The log file from that original job is this one:


The re-run simulation with the auto post-processing switched OFF is at:


with it running as specified in:


The log file from that job with the post-processing OFF is this one:


Perhaps it is not the automatic post-processing but some other
aspect of the simulations that is causing this error?

It's strange though because it runs the first 50 days fine and
only has this problem after that time.

That seemed consistent with the automatic post-processing only
doing the deletion of daily dumps at about that time but then
maybe it does this as it's going along?

Is there some other aspect of the model that would "kick-in" at
this 1-month-and-a-half of runtime that might cause this crash?

Another possibility I considered was that it was caused by the
large amount of STASH requested in the job — but we have run
long simulations already with this being requested and although
it slows the model down considerably, it is part of Hana's research
to investigate how the aerosol profiles evolve through the day as
the gas-particle partitioning and photochemistry vary strongly
with changes in daylight and temperature.

The xncwe runs included additional hourly output streams (to UPJ)
so that this extra info on the aerosol was output in profiles at
selected gridboxes (ground site locations) and in the full 3D
domain regionally (over Western Europe) to enable post-processing
and comparing to aircraft measurements through the EUCAARI field
campaign in 2008.

In xncwd I removed all those hourly STASH requests but still the
model crashes so I don't think it is that.

One other possibility I thought of was that the additional diagnostics
for AOD in each mode and extinction at 550nm and 1020nm requested in
the job could have tipped the requests "over the edge" in some way
and caused the model to stop?

But then the error message is not a seg-fault so I don't think that
is the case — it's just a qsexecute/qsmaster error which to me
indicates some problem with something associated with the aspects
of the model that are carried out in those scripts (rather than
something enacted within the FORTRAN code itself).

Please can you have a look at the log files and scripts/executables
to see if you can see what is causing the model runs to fail at
this 50-day point in the simulations.

Many thanks for your help,


Change History (2)

comment:1 Changed 5 days ago by mdalvi

Hi Graham,

The run actually fails at 6 hours into the 51st day, so it might not be related to any background processes.

There are no clear error messages, but the .leave file seems to indicate that different PEs are at different stages in the run (some have not gone into UKCA, while some are reporting extreme values in GLOMAP). It might be useful to re-run with flushing of output buffers switched ON, in case that gives a better idea of the location, or even prints out any specific warnings for that timestep:
Atmosphere -> Section by section Choices -> Sec13: Diffusion and Filtering -> DIAG_PRN panel -> Flush print buffer if run fails
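
Once the output has been flushed, the end of the .leave file can be searched for the last few suspicious lines. The following is only a minimal Python sketch: the function name and the keyword list are assumptions chosen for illustration, not a list of messages the UM is known to emit, and matching is by plain case-insensitive substring.

```python
import re


def last_matches(lines, keywords=("error", "abort", "failed"), limit=5):
    """Return the last `limit` (line_number, text) pairs containing a keyword.

    The default keywords are ASSUMPTIONS for illustration; adjust them to
    whatever messages actually appear in the job output.
    """
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    hits = [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]
    return hits[-limit:]


if __name__ == "__main__":
    # Synthetic example input; only the qsmaster message is taken from the ticket.
    sample = [
        "informational message\n",
        "another informational message\n",
        "qsmaster: Failed in qsexecute in model xncwd\n",
    ]
    for number, text in last_matches(sample):
        print(number, text)
```

To apply it to a real log, open the file and pass the file object (or `readlines()`) as `lines`.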

comment:2 Changed 5 days ago by gmann

Hi Mohit,

Thanks for this.

Co-incidentally (presumably…) it is exactly 51 days until the snap
General Election that Theresa May has just announced this morning.

Also to note — the warnings are just informational and are just
indicative of where particle concentrations are very low and the
UM advection separately transporting the multiple chemically-interacting
mass mixing ratios within each mode can introduce some artefacts at
low number concentrations.

They are informational and expected — and can basically be ignored.

I'm pretty certain the crash is being caused by something else and is
wholly unrelated to those warning/informational messages.

Anyway, I will try re-running xncwd with the "Flush print
buffer if run fails" option selected, and also with VERBOSE=2, which will
enable me to evaluate whether my hunch is indeed correct and the aerosol
properties/size simulated in the model are proceeding OK at the point of
the crash.

I will keep you posted.

