UM v7.3 ARCHER job giving non-specific failure in qsexecute after 50 days (xncwd: Nitrate-extended UM-UKCA job)
|Reported by:||gmann||Owned by:||um_support|
|Keywords:||ukca||Cc:||ee10hp, mdalvi, nbellouin, grenville|
Dear NCAS-CMS helpdesk,
I am running v7.3 UM simulations with an enhanced UKCA setup (it includes
the "nitrate-extended" version of GLOMAP with extra transported aerosol
tracers).
I have been evolving this v7.3 "nitrate-extended UKCA" setup in
preparation for several "aerosol hindcast" model experiments that PhD
student Hana Pearce (Leeds) has chosen to design, to understand how
representing semi-volatile aerosol (which partitions into and out of
the particle phase) affects how composition-climate models predict
aerosol radiative forcings to have evolved over recent decades.
In particular, I have now got the "RADAER-coupled" configuration of UM-UKCA
working on ARCHER (with considerable help from Grenville Lister and Mohit
That seemed to be working fine: it ran a month with this setup, and the
model-derived aerosol extinction profiles etc. that Hana wants to
investigate in relation to this issue are clearly appearing correctly in
the first monthly mean of the simulation.
However, the model is crashing part-way through the 2nd month (just after
day 50, around timestep 3610) with no particular error message. It just
gives the non-specific message "qsmaster: Failed in qsexecute in model xncwd".
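As a rough check on where in the run the crash falls, the timestep count can be converted to model days as sketched below. This assumes a 20-minute timestep (72 steps per model day), which is not stated in this ticket but is consistent with timestep 3610 falling just after day 50; the function name is made up for illustration.

```python
# Assumption (not stated in the ticket): 20-minute model timestep,
# i.e. 72 timesteps per model day.
STEPS_PER_DAY = 72


def timestep_to_day(step, steps_per_day=STEPS_PER_DAY):
    """Return (completed_days, steps_into_next_day) for a timestep count."""
    return divmod(step, steps_per_day)


days, extra = timestep_to_day(3610)
print(f"timestep 3610 = {days} days + {extra} steps")
# prints: timestep 3610 = 50 days + 10 steps
```

Under that assumption the failure is about 10 timesteps (roughly 3 hours 20 minutes of model time) after the day-50 boundary.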
Initially when I got this crash I was assuming this was something to do with
the automatic post-processing.
The simulations are running with nudging (to ERA-Interim winds & temperatures)
and as such they output a dump each day, which creates rather a lot of data.
The initial simulations of xncwd I ran had automatic post-processing switched
on with "delete superseded dumps" selected.
We've had a lot of problems with seemingly random "qsserver" crashes
occurring for UKCA simulations that were set to archive to /nerc on ARCHER
and despite raising with the NCAS-CMS helpdesk this has proven to be a
difficult problem that still has not been fixed (to my knowledge).
As a consequence, we have been running all our simulations with automatic
post-processing switched off (as advised by NCAS-CMS) and we use our own
scripts to move the data over to the /nerc archive as the runs go along,
and usually have to keep re-submitting CRUN chunks as we move the data.
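For illustration only, the kind of manual archiving step described above could look something like the sketch below. This is not our actual script: the function name, the "*.dump" file pattern and the directory arguments are hypothetical placeholders, and it assumes the dump file names sort in chronological order. It keeps the newest dump in place so that a CRUN can still restart from it.

```python
import shutil
from pathlib import Path


def archive_old_dumps(run_dir, archive_dir, pattern="*.dump", keep_latest=1):
    """Move all but the newest `keep_latest` matching files to archive_dir.

    Hypothetical sketch: `pattern` is a placeholder, and file names are
    assumed to sort chronologically.
    """
    run_dir, archive_dir = Path(run_dir), Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    dumps = sorted(run_dir.glob(pattern))
    # Leave the most recent dump(s) behind for the next CRUN restart.
    to_move = dumps[:-keep_latest] if keep_latest > 0 else dumps
    moved = []
    for dump in to_move:
        shutil.move(str(dump), str(archive_dir / dump.name))
        moved.append(dump.name)
    return moved
```

It would be called between CRUN chunks with the job's data directory and the target directory on the archive as the two arguments.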
Anyway — the reason I explain that is because I'm pretty sure that this
"qsexecute" error is indicating some similar "system issue" (rather than
a problem with the UKCA module's code/settings) and my initial thought
was that the issue was potentially something to do with this automatic
post-processing of the dumps.
I tried (earlier today) re-running the xncwd job with the automatic
post-processing switched off but that proceeded to crash with the
same error (no change in behaviour as far as I could tell).
I moved the original run with the auto post-processing switch ON to:
with it running as specified in:
The log file from that original job is this one:
The re-run simulation with the auto post-processing switched OFF is at:
with it running as specified in:
The log file from that job with the post-processing OFF is this one:
Perhaps it is not the automatic post-processing but some other
aspect of the simulations that is causing this error?
It's strange though because it runs the first 50 days fine and
only has this problem after that time.
That seemed consistent with the automatic post-processing only
deleting the daily dumps at about that time, but then
maybe it deletes them as the run goes along?
Is there some other aspect of the model that would "kick-in" at
this 1-month-and-a-half of runtime that might cause this crash?
Another possibility I considered was that it was caused by the
large amount of STASH requested in the job. However, we have already
run long simulations with this STASH requested, and although
it slows the model down considerably, it is needed for Hana's research,
which investigates how the aerosol profiles evolve through the day as
the gas-particle partitioning and photochemistry vary strongly
with changes in daylight and temperature.
The xncwe runs included additional hourly output streams (to UPJ)
so that this extra info on the aerosol was output in profiles at
selected gridboxes (ground site locations) and in the full 3D
domain regionally (over Western Europe) to enable post-processing
and comparing to aircraft measurements through the EUCAARI field
campaign in 2008.
In xncwd I removed all those hourly STASH requests but still the
model crashes so I don't think it is that.
One other possibility I thought of was that the additional
diagnostics for AOD in each mode and extinction at 550 nm and 1020 nm
requested in the job could have tipped the requests "over
the edge" in some way and caused the model to stop?
But then the error message is not a seg-fault so I don't think that
is the case. It is just a qsexecute/qsmaster error, which to me
indicates a problem in the parts of the model run that are
handled by those scripts (rather than
something enacted within the FORTRAN code itself).
Please can you have a look at the log files and scripts/executables
to see if you can see what is causing the model runs to fail at
this 50-day point in the simulations.
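In case it helps with looking through the log files, a minimal sketch for pulling out the lines just before the failure message is given below. The marker text is the message quoted above; the function name and the number of context lines are arbitrary choices for illustration, and the log path has to be supplied by whoever runs it.

```python
from pathlib import Path

MARKER = "Failed in qsexecute"  # text of the message quoted in this ticket


def context_before_failure(log_path, marker=MARKER, n_lines=20):
    """Return the first line containing `marker` plus up to `n_lines` before it.

    Returns an empty list if the marker does not appear in the file.
    """
    lines = Path(log_path).read_text(errors="replace").splitlines()
    for i, line in enumerate(lines):
        if marker in line:
            return lines[max(0, i - n_lines):i + 1]
    return []
```

Comparing this context between the two log files (post-processing ON and OFF) might show whether both runs stop at the same point.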
Many thanks for your help,