Opened 4 months ago

Last modified 6 weeks ago

#2148 pending help

UM v7.3 ARCHER job giving non-specific failure in qsexecute after 50 days (xncwd: Nitrate-extended UM-UKCA job)

Reported by: gmann
Owned by: um_support
Priority: normal
Component: UM Model
Keywords: ukca
Cc: ee10hp, mdalvi, nbellouin, grenville
Platform: ARCHER
UM Version: 7.3

Description

Dear NCAS-CMS helpdesk,

I am running v7.3 UM simulations with an enhanced UKCA setup (includes
"nitrate-extended" version of GLOMAP with extra transported aerosol tracers).

I have been evolving the setup of this v7.3 "nitrate-extended UKCA" in
preparation for several "aerosol hindcast" experiments that PhD student
Hana Pearce (Leeds) is designing to understand how representing
semi-volatile aerosol (which partitions into and out of the particle
phase) affects composition-climate model predictions of how aerosol
radiative forcings have played out over recent decades.

In particular, I have now got the "RADAER-coupled" configuration of UM-UKCA
working on ARCHER (with considerable help from Grenville Lister and Mohit
Dalvi).

That seemed to be working fine: the model ran a month with this setup,
and the model-derived aerosol extinction profiles etc. that Hana wants
to investigate are clearly appearing correctly within the first
monthly mean of the simulation.

However, the model is crashing part-way through the 2nd month (just after
day 50, around timestep 3610) with no particular error message; it just
gives the non-specific "qsmaster: Failed in qsexecute in model xncwd".

When I first got this crash I assumed it was something to do with the
automatic post-processing.

The simulations are running with nudging (to ERA-Interim winds and
temperatures) and as such they output a dump each day, which creates
rather a lot of data.

The initial simulations of xncwd I ran had automatic post-processing switched
on with "delete superseded dumps" selected.

We've had a lot of problems with seemingly random "qsserver" crashes
occurring for UKCA simulations set to archive to /nerc on ARCHER, and
despite raising this with the NCAS-CMS helpdesk it has proven a
difficult problem that, to my knowledge, has still not been fixed.

As a consequence, we have been running all our simulations with automatic
post-processing switched off (as advised by NCAS-CMS); we use our own
scripts to move the data over to the /nerc archive as the runs go along,
and usually have to keep re-submitting CRUN chunks as we move the data.
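
For context, here is a minimal sketch of the kind of mover script I mean
(the paths and the ${RUNID}a.da* dump filename pattern are illustrative
assumptions, not our actual script):

 #!/bin/bash
 # Sketch only: move completed daily dumps off /work to the /nerc
 # archive, keeping the newest dump so the next CRUN chunk can still
 # restart from it.
 RUNID=xncwd
 WORK=/work/n02/n02/gmann/um/$RUNID
 ARCH=/nerc/n02/n02/gmann/um/$RUNID
 mkdir -p "$ARCH"
 # Daily dump names sort chronologically, so "all but the last" is
 # everything except the most recent dump.
 for f in $(ls -1 "$WORK"/${RUNID}a.da* 2>/dev/null | head -n -1); do
   mv "$f" "$ARCH"/
 done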

Anyway, the reason I explain all that is because I'm pretty sure this
"qsexecute" error indicates some similar "system issue" (rather than
a problem with the UKCA module's code/settings), and my initial thought
was that the issue was potentially something to do with this automatic
post-processing of the dumps.

I tried (earlier today) re-running the xncwd job with the automatic
post-processing switched off but that proceeded to crash with the
same error (no change in behaviour as far as I could tell).

I moved the original run with the auto post-processing switch ON to:

/work/n02/n02/gmann/um/xncwd_qsexecuteFailurePPon/

with it running as specified in:

/home/n02/n02/gmann/umui_runs/xncwd-107001442

The log file from that original job is this one:

/home/n02/n02/gmann/output/

The re-run simulation with the auto post-processing switched OFF is at:

/work/n02/n02/gmann/um/xncwd/xncwd000.xncwd.d17107.t001453.leave

with it running as specified in:

/home/n02/n02/gmann/umui_runs/xncwd-107104041

The log file from that job with the post-processing OFF is this one:

/home/n02/n02/gmann/output/xncwd000.xncwd.d17107.t104052.leave

Perhaps it is not the automatic post-processing but some other
aspect of the simulations that is causing this error?

It's strange though because it runs the first 50 days fine and
only has this problem after that time.

That seemed consistent with the automatic post-processing only doing
the deletion of daily dumps at about that time, but then maybe it does
the deletion as it goes along?

Is there some other aspect of the model that would "kick in" at this
one-and-a-half-month point of the run that might cause this crash?

Another possibility I considered was that the crash was caused by the
large amount of STASH requested in the job. But we have already run
long simulations with this requested, and although it slows the model
down considerably, it is part of Hana's research to investigate how
the aerosol profiles evolve through the day as the gas-particle
partitioning and photochemistry vary strongly with changes in daylight
and temperature.

The xncwe runs included additional hourly output streams (to UPJ) so
that this extra information on the aerosol was output as profiles at
selected gridboxes (ground-site locations) and over the full 3D domain
regionally (over Western Europe), to enable post-processing and
comparison to aircraft measurements from the EUCAARI field campaign
in 2008.

In xncwd I removed all those hourly STASH requests but the model still
crashes, so I don't think it is that.

One other possibility I thought of: maybe the additional diagnostics
requested in the job for AOD in each mode and for extinction at 550nm
and 1020nm could have tipped the requests "over the edge" in some way
and caused the model to stop?

But then the error message is not a seg-fault, so I don't think that
is the case; it's just a qsexecute/qsmaster error, which to me
indicates a problem with the parts of the run carried out in those
scripts (rather than something enacted within the Fortran code itself).

Please can you have a look at the log files and scripts/executables
to see if you can find what is causing the model runs to fail at
this 50-day point in the simulations.

Many thanks for your help,

Cheers
Graham

Change History (8)

comment:1 Changed 4 months ago by mdalvi

Hi Graham,

The run actually fails at 6 hours into the 51st day, so it might not be related to any background processes.

There are no clear error messages, but the .leave file seems to indicate that different PEs are at different stages of the run (some have not gone into UKCA, while some are reporting extreme values in GLOMAP). It might be useful to re-run with flushing of output buffers switched ON, in case that gives a better idea of the location, or even prints out any specific warnings for that timestep:
Atmosphere -> Section by section Choices -> Sec13: Diffusion and Filtering -> DIAG_PRN panel -> Flush print buffer if run fails

comment:2 Changed 4 months ago by gmann

Hi Mohit,

Thanks for this.

Coincidentally (presumably…) it is exactly 51 days until the snap
General Election that Theresa May has just announced this morning.

Also to note: the warnings are informational, indicative of where
particle concentrations are very low; the UM advection separately
transporting the multiple chemically-interacting mass mixing ratios
within each mode can introduce some artefacts at low number
concentrations.

They are informational and expected, and can basically be ignored.

I'm pretty certain the crash is being caused by something else and is
wholly unrelated to those warning/informational messages.

Anyway, I will try re-running the xncwd again with "Flush print buffer
if run fails" selected, and also with VERBOSE=2, which will then let me
evaluate whether my hunch is indeed correct and the aerosol
properties/sizes simulated in the model are proceeding OK at the point
of the crash.
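
For the record, the VERBOSE change just amounts to something like this
(a sketch; exactly where it goes, the UMUI script inserts or a hand
edit of the submitted SCRIPT file, is my choice):

 # Raise the verbosity of the UM run scripts (sketch of the insert):
 export VERBOSE=2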

I will keep you posted.

Cheers
Graham

comment:3 Changed 4 months ago by willie

Hi Graham,

I have had a look at this and made some small progress. I tried a run without STASH and this failed in the same way. I tried a run with the compiler debug switch on and this failed at time step 75 (see my xnjid):

 UM ERROR (Model aborting) :
 Routine generating error: ukca_calc_drydiam
 Error code:  1
 Error message: 
 dvol or drydp <= 0

So we can say it is not STASH related and likely to be a bug in UKCA. Presumably ukca_calc_drydiam derives each mode's dry diameter from the dry volume per particle (geometrically, drydp = (6*dvol/pi)**(1/3)), so a zero or negative dry volume reaching it would trip exactly this check. This is about as far as I can get.

I did try some runs with the 8.4.1 compiler (instead of 8.3.7) and these crashed at time step 144 while writing the partial sums, but I'm not confident in this result.

Regards
Willie

comment:4 Changed 3 months ago by gmann

Hi Willy,

Thanks for this. I've tried to follow your lead there and do a repeat
run with the debug options switched on.

Accordingly I took a copy of the RADAER-coupled UKCA-nitrate job xncwd
that crashed after ~50.25 days (as reported above) and referred to your
job xnjid for what to add in to re-run with debugging options enabled.

From looking at that job xnjid I could only see that you had changed the
compiler optimisation settings from "safe" to "debug".

I was assuming that, although that setting says compiler *optimisation*,
it also activated the compiler options for debug.

You'd explained that when you re-ran the 51-day-failing job it then
failed after only 75 timesteps, and when you tried again with the
8.4.1 compiler it failed after 144 timesteps.

My run (xncwa) with compiler optimisation set to "debug" (rather than
"safe") then failed to get going at all, however, giving the peculiar
error shown below.

That stumped me for a while, but I discussed it with my PhD student Hana
Pearce at Leeds and she explained that she had raised an NCAS-CMS
helpdesk query last year specifically asking about debug options (see
helpdesk ticket 1828, "Debug options on ARCHER compiler"):

http://cms.ncas.ac.uk/ticket/1828

There she had attempted to activate the compiler debug options by using
a "compiler over-ride" file:

/home/ee10hp/overrides/archer_cce_debug_7.3_user

but had not quite got the right settings for ARCHER.

Ros had explained that she needed to use some ARCHER-specific options,
so she revised it to instead use this compiler over-ride file:

/home/ee10hp/overrides/archer_cce_debug_7.3_ros

Hana explained that when she did that the debug options were
then activated correctly on the Cray compiler (cce) on ARCHER.
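
For anyone reading this later: I haven't reproduced the contents of
that file here, but the effect (a sketch only; the exact flag set is
my assumption, and the real settings in archer_cce_debug_7.3_ros may
differ) is to have the build pass Cray (cce) debug options through to
ftn, giving compile lines along the lines of:

 # Illustrative only: -g (debug symbols), -O0 (no optimisation),
 # -R b (bounds checking), -K trap=fp (trap floating-point exceptions)
 ftn -g -O0 -R b -K trap=fp -c some_routine.f90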

I am going to try re-running just now using that over-ride.

I noticed that in fact the compiler optimisation remained at "safe"
rather than "debug", so it looks like changing that optimisation
setting is not needed after all.

I will update the post to let you know how that goes.

Thanks
Graham

fcm_internal compile:F UM__atmosphere__boundary_layer /home/n02/n02/gmann/um/xncwa/ummodel/ppsrc/UM/atmosphere/boundary_layer/btq_int.f90 btq_int.o
cd /home/n02/n02/gmann/um/xncwa/ummodel/tmp
# Start: 2017-05-04 20:23:32=> ftn -o btq_int.o -I/home/n02/n02/gmann/um/xncwa/ummodel/inc -I/home/n02/n02/gmann/um/xncwa/umbase/inc -e m -h noomp -s real64 -s integer64 -hflex_mp=intolerant -I /work/n02/n02/hum/gcom/cce/gcom3.8/archer_cce_mpp/inc -g -c /home/n02/n02/gmann/um/xncwa/ummodel/ppsrc/UM/atmosphere/boundary_layer/btq_int.f90
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s: Assembler messages:
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:26: Error: junk at end of line, first unrecognized character is `"'
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:34: Error: junk at end of line, first unrecognized character is ` '
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:45: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:46: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:47: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:48: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:49: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:50: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:51: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:56: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:60: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:61: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:62: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:63: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:64: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:74: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:75: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:76: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:77: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:78: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:79: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:80: Error: invalid character (0x9) in mnemonic
/work/n02/n02/gmann/tmp/tmp.esPP002.77397/pe_12623/vertical_diffs_1.s:81: Error: invalid character (0x9) in mnemonic


comment:5 Changed 3 months ago by willie

  • Status changed from new to pending

comment:6 Changed 6 weeks ago by grenville

Hi Graham

Is this still causing problems?

Grenville

comment:7 Changed 6 weeks ago by gmann

Hi Grenville,

If by "this" you mean the nitrate coupling to RADAER, and the operation of
the "old method" for enacting the RADAER coupling of GLOMAP to the radiative
transfer model (i.e. via the small exectuables) then the answer to that is no.

I managed to get that working well: my job xncwd worked fine once I'd had
some help from Mohit and others to work out how to compile the small
executables.

However, there continues to be a problem with the automatic archiving at v7.3.

The key difference between my crashing job xncwa and the one that worked OK
after that is that I turned off the automatic post-processing and archiving.

This is actually causing a bit of a headache for Hana Pearce with her
runs (although, it's true, that is not the only difficulty).

Basically the xncwd job (with automatic archiving switched off) became
the reference job, and that seemed to be progressing OK.

I then copied that xncwd to xnkbd, xnkbc and then to xnkbb.

Hana has been working off that xnkbb job as reference for her hindcasts.

I will check with her by normal email (with you in cc).

I think she has been having some issues with her initial runs although
I'm not sure if those were specific to the initial test job she was doing.

I'll email her to check that now.

Cheers
Graham

comment:8 Changed 6 weeks ago by grenville

Hi Graham

We have very recently fixed a long-running problem with archiving; if Hana could contact us again, we can advise.

Best

Grenville
