Opened 3 years ago

Closed 2 years ago

#1871 closed help (answered)

Run failure

Reported by: simon.tett
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: ARCHER
UM Version: 8.5

Description

Hi,

my experiment, xldsm, failed with what I think is a qsserver failure which then triggered a model failure. Or it may be the other way around.. Job output can be found at ~stett2/output/xldsm000.xldsm.d16117.t140617.leave on archer.

Was this a server failure? If so what do I do to fix it…

If not I guess it is probably my problem..

Simon

Change History (37)

comment:1 Changed 3 years ago by grenville

Simon

The problem has occurred in the archiving which has then instructed the model to stop. However, the source of the problem is not clear. Is your connection to espp1 still OK?

Grenville

comment:2 Changed 3 years ago by simon.tett

I don't think I understand the question! Do I need to make the /archive/…./experID dir?

Simon

comment:3 Changed 3 years ago by grenville

Simon

The archiving uses the post-processing machine (espp1) to do the work, so it needs ssh keys set up for password-less comms between internal ARCHER components - I'm sure you have done this - but sometimes the comms go awry. Can you successfully do this at the ARCHER command line:

ssh espp1

Please change permissions on /nerc/n02/n02/stett2 so that we can see inside.

Grenville

comment:4 Changed 3 years ago by simon.tett

Tried:
stett2@eslogin002:~> ssh espp1
Warning: Permanently added 'espp1,10.10.50.21' (ECDSA) to the list of known hosts.
Password:

so you are right there is a problem…

I've given all and group rx permission on /nerc/n02/n02/stett2, and /nerc/n02/n02/stett2/archive/xldsm looks to have been created and has files in it… so maybe it's not an archiving problem.

Simon

comment:5 Changed 3 years ago by grenville

Simon

I'd forgotten to specify the key in my previous message - it should have said this:

ssh -i ~stett2/.ssh/um_arch espp1

if this gets you to espp1 without prompting for a password, the comms are OK
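A non-interactive way to check the same thing — BatchMode makes ssh fail rather than prompt for a password, so it is safe to run from a script (the key path is just the one above):

ssh -i ~stett2/.ssh/um_arch -o BatchMode=yes espp1 true && echo "comms OK"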

Grenville

comment:6 Changed 3 years ago by simon.tett

I did a 10-day CRUN in the short Q. Archiving worked fine with no problem, and the model ran beyond the point of failure, so it looks like some kind of glitch. Archiving of files seems to be taking a long time, at around 100 seconds. Could that cause problems, as I have several streams active? I could reduce the number of streams by removing diagnostics…

Simon

comment:7 Changed 3 years ago by grenville

Simon

We too have found the ff to pp conversion to be very slow and are thinking about ways around the problem for UM 10.x jobs.

The MO moved away from archiving as the model is running, preferring to wait until it's finished and then kick off an independent job. We stuck with the concurrent archiving model because space on /work was a problem (as it is again). We'd have to investigate what happens if the model runs faster than archiving can keep up.

Grenville

comment:8 Changed 3 years ago by simon.tett

Hi Grenville,

just got another archive failure. Is there a workaround? I am converting data to PP using fcm:um_br/dev/grenville/vn8.5_archive-bigend-pp/src

I ran a similar job, xlds#j, in October and it ran fine… (I've got a few science changes and an extra diagnostic…) Has something changed on ARCHER?

Simon

comment:9 Changed 3 years ago by simon

Hi,

There may be a race condition between the control script and the archiving script. Can you set SETOPT to x in the script inserts and modifications window and rerun? This will provide more detailed output.
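The insert itself is just a shell variable assignment; the generated job scripts appear to pass its value straight to a "set -" line, which is what switches on the extra trace output:

SETOPT=x    # picked up by the job scripts as "set -x", giving a command-by-command trace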

Simon.

comment:10 Changed 3 years ago by simon.tett

OK, added. Reprocessed and resubmitted as a CRUN. I think that means we'll know in 3 days' time…

Simon

comment:11 Changed 3 years ago by simon.tett

Well, it failed on Friday the 13th… with an error:

--------------------------------------------------------------------------------
*** stett2   Job: 3674974.sdb   started: 13/05/16 10:56:08   host: mom4 ***
*** stett2   Job: 3674974.sdb   started: 13/05/16 10:56:08   host: mom4 ***
*** stett2   Job: 3674974.sdb   started: 13/05/16 10:56:08   host: mom4 ***
*** stett2   Job: 3674974.sdb   started: 13/05/16 10:56:08   host: mom4 ***

--------------------------------------------------------------------------------
/home/n02/n02/stett2/.profile[38]: module: not found [No such file or directory]
Cray PrgEnv already loaded
craype-network-aries
cce/8.3.7
cray-libsci/13.0.1
PrgEnv-cray/5.2.56
craype-ivybridge
cray-mpich/7.1.1
cray-mpich/7.2.6(29):ERROR:150: Module 'cray-mpich/7.2.6' conflicts with the currently loaded module(s) 'cray-mpich/7.1.1'
cray-mpich/7.2.6(29):ERROR:102: Tcl command execution failed: conflict cray-mpich

--------------------------------------------------------------------------------
NOTE: the perftools module has been loaded directly, without using the new
perftools-base module.

The Performance Analysis Tools will behave as expected. However, beginning with
release 6.4.0, the recommended module load order is to load the low-impact
perftools-base module that does not alter program behavior first, and then to
load the desired instrumentation module. The perftools-base module provides
access to man pages, Reveal, Cray Apprentice2, and the new instrumentation
modules. The additional modules that become available after perftools-base is
loaded are:

perftools                - full support, including pat_build and pat_report
perftools-lite           - default CrayPat-lite profile
perftools-lite-events    - CrayPat-lite event profile
perftools-lite-gpu       - CrayPat-lite gpu kernel and data movement
perftools-lite-loops     - CrayPat-lite loop estimates (for Reveal)
--------------------------------------------------------------------------------
*****************************************************************
     Version 8.5 template, Unified Model ,  Non-Operational
     Created by UMUI version 8.5                       
*****************************************************************
Host is mom4
PATH used = /home/y07/y07/cse/xalt/0.6.0/libexec:/home/y07/y07/cse/xalt/0.6.0/bin:/opt/cray/mpt/7.1.1/gni/bin:/opt/pbs/12.2.401.141761/bin:/opt/cray/rca/1.0.0-2.0502.57212.2.56.ari/bin:/opt/cray/pmi/5.0.7-1.0000.10678.155.25.ari/bin:/opt/cray/cce/8.3.7/cray-binutils/x86_64-unknown-linux-gnu/bin:/opt/cray/cce/8.3.7/craylibs/x86-64/bin:/opt/cray/cce/8.3.7/cftn/bin:/opt/cray/cce/8.3.7/CC/bin:/opt/cray/craype/2.4.2/bin:/opt/cray/llm/default/bin:/opt/cray/llm/default/etc:/opt/cray/xpmem/0.1-2.0502.57015.1.15.ari/bin:/opt/cray/ugni/6.0-1.0502.10245.9.9.ari/bin:/opt/cray/udreg/2.3.2-1.0502.9889.2.20.ari/bin:/opt/cray/lustre-cray_ari_s/2.5_3.0.101_0.35.1_1.0502.8640.15.1-1.0502.18911.12.4/sbin:/opt/cray/lustre-cray_ari_s/2.5_3.0.101_0.35.1_1.0502.8640.15.1-1.0502.18911.12.4/bin:/opt/cray/alps/5.2.3-2.0502.9295.14.14.ari/sbin:/opt/cray/alps/5.2.3-2.0502.9295.14.14.ari/bin:/opt/cray/sdb/1.0-1.0502.58450.3.27.ari/bin:/opt/cray/nodestat/2.2-1.0502.58998.2.7.ari/bin:/opt/modules/3.2.10.2/bin:/home/n02/n02/stett2/bin:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:.:/usr/lib/qt3/bin:/work/n02/n02/hum/vn8.5/cce/utils:/work/n02/n02/hum/bin:/work/n02/n02/hum/vn8.5/bin:/opt/cray/bin:/work/n02/n02/hum/fcm/bin:/home/n02/n02/stett2/bin:/work/n02/n02/hum/vn8.5/cce/utils:/work/n02/n02/hum/bin:/work/n02/n02/hum/vn8.5/bin:/work/n02/n02/stett2/xldsm/bin:/work/n02/n02/hum/vn8.5/cce/scripts:/work/n02/n02/hum/vn8.5/cce/exec
/var/spool/PBS/mom_priv/jobs/3674974.sdb.SC[354]: .[29]: set: -Y: unknown option
Usage: set [-sabefhkmnprtuvxBCGH] [-A name] [-o[option]] [arg ...]

So it never ran… I assume this is because I set SETOPT to Y not x…

In the spirit of failing fast to sort out problems, would it be an acceptable use of the short Q to run the job in 10-day chunks? A year would take about 10 hours that way and we'd see what the problems were quickly…

Simon

Last edited 2 years ago by ros

comment:12 Changed 3 years ago by grenville

Simon

It can be argued that the short queue is being used for development in this case, so probably acceptable. However, it might be simpler to switch off the archiving (I'm assuming that your model won't output TBs of data) and manage the data manually until we can implement the newer MO post processing model?

We have recently sped up ff2pp quite significantly - byte-swapping is still relatively expensive though. If there is a race, abandoning byte-swapping may help too.

Grenville

comment:13 Changed 3 years ago by simon.tett

Hi Grenville,

I was hoping to have the archiving and conversion to PP done as part of the run, so that the scientific analysis can be done easily. I think I could reduce my diagnostic list — would that help? And if so, what kind of factor should I look for?

Perhaps a more strategic question — is it possible to convert fields files to netCDF? Mike Mineter is investigating doing this in Edinburgh with HadCM3. Given that we have many more and better tools for working with netCDF, that might be a better investment…

Simon

comment:14 Changed 3 years ago by grenville

Simon

CF-python (David Hassell's software) converts fields files to CF-netCDF — it is available on the post processors at ARCHER and on the analysis platform installed on the JASMIN VM on the RDF analytics cluster.

I'm still not sure what the problem is with archiving, so I am not sure whether reducing STASH will help.

Do you need big-endian PP?

Grenville

comment:15 Changed 3 years ago by simon.tett

Hi Grenville,

would a more reliable approach be just to archive the data but not convert it — i.e. copy the fields files to the archive and then later run the conversion to netCDF?

To just archive I think that I turn on:
fcm:um_br/dev/jeff/vn8.5_hector_monsoon_archiving/src
and turn off:
fcm:um_br/dev/grenville/vn8.5_archive-bigend-pp/src

Is this correct?

My experience was that it was a pain to post-process the data to big-endian PP for analysis afterwards, so it's easier to have it run alongside the model. But better to actually get the model to run :-)

How would I convert fields files to netCDF? I'm not familiar enough with CF-Python to know…

Simon

comment:16 Changed 3 years ago by simon.tett

Hi Grenville,

so following our email exchange… is your advice to go back to the standard archiving and then use CF-Convert to make netCDF data?

Simon

comment:17 Changed 3 years ago by grenville

Simon

For now, that's a good way to get netCDF. There should be no need to change the archiving branch - just set FF2PP_HECTOR to "N".
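A minimal sketch of that switch, assuming it is exposed as an ordinary job environment variable (exactly where it is set in the UMUI may differ):

export FF2PP_HECTOR=N    # keep archiving on, skip the fields-file to big-endian PP conversion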

Grenville

comment:18 Changed 3 years ago by simon.tett

Hi Grenville,

I did a three-month run in 10-day segments in the short Q. I turned off PP conversion and everything seems to have worked smoothly. I have lots of output!

I used cfa interactively to convert output fields files on the archive to netCDF4 files. Conversion time was around 2 minutes per file, so not much different from PP conversion.

I think being able to convert to 32 bits will help as it will reduce data storage needs. cfa produces a lot of (to me) non-informative output. Can this be reduced?

I think I will reduce my diagnostic needs — I basically have the standard set plus some…

Simon

comment:19 Changed 3 years ago by david

Hi Simon,

Conversion time was around 2 minutes per file

How big is each file? You are saving the PP→netCDF stage, of course ….

cfa produces a lot of (to me) non informative output. Can this be reduced?

Can you post the command you are running and a snippet of the STDOUT?

I'll fast-track the 64→32 bit functionality, and add compression, to cfa (it's in the Python API, but not yet as a command-line option). 64→32 bit will be very quick, compression less so …

David

comment:20 follow-up: Changed 3 years ago by simon.tett

Hi David,

I suspect 64 to 32 bit will be the most useful; we can live without compression. I suspect our normal mode will be to run cfa on the individual files, then when the run is done use cdo or nco to select out particular diagnostics and put them into one file per diagnostic. We can then compress the netCDF files and delete the fields files…

Command is cfa -f NETCDF4 fieldfile.

The output I get is like so:
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
85000.0 None
85000.0
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993
17699.0995 None
17699.0995
17699.0995 None
17699.0995
36.666671 None
36.666671
19120.291 None
19120.291
19120.291 None
19120.291
19120.291 None
19120.291
19120.291 None
19120.291
87949.993 None
87949.993
87949.993 None
87949.993
87949.993 None
87949.993

comment:21 in reply to: ↑ 20 Changed 3 years ago by david

Replying to simon.tett:

Hi Simon,

Apologies for the 87949.993 None - a rogue debug print statement which I shall get rid of.

suspect 64 to 32 bit will be the most useful. Can live without compression. I suspect our normal mode will be to run cfa on the individual files. Then when run is done use cdo or nco to select out particular diagnostics and put them into one file per diagnostic. We can then compress the netcdf files and delete the field files…

I have implemented an option to write out in 32-bit precision: cfa <options> --single input_file(s). It'll be on the RDF soon.

Why not use cf-python in a script to split the files up with no extra IO? Something like:

import cf
f = cf.read('file')
for g in f:
    sn = g.standard_name
    outfile = 'file_'+sn+'.nc'
    cf.write(g, outfile, single=True) # single=True => 64->32 bit

or for a subset of standard names:

import cf
f = cf.read('file')
for sn in subset_of_standard_names:
    g = f.select(sn)
    outfile = 'file_'+sn+'.nc'
    cf.write(g, outfile, single=True) 

You can also select by STASH code:

import cf
f = cf.read('file')
for g in f:
    stash_code = g.getprop('stash_code')
    outfile = 'file_'+stash_code+'.nc'
    cf.write(g, outfile, single=True)

or for a subset of STASH codes:

import cf
f = cf.read('file')
for stash_code in subset_of_stash_codes:
    g = f.select({'stash_code': stash_code})
    outfile = 'file_'+stash_code+'.nc'
    cf.write(g, outfile, single=True)

Bar the use of "single", all of these are already possible.

Hope that helps - I'll let you know when the changes are in place (no silly print output, single precision option).

David

comment:22 Changed 3 years ago by simon.tett

Hi David,

that is looking good! Does cf.write append? Or would we be best doing:

import cf
f = cf.read(files) # files is a list of files to process
for g in f:
    # etc., one output file per field, as in your examples above

Simon

comment:23 Changed 3 years ago by david

Does cf.write append?

It ought to, but needs a bit of attention. I'm looking into it.

David

comment:24 Changed 3 years ago by simon.tett

After 6 days in the Q the model ran and then failed after 7 simulated months… :-(

Looking at the output: ~stett2/output/xldsm000.xldsm.d16138.t124920.leave
the failure looks to be because ssh to espp1 failed (line 68731 of the output).

I see there are other failures from qshector_arch which I think are because MAILMSG is not defined.

So five questions:
1) Why is the ssh failing (which seems to be running cp… )
2) Why use ssh? /nerc/../archive seems to be visible from the login node. Is it invisible from the worker nodes?
3) If archiving fails should the script pause 5 minutes and try again?
4) Why was my job in the Q for so long?
5) How should I define MAILMSG — it is not defined in qshector_arch nor, as far as I can tell from the documentation, anywhere else, though it is used in lots of files…

Simon

comment:25 Changed 3 years ago by grenville

1) Why is the ssh failing (which seems to be running cp… )

This is one for ARCHER - I'll follow up.

2) Why use ssh? /nerc/../archive seems to be visible from the login node. Is it invisible from the worker nodes?

Jobs are launched and controlled from MOM nodes - ARCHER have strict usage limitations on MOM nodes hence our use of espp1.

3) If archiving fails should the script pause 5 minutes and try again?

I am not aware of this ever being the case

4) Why was my job in the Q for so long?

This is simply down to load on the machine

5) How should I define MAILMSG — it is not defined in qshector_arch nor, as far as I can tell from the documentation, anywhere else, though it is used in lots of files…

It appears to be defined for MONSooN archiving but not fully implemented. I don't see anything of use being sent to MAILMSG.

Grenville

comment:26 Changed 3 years ago by simon.tett

Hi Grenville,

thanks… Long queues and archiving failures are painful… My colleague Debbie, who is running HadGEM3@N216, doesn't seem to be queuing as long as I am.

My suggestion about trying again after 5 minutes is that, if this is just a connection glitch, then retrying would likely work…

Looking at the output, I think that if MAILMSG is not defined then you get no output telling you why the archive failed (unless you have SETOPT set to x). Should I add it to my UM job env variables as something like $DATAM/mailmsg.txt?

Simon

comment:27 Changed 3 years ago by grenville

Simon

ARCHER favours jobs with large node requirements - this policy has been questioned but I'm doubtful that it will change. There is an alternative archiving strategy, now adopted at the MO and implemented by us in a very limited number of jobs, which waits until a job has finished a cycle and then processes the data in a serially queued job - i.e. no concurrent data management. That might be a solution, but it would need some work to implement and test in your job.

Setting MAILMSG in your environment should work; however, MAILMSG appears to hold only messages like "Archiving failed" or "failed to archive", so it may be of little help for debugging.
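If you do set it, a minimal sketch along the lines suggested above (the path is illustrative only, and whether the scripts treat MAILMSG as a file name or a message string isn't entirely clear):

export MAILMSG=$DATAM/mailmsg.txt    # illustrative path; $DATAM as in the UM job environment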

Grenville

comment:28 Changed 3 years ago by simon.tett

Hi Grenville,

This makes ARCHER a not-so-useful machine for atmospheric modelling, where lots of independent jobs are a more natural model… If I want to modify the scripts, what fcm magic do I need? I think it would make sense to modify the current archiving script to sleep for a minute and then try again; if that fails, give up…
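A minimal sketch of the kind of retry wrapper being proposed here — archive_copy is a stand-in for whichever ssh/cp command in the archiving script actually fails, and the retry count and sleep are illustrative:

attempts=0
until archive_copy
do
  attempts=$((attempts+1))
  if [ $attempts -ge 3 ]; then
    echo "archive copy still failing after $attempts attempts" >&2
    exit 1
  fi
  sleep 60   # pause for a minute before retrying
done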

Having MAILMSG set will mean I get some output when archiving fails…

Simon

comment:29 Changed 3 years ago by simon.tett

Hi Grenville,

any update on this?

Simon

comment:30 Changed 3 years ago by grenville

Simon

It's probably easiest in the first instance simply to modify the script directly in the xldsm job /bin directory (on /work) - if you're happy with your change, create a branch as you would for a code change and include it in "Use central script modifications" in the FCM section of the UMUI.

Grenville

comment:31 Changed 3 years ago by simon.tett

Hi Grenville,

any response from ARCHER on why the ssh failed?

Simon

comment:32 Changed 3 years ago by grenville

Simon

Regrettably not - ARCHER do not keep logs on MOM-espp1 connections.

Grenville

comment:33 Changed 3 years ago by simon.tett

I hacked the script, but when I tried to run in the short Q I got a failure in the serial Q (build?) job:

--------------------------------------------------------------------------------
*** stett2   Job: 3710667.sdb   started: 26/05/16 12:32:53   host: esPP001 ***
*** stett2   Job: 3710667.sdb   started: 26/05/16 12:32:53   host: esPP001 ***
*** stett2   Job: 3710667.sdb   started: 26/05/16 12:32:53   host: esPP001 ***
*** stett2   Job: 3710667.sdb   started: 26/05/16 12:32:53   host: esPP001 ***

--------------------------------------------------------------------------------
/home/n02/n02/stett2/.profile[49]: .: /work/n02/n02/hum/vn8.5/cce/scripts/.umsetvars_8.5: cannot open [No such file or directory]
/home/n02/n02/stett2/umui_runs/xldsm-147122828/umuisubmit_compile[41]: .: /work/n02/n02/hum/bin/loadcomp: cannot open [No such file or directory]
--------------------------------------------------------------------------------

Resources requested: ncpus=1,place=free,walltime=04:00:00
Resources allocated: cpupercent=0,cput=00:00:00,mem=0kb,ncpus=1,vmem=0kb,walltime=00:00:02

*** stett2   Job: 3710667.sdb   ended: 26/05/16 12:32:55   queue: serial ***
*** stett2   Job: 3710667.sdb   ended: 26/05/16 12:32:55   queue: serial ***
*** stett2   Job: 3710667.sdb   ended: 26/05/16 12:32:55   queue: serial ***
*** stett2   Job: 3710667.sdb   ended: 26/05/16 12:32:55   queue: serial ***
--------------------------------------------------------------------------------

I then qsubbed the run script, which worked after I fixed bugs in the modified qshector_arch script. I now have it running in the short Q…


Last edited 2 years ago by ros

comment:34 Changed 3 years ago by simon.tett

And, continuing to abuse the short Q… I got an ssh failure, but when the archive script tried again it worked. I think ssh to espp1 is a bit unreliable, particularly for files around ½ GB or so in size.

To see how I do it, see (on ARCHER) ~stett2/qxhector_arch. Would it make sense to merge this into your archive script so others can benefit from it? Or am I the only one having trouble with archiving?

Simon

comment:35 Changed 3 years ago by simon.tett

Still getting occasional ssh failures (and abusing the short Q), so multiple tries at ssh are important!
Simon

comment:36 Changed 3 years ago by simon.tett

Continuing my saga. I think there is some problem with the disk system on ARCHER.
I have been getting timeout failures in the short Q as copies of files from work to archive are taking almost 15 minutes….

Simon

comment:37 Changed 2 years ago by ros

  • Resolution set to answered
  • Status changed from new to closed

For a workaround to stop ssh failures in archiving, see #2020.
