Opened 3 years ago

Closed 3 years ago

#1776 closed help (answered)

Job stops midway through the run and does not archive properly

Reported by: dilshadshawki
Owned by: annette
Component: UM Model
Keywords: archiving, ocean, dump
Cc:
Platform: MONSooN
UM Version: 8.2

Description

Hi Helpdesk,

I am running a job, xlzch, which I previously ran successfully for 10 years. I have since re-run it, changing only an emissions ancillary file, and now it only runs up to 6 years. This seemed strange, so I ran it again from the start in case it was a random glitch, but the same thing happens.

Here is the latest .leave file, but it contains virtually no information about why the simulation stopped after 6 years:

/home/dshawk/output/xlzch027.xlzch.d15351.t144544.leave

There is another thing I noticed, which I also saw with xlzci (a copy of xlzch with some changes to the emissions) - a job that also failed, after 30-ish years: there seems to be a problem with the archiving of the ocean dump files, as well as other ocean and sea ice files. I am not sure the scripts I am using are working properly.

The scripts can be found here:

/home/dshawk/hadgem3_scripts


Also, it should be automatically archiving to MASS via MOOSE, but it only seems to save the atmos files and not the ocean and sea ice files.

moo ls moose:/crum/xlzch
moo ls moose:/crum/xlzci

Finally, xlzci also failed, and appears only to have exceeded the time limit:

/home/dshawk/output/xlzci152.xlzci.d15351.t093015.leave

Please accept my apologies - this ticket has many questions, but they all seem to be linked.

Please could you provide some assistance on this issue?

Many thanks,

Best,
Dill

Change History (5)

comment:1 Changed 3 years ago by annette

  • Owner changed from um_support to annette
  • Status changed from new to assigned

Hi Dill,

Response to each of your issues below:

1. xlzch

This job has crashed with core dumps - an ls -alrt in the job directory shows the following:

...
-rw-------  1 dshawk ukca-imp  937418752 Dec 17 15:07 core.atp.277318.153
-rw-------  1 dshawk ukca-imp  654651392 Dec 17 15:07 core.atp.277318.0
-rw-------  1 dshawk ukca-imp  938131456 Dec 17 15:07 core.atp.277318.87
-rw-------  1 dshawk ukca-imp  965566464 Dec 17 15:07 core.atp.277318.161
-rw-------  1 dshawk ukca-imp 1021161472 Dec 17 15:07 core.atp.277318.1
-rw-------  1 dshawk ukca-imp  929705984 Dec 17 15:07 core.atp.277318.3
-rw-------  1 dshawk ukca-imp  682602496 Dec 17 15:07 core.atp.277318.162
...

There is actually an error message in the leave file but it is very well hidden! I only found it by searching for "core":

Forcing core dumps of ranks 87, 1, 3, 153, 161, 162, 0
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat
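
For future reference, a quick way to find this kind of buried message is to search the leave file directly with grep (just a sketch, using the path you quoted above):

grep -n -i -E 'error|core|abort' /home/dshawk/output/xlzch027.xlzch.d15351.t144544.leave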

The log for pe87 (/projects/ukca-imp/dshawk/xlzch/pe_output/xlzch.fort6.pe87) shows the following:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: glue_conv
? Error Code:     2
? Error Message: Deep conv went to model top at point           27 in seg   2 on
 call  1
? Error generated from processor:    87
? This run generated   2 warnings
????????????????????????????????????????????????????????????????????????????????

This is another model instability, as can be seen from the NaNs in the fields printed before the message.

You might want to double-check that the emissions files look OK, especially around the date the model crashed. Something else may be going on with these unstable jobs, though.
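
As a rough check (a sketch only - other PEs may or may not have printed NaNs), you can list which processor logs contain NaNs with:

grep -l -i 'nan' /projects/ukca-imp/dshawk/xlzch/pe_output/xlzch.fort6.pe*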

2. Archiving

The scripts you are using do not move the NEMO and CICE files to MASS. Instead they just try to move each file onto itself in the same directory, which does nothing:

mv -f /projects/ukca-imp/dshawk/xlzch/xlzcho_1y_23001201_23011130_grid_T.nc /projects/ukca-imp/dshawk/xlzch/xlzcho_1y_23001201_23011130_grid_T.nc

Possibly these were supposed to copy data to the /nerc disk on the old MONSooN?
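
For comparison only (a sketch, not a recommendation - the standard scripts below work out the correct MASS collection for you, and <collection> here is just a placeholder), archiving such a file by hand would look more like:

moo put /projects/ukca-imp/dshawk/xlzch/xlzcho_1y_23001201_23011130_grid_T.nc moose:/crum/xlzch/<collection>/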

Anyway, there are scripts for archiving NEMO and CICE files to MASS here:

/projects/ocean/hadgem3/scripts/GC2.0_XC40/
cice_archive.sh
nemo_archive.sh

These are a bit different from your versions, however.

3. xlzci

I think the model completed OK, but it hasn't finished running the post-processing scripts to rebuild the NEMO output etc. I wonder whether this step should be offloaded to a separate serial job, as we did for the vn7.3 coupled model. I will look into it and get back to you.

Hope this helps.

Annette

comment:2 Changed 3 years ago by annette

Hi Dill,

Following up on point 3) above: to move the NEMO-CICE post-processing and archiving to the archiving serial job, do the following:

  • Switch off the script releases:
    In Atmosphere → Control → Post-processing, Dumping & Meaning → User script releases
    "Specify release of user-supplied scripts…" select "No".
  • Include the following hand-edit (Input/Output Control and Resources → User hand-edit files):
    ~annette/hand_edits/vn8.2_archiving_cpl.ed

This calls the following standard scripts:

/projects/ocean/hadgem3/scripts/GC2.0_XC40/nemo_restarts.sh
/projects/ocean/hadgem3/scripts/GC2.0_XC40/nemo_mean.sh
/projects/ocean/hadgem3/scripts/GC2.0_XC40/nemo_archive.sh
/projects/ocean/hadgem3/scripts/GC2.0_XC40/cice_mean.sh
/projects/ocean/hadgem3/scripts/GC2.0_XC40/cice_archive.sh

Annette

comment:3 Changed 3 years ago by dilshadshawki

Hi Annette,

My jobs are already using those scripts, but I have also edited them myself (see point 2 above).

Basically, the edit makes them store restarts for September, rather than December and June as in the originals.

Should I still go ahead and make the changes as above?

Cheers,
Dill

comment:4 Changed 3 years ago by annette

Hi Dill,

You will need to create a new master archiving script that calls your NEMO/CICE scripts, and a hand-edit to point to the master archiving script.

What you need to do is:

1) On MONSooN, take a copy of the coupled model archiving script /projects/um1/archiving/bin/um_archiving_cpl. It looks like:

#!/bin/ksh

# Atmos archiving
python /projects/um1/archiving/bin/um_archiving.py $1 

# Set up env vars required for NEMO/CICE archiving
export RUNID=$2
export DATAM=$1
export JOBDIR=$3
export NEMO_NPROC=$4
previous_PWD=$PWD

cd $1

# NEMO/CICE archiving
/projects/ocean/hadgem3/scripts/GC2.0_XC40/nemo_restarts.sh
/projects/ocean/hadgem3/scripts/GC2.0_XC40/nemo_mean.sh
/projects/ocean/hadgem3/scripts/GC2.0_XC40/nemo_archive.sh
/projects/ocean/hadgem3/scripts/GC2.0_XC40/cice_mean.sh
/projects/ocean/hadgem3/scripts/GC2.0_XC40/cice_archive.sh

cd ${previous_PWD}

2) Edit the list of files under #NEMO/CICE archiving to be the set you wish to run.
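
For example (a sketch only - the ~/archiving location matches the hand-edit below, and the script name under /home/dshawk/hadgem3_scripts is a placeholder for whatever your edited copies are called):

cp /projects/um1/archiving/bin/um_archiving_cpl ~/archiving/um_archiving_cpl

# then edit ~/archiving/um_archiving_cpl, replacing the GC2.0_XC40 calls
# under "NEMO/CICE archiving" with your own versions, e.g.
/home/dshawk/hadgem3_scripts/nemo_archive.sh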

3) On puma, create a hand-edit that looks like the following:

#!/bin/ksh
#
# Hand-edit for vn8.2 MOOSE archiving on MONSooN Cray. 
# Run nemo/cice post-processing and archiving scripts in archiving job. 

ed SUBMIT <<\EOF
/^archi_name=/
c
archi_name=um_archiving_cpl
.
/^archi_path=/
c
archi_path=~/archiving
.
/^$archi_path\/$archi_name/
s/$/ $RUNID $JOBDIR $NEMO_NPROC/
.
w
q
EOF

Set archi_name and archi_path to be the filename and location of your master archiving script on MONSooN.

4) Make the hand-edit executable with chmod a+x.

5) Add it to your job. Save and process, then check the ~/umui_jobs/<job-id>/SUBMIT file to see that the archi_name and archi_path variables have been updated.
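
For example, a quick check (with <job-id> replaced by your job ID) would be:

grep -E '^archi_(name|path)=' ~/umui_jobs/<job-id>/SUBMIT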

Let me know if you have any difficulties with this.

Annette

comment:5 Changed 3 years ago by annette

  • Resolution set to answered
  • Status changed from assigned to closed

Closing ticket due to lack of activity.

Annette
