Opened 8 weeks ago

Last modified 12 days ago

#3100 accepted help

Post-processing error running UKESM on MONSOON during 5-month PI control run

Reported by: gmann Owned by: ros
Component: UKESM Keywords: MOOSE
Cc: earfw, gmann@… Platform: Monsoon2
UM Version: 11.2

Description

Dear NCAS-CMS helpdesk,

Wuhu Feng is getting a problem with post-processing on MONSOON when running the copy of the UKESM Pre-Industrial control.

He has successfully run a 20-year copy of the UKESM Pre-Industrial control on MONSOON (bc694), which is his suite bp061, and that works fine and archives to MASS OK.

We have now progressed to running a 5-month step-through from 1st Jan 1991 to end-May 1991, so that we can then run the UKESM volc-pinatubo ensemble for VolMIP.

To do that, Wuhu has made the following changes in his suite bp286:

1) start time to 1st Jan 1991 and run-length to 5 months.

2) 5 restart dumps changed to selected year's initial conditions from the aw310 Pre-Industrial control (to give the required ENSO and NAO phase during the 1st post-eruption winter).

3) post-processing -> atmosphere -> archiving: dump frequency changed from yearly to monthly

4) recycling-period (resubmission pattern) from 3-month in bc694 to 1-month

(the 4th one he only did in some runs and then changed it back).

The 5-month job runs the 1st month OK and gets part-way through the 2nd month, then fails with the error message:

[WARN] file:atmospp.nl: skip missing optional source: namelist:archer_arch
[WARN] file:nemocicepp.nl: skip missing optional source: namelist:archer_arch
[WARN] file:pptransfer.nl: skip missing optional source: namelist:archer_arch
[WARN] file:pptransfer.nl: skip missing optional source: namelist:pptransfer
[WARN] file:atmospp.nl: skip missing optional source: namelist:script_arch
[WARN] file:nemocicepp.nl: skip missing optional source: namelist:script_arch
[FAIL] check_directory: Exiting - Directory does not exist: /home/d03/wufen/cylc-run/u-bp286/work/19910301T0000Z/coupled
[FAIL] Terminating PostProc
[FAIL] main_pp.py atmos # return-code=1
2019-11-27T19:58:51Z CRITICAL - failed/EXIT

This error has occurred on several re-submissions, so it is not a one-off sporadic failure; it is a consistent error that seems to be specific to this suite configuration.

Wuhu has sent me the file-path to the log files (see below). He is also puzzled that the model seems to store only some of these files while it is running: it seems semi-random which directories are retained within the bp286/work/ directory.

Please can you advise what you think is causing this post-processing error?

This is quite urgent, because these runs are for the CMIP6 VolMIP submission of UKESM, and we're nearly ready to submit the 27-member ensemble for the volc-pinatubo run, but these frustrating problems with post-processing are delaying us making progress.

Cheers
Graham

wufen@xcslc0:~/cylc-run/u-bp286/log/job/19910301T0000Z> ls -lrt /home/d03/wufen/cylc-run/u-bp286/work/
total 24
drwxr-xr-x 2 wufen mo_users 4096 Nov 27 16:22 19910201T0000Z
drwxr-xr-x 9 wufen mo_users 4096 Nov 27 19:49 19910301T0000Z
drwxr-xr-x 5 wufen mo_users 4096 Nov 27 20:08 19910111T0000Z
drwxr-xr-x 36 wufen mo_users 4096 Nov 27 20:08 19910101T0000Z
drwxr-xr-x 8 wufen mo_users 4096 Nov 27 20:08 19910106T0000Z
drwxr-xr-x 8 wufen mo_users 4096 Nov 27 20:08 19910116T0000Z


National Centre for Atmospheric Science
School of Earth and Environment, University of Leeds, Leeds, LS2 9JT
Tel: +44 113 343 3438
http://homepages.see.leeds.ac.uk/~earfw/

Change History (7)

comment:1 Changed 8 weeks ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Graham, Wuhu,

For cycle 19910301T0000Z the coupled model hasn't even run, so that is why postproc is failing. From the log files I can see a mishmash of cycles, including some that are only 5 days long (e.g. 19910101T0000Z/ 19910106T0000Z/ 19910111T0000Z/ 19910116T0000Z/ 19910301T0000Z/). I would first of all suggest doing a clean run (rose suite-run --new) to remove any old files left over from previous attempts to run this suite. If it then fails again we will be better placed to see what is going on.
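
The spacing between the cycle directories can be checked mechanically. The following is a minimal Python sketch, assuming the directory names follow the YYYYMMDDTHHMMZ pattern seen in the work/ listing in the description; the function name cycle_gaps is hypothetical and is not part of Rose or Cylc.

```python
from datetime import datetime

CYCLE_FORMAT = "%Y%m%dT%H%MZ"  # assumed from names such as 19910301T0000Z


def cycle_gaps(cycle_names):
    """Return (earlier, later, days) for each pair of consecutive cycle points."""
    points = sorted(datetime.strptime(name, CYCLE_FORMAT) for name in cycle_names)
    return [
        (a.strftime(CYCLE_FORMAT), b.strftime(CYCLE_FORMAT), (b - a).days)
        for a, b in zip(points, points[1:])
    ]


# Directory names copied from the work/ listing in the ticket description.
cycles = [
    "19910201T0000Z", "19910301T0000Z", "19910111T0000Z",
    "19910101T0000Z", "19910106T0000Z", "19910116T0000Z",
]

gaps = cycle_gaps(cycles)
for earlier, later, days in gaps:
    print(f"{earlier} -> {later}: {days} days")
```

For the six directories listed, this reports gaps of 5, 5, 5, 16 and 28 days. The 5-day steps do not match a 1-month or 3-month cycling period, which is consistent with directories left over from earlier attempts.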

Regards,
Ros.

comment:2 Changed 6 weeks ago by gmann

  • Cc gmann@… added

Hi Ros,

Wuhu and I were at AGU last week, so we missed our usual weekly meeting.

But I discussed briefly with Wuhu at the conference on Friday.

He explained that he tried your suggestion of the "rose suite-run --new" on that
u-bp286 run, but it failed with the same problem again.

Beatriz Monge-Sanz (PDRA at Oxford with Lesley Gray) is working with us on the
UKESM VolMIP analysis, and

We're meeting on Thursday 3.30-4.30pm

comment:3 Changed 6 weeks ago by gmann

Hi Ros,

Wuhu and I were at AGU last week, so we missed our usual weekly meeting.

Sorry — I tried to modify the ticket to add Beatriz in on cc and thought I could
do that by clicking the "Add to cc" box, but it just submitted the update to the
ticket instead.

Anyway — to continue the post, basically we met with Beatriz on Monday
(she visited Leeds) and discussed this further — and Wuhu is going to try
running a fresh copy of that u-bp286 suite (to a different suite id) and see
if this again fails with the same error (we're guessing maybe it is something
to do with a previous crash of that u-bp286 that somehow hasn't quite
got removed with the rose suite-run --new).

Wuhu will post to this NCAS-CMS ticket what happens with the "fresh copy"
of the suite to a different suite-id.

We think this will fail again with the same error (because that's what happened
when he did the rose suite-run --new) but we'll check that that is indeed the case
and reply to the ticket to confirm that.

Wuhu and I are meeting tomorrow afternoon 2.30-3.30pm, and I wonder whether
it could be possible to have a brief chat on the phone about this.

The thing is that it would be good for us to understand what the possible problem
might be here — my thinking was whether it could be that we've made some mistake
when changing the job to the 5-month duration 1-month-dumping run.

Maybe one of the 5 dumps for the coupled model is not quite set right, and this
is causing an internal conflict at the end of month 2?

It must be something like this because Wuhu's initial 20-year run ran fine.

As you pointed out in your reply, the coupled model hasn't run the 19910301T0000Z
cycle — so there must be some problem with the sequencing of the different parts of
the coupled model.

The only thing we've changed is for the dump-frequency but perhaps there is more
than one place we have to change this — for the ocean and ice sheet dumps as well
as for the atmosphere-model dump?

Please can you reply whether you will be around 2.30-3.30pm Thurs to potentially
have a quick chat on the phone so we can understand a bit more about what the
problem could potentially be here.

Maybe we need a quick re-cap of how one goes about changing the coupled model
dump frequency; we're used to running atmos-only runs and I guess we likely have
some additional learning here to understand how to do the equivalent in the coupled
model — it seems to me likely we've not quite done that correctly.

Would that be OK to talk on the phone (ideally 2.30pm or 3.00pm tomorrow but
can arrange for a different time if not possible then).

As I say, this is getting quite urgent and we'd like to put the ensemble of runs on
over the Christmas period if possible.

Thanks
Graham

comment:4 Changed 6 weeks ago by gmann

Sorry, we're meeting 3.30-4.30pm tomorrow not 2.30-3.30pm.

So my suggestion is to potentially talk on the phone at 3.30pm or 4.00pm.

Would that be possible at all?

Thanks
Graham

comment:5 Changed 6 weeks ago by ros

Hi Graham,

I'm on leave the rest of today, but I'm in tomorrow and can arrange to be around at 3:30pm. To be honest, though, I'm not going to be able to help until a clean run of the suite is performed. I've just looked at the "re-run" of u-bp286 but all the datestamps are still from November 27th, so I'm afraid that run hasn't been re-done with rose suite-run --new, which I can also see from the last entry in the log file. rose suite-run --new deletes all the existing working directories, including logs, of a suite (i.e. …./cylc-run/u-bp286) and starts afresh; this hasn't happened.

rhatcher@xcs-c$ pwd
/home/d03/wufen/cylc-run/u-bp286
rhatcher@xcs-c$ ls -l
total 1880
drwxr-sr-x  2 wufen ukca-leeds    4096 Nov 26 17:51 ana/
drwxr-xr-x 20 wufen mo_users      4096 Nov 26 17:42 app/
drwxr-xr-x  2 wufen mo_users      4096 Nov 26 17:42 bin/
lrwxrwxrwx  1 wufen ukca-leeds       6 Nov 26 17:51 cylc-suite.db -> log/db
lrwxrwxrwx  1 wufen ukca-leeds      20 Nov 27 16:20 log -> log.20191127T162037Z/
-rw-r--r--  1 wufen ukca-leeds 1767499 Nov 27 16:20 log.20191126T175056Z.tar.gz
drwxr-sr-x  6 wufen ukca-leeds    4096 Dec  2 00:04 log.20191127T162037Z/
drwxr-xr-x  2 wufen mo_users      4096 Nov 26 17:42 meta/
-rw-r--r--  1 wufen ukca-leeds     566 Nov 26 17:51 rose-suite.info
drwxr-sr-x 10 wufen ukca-leeds    4096 Nov 27 16:21 share/
drwxr-xr-x  2 wufen mo_users      4096 Nov 26 17:42 site/
-rw-r--r--  1 wufen ukca-leeds   20490 Nov 27 16:21 suite.rc
-rw-r--r--  1 wufen ukca-leeds   45585 Nov 27 16:21 suite.rc.processed
-rw-r--r--  1 wufen ukca-leeds    7294 Nov 26 17:51 tests-graph.rc
-rw-r--r--  1 wufen ukca-leeds   29150 Nov 26 17:51 tests-runtime.rc
-rw-r--r--  1 wufen ukca-leeds     910 Nov 26 17:51 ukesm-graph.rc
-rw-r--r--  1 wufen ukca-leeds    2071 Nov 26 17:51 ukesm-runtime.rc
lrwxrwxrwx  1 wufen ukca-leeds      40 Nov 26 17:51 work -> /working/d03/wufen/cylc-run/u-bp286/work/

Can Wuhu please try rose suite-run --new again for suite u-bp286 and post the terminal output to this ticket so I can see if there is a problem.
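
The same datestamp check can be scripted. This is a minimal Python sketch, assuming the timestamp embedded in each log.<timestamp> name marks when the suite was started, as the listing above suggests; the function name last_suite_run is hypothetical and is not a Rose or Cylc command.

```python
from datetime import datetime, timezone


def last_suite_run(entry_names):
    """Return the newest timestamp found in names like log.20191127T162037Z."""
    stamps = []
    for name in entry_names:
        # Take the field after "log." and ignore suffixes such as ".tar.gz".
        parts = name.rstrip("/").split(".")
        if len(parts) < 2 or parts[0] != "log":
            continue
        try:
            stamp = datetime.strptime(parts[1], "%Y%m%dT%H%M%SZ")
        except ValueError:
            continue  # not a timestamped log entry
        stamps.append(stamp.replace(tzinfo=timezone.utc))
    return max(stamps) if stamps else None


# Log entries copied from the directory listing above.
names = ["log", "log.20191126T175056Z.tar.gz", "log.20191127T162037Z/"]
latest = last_suite_run(names)
print(latest.isoformat())
```

Here the newest log directory dates from 27 November, and the archived log from 26 November is still present. Both are consistent with the run directory not having been wiped by rose suite-run --new.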

Regards,
Ros.

comment:6 Changed 6 weeks ago by gmann

Hi Ros,

Thanks for checking this — OK, sorry I thought he said he'd
tried that already and got the same error.

I must have misunderstood what he'd said there.

We were in touch via email earlier this morning and he indicated
he'd be able to put on the run I suggested today.

So I'll email him now to ask him to do the rose suite-run --new of
the u-bp286 run as well as doing a copy of the run and re-submitting.

I'll ask him to post the update to the ticket in the way you suggest.

Thanks
Graham

comment:7 Changed 12 days ago by ros

Hi Graham,

I assume this has been sorted out now and I can close this ticket.

Regards,
Ros.
