Opened 6 years ago

Closed 6 years ago

#1058 closed help (fixed)

Problem with post-processing/archiving with v7.3 nudged HadGEM-UKCA aerosol job on HECToR

Reported by: gmann
Owned by: um_support
Component: UM Model
Keywords: ukca
Cc:
Platform: HECToR
UM Version: 7.3

Description

Dear NCAS-CMS helpdesk,

I contacted Grenville yesterday about HECToR archiving as he had been helping me with porting a HadGEM-UKCA aerosol job from MONSOON to HECToR.

Basically we have the HadGEM-UKCA aerosol job running now on HECToR.

However, the model run doesn't seem to be post-processing the daily dumps that it is producing…..

As per my email yesterday to Grenville, I referred to my jobs xhwrd (128 cores = 4 HECToR phase 3 nodes) and xhwre (256 cores = 8 HECToR phase 3 nodes).

In both those jobs, in the UMUI I've requested the model run to delete superseded restart dumps.

But it is not doing that when the model runs (see my email below).

Grenville replied this morning (see also below) explaining that I need to select the HECToR archiving as well (those jobs had "no archiving").

I copied that 256-core run xhwre to xhwrf and added those requests and also followed the instructions on the FAQs about setting up a job to archive to the HECToR archive. I requested an account for the Large Memory Server and set the job to archive to disk in my HECToR directory /nerc/n02/n02/gmann/UM_archiving.

However, that xhwrf run seems to just be proceeding in exactly the same way as the job xhwre — the automatic-post-processing doesn't seem to be working.

Please can you advise what the problem is here?

Thanks a lot for your help,

Cheers
Graham

From: Grenville Lister g.m.s.lister@…
Sent: 30 April 2013 08:18
To: Graham Mann
Subject: Re: HECToR run not deleting superseded dumps

Graham

I ran this past Jeff Cole. He thinks you need to select HECToR archive (you currently have No archiving system) in the Post Processing window.

Regards

Grenville

On 04/26/13 13:49, Graham Mann wrote:
Hi Grenville,

My HadGEM-UKCA aerosol jobs submitting to HECToR (e.g. xhwrd on 128 cores, xhwre on 256 cores) seem not to be deleting superseded re-start dumps even though I have set them to do so.

My original job xhwrc (that I pointed you to for Polaris) had “Is automatic post-processing required” set to “N” – so I was not expecting that to delete the dumps.

Note that the job uses the Gregorian calendar as it is a nudged run, and the “Dumping and Meaning” is set to “Regular Frequency dumps for Gregorian calendar meaning” and has set “Restart Dumps every” to 24 hours.

But I copied my xhwrc job to my xhwrd and set that “Is automatic post-processing required” to “Y” and also clicked on “Delete superseded restart dumps”, “Delete superseded PP files” and “Delete superseded means files”.

Actually I probably shouldn’t have clicked on the delete superseded PP files and means files – really I just want to delete the superseded dumps….

But anyway – when I submitted that xhwrd it ran the first month NRUN step OK but all the daily dumps were still there – exactly as in my xhwrc run that had set automatic post-processing to “N”.

Do you know why this is?

Is it a problem related to the run using the Gregorian calendar?

Or have I not set something quite right here…..

Thanks a lot for your help,

Cheers
Graham


6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 17:53 xhwrea.dak4920
6489452 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 18:01 xhwrea.dak4930
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 18:08 xhwrea.dak4940
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 18:15 xhwrea.dak4950
6489460 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 18:23 xhwrea.dak4960
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 18:30 xhwrea.dak4970
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 18:37 xhwrea.dak4980
6489460 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 18:44 xhwrea.dak4990
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 18:51 xhwrea.dak49a0
6489460 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 18:58 xhwrea.dak49b0
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 19:05 xhwrea.dak49c0
6489460 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 19:12 xhwrea.dak49d0
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 19:19 xhwrea.dak49e0
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 19:26 xhwrea.dak49f0
6489460 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 19:33 xhwrea.dak49g0
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 19:39 xhwrea.dak49h0
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 19:46 xhwrea.dak49i0
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 19:54 xhwrea.dak49j0
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 20:01 xhwrea.dak49k0
6489460 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 20:08 xhwrea.dak49l0
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 20:15 xhwrea.dak49m0
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 20:22 xhwrea.dak49n0
6489464 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 20:29 xhwrea.dak49o0
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 20:36 xhwrea.dak49p0
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 20:43 xhwrea.dak49q0
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 20:51 xhwrea.dak49r0
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 20:58 xhwrea.dak49s0
6489460 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 21:05 xhwrea.dak49t0
6489456 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 21:12 xhwrea.dak49u0
6489460 -rw-r--r-- 1 gmann n02 6645170176 2013-04-24 21:19 xhwrea.dak4a10


Dr. Grenville Lister
Computational Scientist
National Centre for Atmospheric Science
Department of Meteorology
University of Reading
Early Gate
Reading RG6 6BB

email: g.m.s.lister@…
phone: 0118 378 6021

Change History (6)

comment:1 Changed 6 years ago by grenville

Hi Graham

Not sure if this is the solution, but you're pointing to an old archiving branch - Jeff's latest is revision 11307 (you are pointing to 9947).

Grenville

comment:2 Changed 6 years ago by gmann

Hi Grenville,
Oh — yes — sorry, I should have checked Trac for the appropriate revision number.
That r9947 was the revision I was using on MONSOON which worked OK archiving to /nerc on MONSOON….
But checking on Trac I see now Jeff has updated his branch several times since that revision…
In particular I see now that the revisions are clearly labelled with r10644 required to archive to /nerc on HECToR and that latest r11307 has improvements to use the Large Memory Server…..
Hopefully this will work OK now — thanks for your help,
I'll try to engage brain before asking for help next time!
Cheers
Graham

comment:3 Changed 6 years ago by gmann

Hi Grenville & NCAS CMS helpdesk-ers,

I updated my job xhwrc to use revision 11307 of Jeff's vn7.3_hector_monsoon_archiving branch (rather than r9947).

However, when I re-ran the job last night it still did exactly the same thing — I'm getting daily dumps in my /work/n02/n02/gmann/um/xhwrf/ directory and even though I'm asking for superseded dumps to be deleted, this is still not occurring.

And the job is also not copying any files to my HECToR archive directory at /nerc/n02/n02/gmann/UM_archiving/ despite this also being specified in the UMUI for this job.

Please can someone take a look and see if they can see what could be causing the problem…..

One thing has occurred to me that could be different about the job I'm running from usual UM jobs……

As well as using the UKCA sub-model including GLOMAP-mode, the job also uses the RADAER module which diagnoses the optical properties (scattering & absorption efficiencies, asymmetry parameter) of the GLOMAP-mode simulated aerosol….

That RADAER module requires some additional scripts in order to operate on HECToR.

I'm wondering whether it could be possible that some aspect of the post-processing might rely on some part of the script that has been modified by RADAER.

See in the UMUI my job xhwrf has active "Script Inserts & Modifications" using top-insert:

/work/n02/n02/mdalvi/smexec73_v2/copy_smexec73_v2

which copies some small executables associated with the RADAER code…..

This copies the 4 pre-built executables:
qxcombine qxpickup qxhistreport qxsetup

from Mohit's directory into the bin directory for the particular run:

cp /work/n02/n02/mdalvi/smexec73_v2/qx* $WORKDIR/um/$RUNID/bin

I'm wondering whether there could be some clash here in that maybe the HECToR-MONSOON archiving branch needs these in some modified form which then is over-written by the copying of the pre-built small executables?

Just a thought…..
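
One way to check this hypothesis would be to compare the small executables in the run's bin directory against the pre-built copies. The following is only a minimal sketch: the function name `compare_smexec` is hypothetical (not part of the UM scripts), and it assumes both directories are readable from where it is run. A result of "identical" would suggest the run is using the pre-built executables rather than anything built from the archiving branch.

```shell
# Hypothetical helper (an assumption, not part of the UM scripts):
# report, for each qx* small executable, whether the copy in the
# run's bin directory matches the pre-built one byte for byte.
# Usage: compare_smexec <prebuilt-dir> <run-bin-dir>
compare_smexec() {
    for name in qxcombine qxpickup qxhistreport qxsetup; do
        if [ ! -f "$2/$name" ]; then
            # Nothing with this name in the run's bin directory
            echo "$name: missing"
        elif cmp -s "$1/$name" "$2/$name"; then
            # cmp -s is silent and returns 0 only for identical files
            echo "$name: identical"
        else
            echo "$name: differs"
        fi
    done
}
```

For this job it could be called as `compare_smexec /work/n02/n02/mdalvi/smexec73_v2 $WORKDIR/um/$RUNID/bin`.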

A 2nd possibility is that I see that in the HECToR archiving FAQs at:

http://cms.ncas.ac.uk/wiki/Hector/NercArchiving

it says that:

"For the 7.3 version of HadGEM3-A_r2.0 a different branch must be used, namely fcm:um_br/dev/jeff/VN7.3_HadGEM3-A_r2.0_hector_monsoon_archiving/src. This branch is a replacement for the standard HadGEM3-A_r2.0 branch fcm:um_br/pkg/Share/VN7.3_HadGEM3-A_r2.0/src."

I did read this, but the thing is that I am using a UKCA branch that has been combined on top of the HadGEM3-A_r2.0 branch — that's my branch:

/dev/gmann/vn7.3_HG3r2_mergCJ_nprim_Radv2_HECToR

So I couldn't just switch to using Jeff's "VN7.3_HadGEM3-A_r2.0_hector_monsoon_archiving" branch.

For this reason, on MONSOON I've been using Jeff's "plain" VN7.3_hector_monsoon_archiving branch….

But it occurs to me that maybe there could be something here that is causing the archiving not to work with something that was added in to HadGEM3-A-r2.0????

By the way, that above branch includes the RADAER code (so could also be checked for any clashes with the code coming in via the HECToR archiving branch)….

Thanks for any help you can give,

Cheers
Graham

comment:4 Changed 6 years ago by gmann

Sorry the job is xhwrf not xhwrc

comment:5 Changed 6 years ago by jeff

Hi Graham

I have taken a look at your job. The auto-archiving isn't working because two of the UKCA subroutines (fastj_inphot and fastjx_inphot) read a file on unit 8. This unit is needed by the archiving system and shouldn't be used elsewhere in the UM. By the time the archiving comes to use this unit, it has been closed by the other routines, so the output goes to fort.8 instead of the correct file. If you change the unit number in these routines to one unused by the UM, the archiving should hopefully work correctly.

Jeff.
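
To locate statements like the ones described above, one option is to search the source tree for hard-coded uses of unit 8. This is a minimal shell sketch rather than a definitive tool: the function name `find_unit8` and the regular expression are assumptions, and the pattern only catches a literal unit number (for example `OPEN(8`, `OPEN(UNIT=8` or `READ(8`), not units passed through variables or parameters.

```shell
# Hypothetical helper (an assumption, not part of the UM):
# list Fortran I/O statements that use a literal unit 8.
# Usage: find_unit8 <source-directory>
find_unit8() {
    # -r recurse, -n print line numbers, -i ignore case, -E extended regex.
    # Matches OPEN/READ/WRITE/CLOSE followed by "(8" or "(UNIT=8", and
    # requires a non-digit after the 8 so units such as 80 are not reported.
    grep -rniE '(open|read|write|close)[[:space:]]*\([[:space:]]*(unit[[:space:]]*=[[:space:]]*)?8([^0-9]|$)' "$1"
}
```

Each hit is printed as `file:line:statement`, which gives a list of candidate places to change the unit number.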

comment:6 Changed 6 years ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed