Opened 5 years ago

Closed 4 years ago

#1587 closed help (answered)

8.2 dump in 7.5 model reconfiguration job

Reported by: chollow Owned by: um_support
Component: UM Model Keywords:
Cc: Platform:
UM Version: <select version>

Description

Hi,

I'm testing a 7.5 reconfiguration job (for a limited-area 12-km model) for Fadzil to see if it will work for him. However, it looks like the UM start dump I am using as my input file is version 8.2. That dump was reconfigured from an ECMWF GRIB2 file with a special set of scripts, and I am inferring its version from the rcf.leave file (I've copied some output below). This may be why my reconfiguration is failing, although the leave file gives a more specific complaint.

The job is xlned on ARCHER.

The output file is: /home/n02/n02/chollow/output/xlned000.xlned.d15159.t182008.rcf.leave

The error is:

Rank 138 [Mon Jun 8 20:33:55 2015] [c3-1c0s13n3] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 138
Rank 120 [Mon Jun 8 20:33:55 2015] [c3-1c0s13n3] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 120
Error Code:- 2
Error Message:- Cant find required STASH item 376 section 0 model 1 in STASHmaster
Error generated from processor 0

I looked up this stash item (in an 8.2 job) and it is:
SNOW DEPTH ON GROUND IN TILES (M)

There are a few neighbouring items, 377-386, which also relate to snow and which I don't think are in version 7.5 (or at least not all of them).

Is there a way around this? Or is it never advisable to use an input dump from a later model version than the model itself?

Thanks,
Chris

Change History (9)

comment:1 Changed 4 years ago by grenville

Chris

Try telling the model to ignore this STASH item. You'll need to make a user STASH file; have a look at
/home/grenville/USERSTASH/241_ignore (on PUMA). You'd need to change the Section to 0 and the Item to 376, but the spacing in the STASH record must not be changed. Include the user STASH file in the user STASHmaster section of the UMUI.

You may need to do this for more than one field.

Do you have a UM 7.5 dump from your previous runs? I'm a bit unsure why it's asking about STASH item 0 376, because it's not in your 7.5 model (xlnif).

Grenville
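Since the ignore step may be needed for more than one field, it can help to list every STASH code in the 8.2 dump that the 7.5 STASHmaster does not define before building the user STASH file. The following is a minimal Python sketch under stated assumptions: it assumes the first line of each STASHmaster record has the form `1| model | section | item | name |`, and it assumes the dump's (model, section, item) codes have already been extracted by some other means (reading the dump's lookup headers is not shown). The STASHmaster text and dump codes below are illustrative only, treating all of items 376-386 as missing, which may not be the case.

```python
import re

# ASSUMPTION: the first line of each STASHmaster record looks like
#   1|    1 |    0 |   2 |NAME ... |
# i.e. "1|" followed by model, section and item numbers separated by "|".
RECORD_START = re.compile(r"^1\|\s*(\d+)\s*\|\s*(\d+)\s*\|\s*(\d+)\s*\|")


def stashmaster_codes(text):
    """Return the set of (model, section, item) codes defined in STASHmaster text."""
    codes = set()
    for line in text.splitlines():
        match = RECORD_START.match(line)
        if match:
            codes.add(tuple(int(g) for g in match.groups()))
    return codes


def missing_codes(dump_codes, master_codes):
    """Sorted (model, section, item) codes present in the dump but absent from STASHmaster."""
    return sorted(set(dump_codes) - set(master_codes))


# Illustrative data only: a two-record STASHmaster and a dump that also
# carries items 376-386, as if none of them were known to the 7.5 model.
master_text = (
    "1|    1 |    0 |   2 |EXAMPLE FIELD A |\n"
    "1|    1 |    0 |   3 |EXAMPLE FIELD B |\n"
)
dump = [(1, 0, 2), (1, 0, 3)] + [(1, 0, item) for item in range(376, 387)]
for model, section, item in missing_codes(dump, stashmaster_codes(master_text)):
    print(f"no STASHmaster record: model {model} section {section} item {item}")
```

Each code reported this way would be a candidate for its own ignore record, built from the template file with the Section and Item changed and the record spacing left untouched.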

comment:2 Changed 4 years ago by grenville

Correction: that should read "because it's not in your 7.5 model (xlned)".

comment:3 Changed 4 years ago by chollow

The single file to ignore item 0 376 worked, thanks.

However, I now have a job for the actual run (xlnee), which uses an executable built by xlneb.

This run stops after about 2 seconds with a tiny leave file:
/home/n02/n02/chollow/output/xlnee000.xlnee.d15168.t152756.leave

The only error I can find is:
/var/spool/PBS/mom_priv/jobs/2959144.sdb.SC[11]: .[329]: .: UMScr_TopLevel: cannot open [No such file or directory]

I don't know what's wrong; I ran it twice and got the same result.
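Finding the one meaningful error line in a .leave file by eye is easy to get wrong. The following is a minimal Python sketch that scans leave-file text for the kinds of error lines quoted in this ticket. The marker strings are simply copied from those messages; they are an assumption for illustration, not an exhaustive list of UM, Cray or PBS error formats.

```python
from __future__ import annotations

# Substrings taken from the error lines quoted in this ticket; extend as needed.
ERROR_MARKERS = (
    "Error Message:-",
    "UNRECOVERABLE library error",
    "cannot open",
    "PBS: job killed",
    "MPI_Abort",
)


def find_errors(text: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) for every line containing an error marker."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(marker in line for marker in ERROR_MARKERS):
            hits.append((lineno, line.strip()))
    return hits


sample = (
    "Starting run\n"
    "/var/spool/PBS/mom_priv/jobs/2959144.sdb.SC[11]: .[329]: .: "
    "UMScr_TopLevel: cannot open [No such file or directory]\n"
    "Run finished\n"
)
for lineno, line in find_errors(sample):
    print(lineno, line)
```

For a real file, pass its contents in, e.g. `find_errors(open(path, errors="replace").read())`, where `path` points at the leave file.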

comment:4 Changed 4 years ago by grenville

Chris

It looks like you don't have a bin directory for the xlnee job. Copy the bin directory from /work/n02/n02/chollow/xlneb into /work/n02/n02/chollow/xlnee.

Grenville

comment:5 Changed 4 years ago by chollow

I fixed the above by selecting "enable build of UM scripts". After compiling the scripts, I switched this back off and ran again; it failed slightly later, so I turned off archiving. After that there was one more error (too many NS and EW processors, max 10 allowed), so I changed the NS and EW processors to 8 each. The run then completed 1 full day. Yay! My only problem is the archiving, which generated the following error:

lib-4536 : UNRECOVERABLE library error
Assign processing requires that environment variable FILENV be set.
Rank 11 [Wed Jun 17 17:25:53 2015] [c2-1c0s1n0] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 11
Rank 132 [Wed Jun 17 17:25:53 2015] [c2-1c0s6n3] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 132
Rank 143 [Wed Jun 17 17:25:53 2015] [c2-1c0s6n3] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 143
_pmiu_daemon(SIGCHLD): [NID 01947] [c2-1c0s6n3] [Wed Jun 17 17:25:53 2015] PE RANK 132 exit signal Aborted
[NID 01947] 2015-06-17 17:25:53 Apid 15100233: initiated application termination
xlnee: Run failed

This is in the leave file:
/home/n02/n02/chollow/output/xlnee000.xlnee.d15168.t163345.leave

I'm not sure why archiving didn't work, but otherwise the run seems OK.

Chris
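The processor-count fix above amounts to keeping each direction of the EW x NS decomposition within a limit and then running EW x NS MPI tasks. The following is a minimal Python sketch of that pre-submission check. The limit of 10 is the value this particular job reported and is treated as an assumption, not a general UM constant; the function name is hypothetical.

```python
# Hypothetical pre-submission check; the limit of 10 is the value reported
# by this job's error message, not a general UM constant.
MAX_PER_DIRECTION = 10


def check_decomposition(nproc_ew: int, nproc_ns: int,
                        max_per_direction: int = MAX_PER_DIRECTION) -> int:
    """Validate an EW x NS processor split and return the number of MPI tasks."""
    for direction, n in (("EW", nproc_ew), ("NS", nproc_ns)):
        if not 1 <= n <= max_per_direction:
            raise ValueError(
                f"{direction} processors = {n}; must be between 1 and {max_per_direction}"
            )
    return nproc_ew * nproc_ns


print(check_decomposition(8, 8))  # the 8 x 8 layout that worked; prints 64
```

Raising an error before submission is cheaper than waiting in the queue only for the model to abort at start-up.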

comment:6 Changed 4 years ago by grenville

Chris

Did archiving ever work?

Grenville

comment:7 Changed 4 years ago by chollow

Hi Grenville,

No. I tried ticking "yes" for archiving and re-running. The job ran out of time, but I don't think that was the real problem. Here is the excerpt from the latest xlnee leave file:

lib-4536 : UNRECOVERABLE library error
Assign processing requires that environment variable FILENV be set.
cp: writing `/nerc/n02/n02/chollow/chollow/xlnee/xlneea.pfl24uc': Invalid argument
cp: failed to extend `/nerc/n02/n02/chollow/chollow/xlnee/xlneea.pfl24uc': Invalid argument
/work/n02/n02/chollow/xlnee/bin/qshector_arch[52]: : cannot open
/work/n02/n02/chollow/xlnee/bin/qshector_arch[53]: : cannot open
/work/n02/n02/chollow/xlnee/bin/qshector_arch[54]: : cannot open
=>> PBS: job killed: walltime 616 exceeded limit 600
/var/spool/PBS/mom_priv/jobs/2962278.sdb.SC[11]: .[329]: .: line 248: 17488: Terminated
/work/n02/n02/chollow/xlnee/bin/qsserver[515]: .: line 97: 21072: Terminated
--------------------------------------------------------------------------------

Resources requested: ncpus=72,place=free,walltime=00:10:00
Resources allocated: cpupercent=0,cput=00:00:01,mem=19788kb,ncpus=72,vmem=224028kb,walltime=00:10:16

This is in:
/home/n02/n02/chollow/output/xlnee000.xlnee.d15170.t102547.leave

Chris

comment:8 Changed 4 years ago by grenville

Chris

Please switch off the archiving and try again; your model ran to completion (288 time steps).

We never upgraded archiving for UM 7.5 (and probably won't). Can you manage the data transfers as a post-processing step? How much data will come out of these runs?

Grenville
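One way to handle the data transfers as a post-processing step is a small copy script run after the model finishes. The following is a minimal Python sketch, not the UM's own archiving. It assumes the output files sit in the job's /work directory and share a filename prefix, and that the destination directory accepts writes (the /nerc target in the leave file above did not). The function name, glob pattern and destination are hypothetical.

```python
import shutil
from pathlib import Path


def transfer_outputs(run_dir, dest_dir, pattern):
    """Copy files matching pattern from run_dir to dest_dir, checking sizes.

    Returns the list of destination paths that were copied.
    """
    run_dir, dest_dir = Path(run_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(run_dir.glob(pattern)):
        if not src.is_file():
            continue
        dst = dest_dir / src.name
        shutil.copy2(src, dst)  # raises OSError if the write itself fails
        # Extra guard: a truncated copy shows up as a size mismatch.
        if dst.stat().st_size != src.stat().st_size:
            raise OSError(f"size mismatch after copying {src} -> {dst}")
        copied.append(dst)
    return copied
```

For this job it might be called as `transfer_outputs("/work/n02/n02/chollow/xlnee", dest, "xlneea.p*")`, where `dest` is whatever archive location you choose; the pattern would pick up files such as xlneea.pfl24uc.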

comment:9 Changed 4 years ago by grenville

  • Resolution set to answered
  • Status changed from new to closed

Chris

I'll close this for now.

Grenville
