Opened 5 years ago
Closed 4 years ago
#1587 closed help (answered)
8.2 dump in 7.5 model reconfiguration job
| Reported by: | chollow | Owned by: | um_support |
|---|---|---|---|
| Component: | UM Model | Keywords: | |
| Cc: | | Platform: | |
| UM Version: | <select version> | | |
Description
Hi,
I'm testing a 7.5 reconfiguration job (for a limited-area 12-km model) for Fadzil to see if it will work for him. However, the UM start dump I am using as my input file (which was reconfigured from an ECMWF GRIB2 file with a special set of scripts) appears to be version 8.2; I am inferring this from the rcf.leave file, and I've copied some of its output below. This may be why my reconfiguration is failing, although the leave file contains a more specific complaint.
The job is xlned on ARCHER
The output file is: /home/n02/n02/chollow/output/xlned000.xlned.d15159.t182008.rcf.leave
The error is:
Rank 138 [Mon Jun 8 20:33:55 2015] [c3-1c0s13n3] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 138
Rank 120 [Mon Jun 8 20:33:55 2015] [c3-1c0s13n3] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 120
Error Code:- 2
Error Message:- Cant find required STASH item 376 section 0 model 1 in STASHmaster
Error generated from processor 0
I looked up this stash item (in an 8.2 job) and it is:
SNOW DEPTH ON GROUND IN TILES (M)
There are a few related snow items, 377-386, which I don't think are in version 7.5 (or at least not all of them).
Is there a way around this? Or is it never advisable to have your input dump be from a later model version?
Thanks,
Chris
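One way to check which of these items a given UM version knows about is to scan its STASHmaster file. The sketch below is only illustrative: it assumes the usual STASHmaster_A layout, in which each record starts with a line of the form `1| model | section | item | name |`. The function names and the two sample records are made up for the example, and the location of the real 7.5 STASHmaster file is not given in this ticket.

```python
def stash_items(text):
    """Return {(model, section, item): name} from STASHmaster-style text.

    Assumes the first line of each record looks like
    '1|    1 |    0 |  376 |SNOW DEPTH ON GROUND ON TILES |'.
    """
    found = {}
    for line in text.splitlines():
        if not line.startswith("1|"):
            continue  # only the first line of a record carries the item number
        fields = [f.strip() for f in line.split("|")]
        # fields[0] is the record-line number "1"; then model, section, item, name
        try:
            model, section, item = (int(f) for f in fields[1:4])
        except ValueError:
            continue  # header or malformed line, not a real record
        found[(model, section, item)] = fields[4] if len(fields) > 4 else ""
    return found


def missing_items(text, wanted, model=1, section=0):
    """List the item numbers in `wanted` that have no record in `text`."""
    known = stash_items(text)
    return [i for i in wanted if (model, section, i) not in known]


# Hypothetical two-record sample standing in for a 7.5 STASHmaster_A file
SAMPLE = """\
1|    1 |    0 |    2 |U COMPNT OF WIND AFTER TIMESTEP     |
2|    2 |    0 |    1 |    1 |   18 |   -1 |   -1 |    0 |    0 |    0 |    0 |
1|    1 |    0 |   23 |SNOW AMOUNT OVER LAND AFT TSTP KG/M2|
"""

if __name__ == "__main__":
    # Items 376-386 are the snow fields mentioned above
    print(missing_items(SAMPLE, range(376, 387)))
```

Run against the real file (read with `open(path).read()`), any item numbers it prints would be candidates for the reconfiguration to trip over.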
Change History (9)
comment:1 Changed 4 years ago by grenville
comment:2 Changed 4 years ago by grenville
Correction: I should have said because it's not in your 7.5 model (xlned).
comment:3 Changed 4 years ago by chollow
The single file to ignore item 0 376 worked, thanks.
However, I now have a job for the actual run (xlnee), which uses an executable from xlneb.
This run stops after about 2 seconds with a tiny leave file:
/home/n02/n02/chollow/output/xlnee000.xlnee.d15168.t152756.leave
The only error I can find is:
/var/spool/PBS/mom_priv/jobs/2959144.sdb.SC[11]: .[329]: .: UMScr_TopLevel: cannot open [No such file or directory]
I don't know what's wrong; I ran it twice and got the same result.
comment:4 Changed 4 years ago by grenville
Chris
It looks like you don't have a bin directory for the xlnee job - copy over the bin directory from /work/n02/n02/chollow/xlneb into /work/n02/n02/chollow/xlnee.
Grenville
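The copy suggested above could be scripted roughly as follows. This is a minimal sketch, not a UM tool: `ensure_bin` is a hypothetical helper name, and it assumes that the job scripts (such as UMScr_TopLevel) live in a `bin` directory directly under the job's work directory.

```python
import shutil
from pathlib import Path


def ensure_bin(job_dir, donor_dir):
    """Copy donor_dir/bin into job_dir/bin if the job has no bin directory.

    Returns True if a copy was made, False if bin was already there.
    """
    target = Path(job_dir) / "bin"
    source = Path(donor_dir) / "bin"
    if target.is_dir():
        return False  # nothing to do, leave the existing scripts alone
    if not source.is_dir():
        raise FileNotFoundError(f"no bin directory to copy from: {source}")
    shutil.copytree(source, target)  # keeps file modes, so scripts stay executable
    return True
```

For the jobs in this ticket the call would be `ensure_bin("/work/n02/n02/chollow/xlnee", "/work/n02/n02/chollow/xlneb")`.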
comment:5 Changed 4 years ago by chollow
I fixed the above by switching on "enable build of UM scripts". After compiling the scripts, I switched this back off and ran again; the job failed slightly later, so I turned off archiving. After this, there was one more error (too many NS and EW processors, max 10 allowed), so I changed the NS and EW processors to 8 each. The run then completed 1 full day. Yay! My only problem is the archiving, which generated the following error:
lib-4536 : UNRECOVERABLE library error
Assign processing requires that environment variable FILENV be set.
Rank 11 [Wed Jun 17 17:25:53 2015] [c2-1c0s1n0] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 11
Rank 132 [Wed Jun 17 17:25:53 2015] [c2-1c0s6n3] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 132
Rank 143 [Wed Jun 17 17:25:53 2015] [c2-1c0s6n3] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 143
_pmiu_daemon(SIGCHLD): [NID 01947] [c2-1c0s6n3] [Wed Jun 17 17:25:53 2015] PE RANK 132 exit signal Aborted
[NID 01947] 2015-06-17 17:25:53 Apid 15100233: initiated application termination
xlnee: Run failed
This is in the leave file:
/home/n02/n02/chollow/output/xlnee000.xlnee.d15168.t163345.leave
I'm not sure why archiving didn't work, but otherwise the run seems OK.
Chris
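As a quick check on the 8 x 8 decomposition mentioned above, the arithmetic can be sketched as below. It assumes one MPI task per core and no extra tasks (for example I/O servers), and it uses the 24 cores per node of ARCHER's Cray XC30 compute nodes; `layout` is just an illustrative name.

```python
import math

CORES_PER_NODE = 24  # ARCHER (Cray XC30) compute nodes had 24 cores


def layout(ew, ns, cores_per_node=CORES_PER_NODE):
    """Return (MPI tasks, whole nodes needed, cores reserved)."""
    tasks = ew * ns
    nodes = math.ceil(tasks / cores_per_node)
    return tasks, nodes, nodes * cores_per_node


print(layout(8, 8))  # (64, 3, 72)
```

The ncpus=72 reported in a later leave file in this ticket would be consistent with 64 tasks rounded up to three whole nodes, though that link is an inference rather than something the logs state.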
comment:6 Changed 4 years ago by grenville
Chris
Did archiving ever work?
Grenville
comment:7 Changed 4 years ago by chollow
Hi Grenville,
No. I tried ticking "yes" for archiving and re-running. It ran out of time, but I don't think that was the real problem. Here is the excerpt from the latest xlnee leave file:
lib-4536 : UNRECOVERABLE library error
Assign processing requires that environment variable FILENV be set.
cp: writing `/nerc/n02/n02/chollow/chollow/xlnee/xlneea.pfl24uc': Invalid argument
cp: failed to extend `/nerc/n02/n02/chollow/chollow/xlnee/xlneea.pfl24uc': Invalid argument
/work/n02/n02/chollow/xlnee/bin/qshector_arch[52]: : cannot open
/work/n02/n02/chollow/xlnee/bin/qshector_arch[53]: : cannot open
/work/n02/n02/chollow/xlnee/bin/qshector_arch[54]: : cannot open
=>> PBS: job killed: walltime 616 exceeded limit 600
/var/spool/PBS/mom_priv/jobs/2962278.sdb.SC[11]: .[329]: .: line 248: 17488: Terminated
/work/n02/n02/chollow/xlnee/bin/qsserver[515]: .: line 97: 21072: Terminated
----------------------------------------

Resources requested: ncpus=72,place=free,walltime=00:10:00
Resources allocated: cpupercent=0,cput=00:00:01,mem=19788kb,ncpus=72,vmem=224028kb,walltime=00:10:16
This is in:
/home/n02/n02/chollow/output/xlnee000.xlnee.d15170.t102547.leave
Chris
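The PBS message and the resource summary in the excerpt say the same thing in two notations: the limit is quoted in seconds in one place and as HH:MM:SS in the other. A small conversion makes the overrun explicit; `to_seconds` is an illustrative helper, not part of PBS.

```python
def to_seconds(hms):
    """Convert a PBS-style 'HH:MM:SS' walltime string to seconds."""
    h, m, s = (int(p) for p in hms.split(":"))
    return h * 3600 + m * 60 + s


requested = to_seconds("00:10:00")  # walltime requested
used = to_seconds("00:10:16")       # walltime allocated/used
print(requested, used, used - requested)  # 600 616 16
```

So the job was killed 16 seconds past its 10-minute limit, after using only 1 second of CPU time according to the cput field.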
comment:8 Changed 4 years ago by grenville
Chris
Please switch off the archiving and try again - your model ran to completion (for 288 time steps).
We never upgraded archiving for UM7.5 (and probably won't). Can you manage the data transfers as a post-processing step? How much data will come out of these runs?
Grenville
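If the transfers are handled as a separate post-processing step, something along the following lines might do. It is a rough sketch under stated assumptions: `transfer` and the `xlneea.p*` pattern are hypothetical, the source and destination paths are taken from the log excerpts in this ticket, and a size comparison is only a weak check that the copy succeeded.

```python
import shutil
from pathlib import Path


def transfer(src_dir, dest_dir, pattern="*"):
    """Copy matching files from src_dir to dest_dir, checking sizes.

    Returns the names of the files copied. Files already present at the
    destination with the same size are skipped, so the step can be re-run.
    """
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(src_dir.glob(pattern)):
        if not src.is_file():
            continue
        dest = dest_dir / src.name
        if dest.exists() and dest.stat().st_size == src.stat().st_size:
            continue  # already transferred
        shutil.copy2(src, dest)
        if dest.stat().st_size != src.stat().st_size:
            raise OSError(f"size mismatch after copying {src.name}")
        copied.append(src.name)
    return copied
```

A call such as `transfer("/work/n02/n02/chollow/xlnee", "/nerc/n02/n02/chollow/chollow/xlnee", "xlneea.p*")` would run after the model has finished, keeping the copy out of the model job's walltime.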
comment:9 Changed 4 years ago by grenville
- Resolution set to answered
- Status changed from new to closed
Chris
I'll close this for now.
Grenville
Chris
Try telling the model to ignore this stash item. You'll need to make a user stash file; have a look at /home/grenville/USERSTASH/241_ignore (on PUMA). You'd need to change the Section to 0 and the Item to 376; the spacing in the stash record must not be changed. Include the user stash file in the user stashmaster section of the UMUI.
You may need to do this for more than one field.
Do you have a UM 7.5 dump from your previous runs? I'm a bit unsure why it's asking about stash item 0 376 because it's not in your 7.5 model (xlnif).
Grenville