Opened 2 years ago
Closed 16 months ago
#2682 closed error (answered)
Model fails to write temp_hist.0058
Reported by: | swr05npk | Owned by: | grenville |
---|---|---|---|
Component: | UM Model | Keywords: | temporary,history,file |
Cc: | | Platform: | ARCHER |
UM Version: | 10.3 | | |
Description
With suite u-at580, I tried to run a MetUM-GOML simulation for 1.5 years at N216 in the ARCHER "long" queue. Twice in a row, the atmosphere has crashed after exactly nine months and 20 days (290 days) because the UM fails to write the temporary history file "temp_hist.0058":
????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 4013
? Error from routine: U_MODEL_4A
? Error message: Temphist: Failed in OPEN of history file
? Error from processor: 0
? Error number: 20561
????????????????????????????????????????????????????????????????????????????????
[0] exceptions: An non-exception application exit occured.
[0] exceptions: whilst in a serial region
[0] exceptions: Task had pid=45749 on host nid01939
[0] exceptions: Program is "toyatm"
This cannot be a coincidence. I have plenty of disk space, so it's not a quota problem. The model writes the first 57 temporary history files correctly, so why not the 58th file?
In the meantime, I guess I'm restricted to running simulations in nine month steps.
Full log output:
/home/n02/n02/pappas/work/cylc-run/u-at580/log.20181121T120658Z/job/20020601T0000Z/coupled/01/job.err
Thanks,
Nick
Change History (8)
comment:1 Changed 2 years ago by swr05npk
comment:2 Changed 2 years ago by willie
Hi Nick,
There is something very wrong with some of the file names. If you do ls -lthrb in the coupled directory you get
: -rw-r--r-- 1 pappas n02 556M Nov 20 20:56 KPP.restart.00515\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ .1 -rw-r--r-- 1 pappas n02 685M Nov 20 20:56 KPP.restart.00515\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ .2 -rw-r--r-- 1 pappas n02 2.8K Nov 21 11:54 KPPout85 :
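File names padded with spaces like these are easy to miss in a plain listing. A minimal sketch of one way to find them, assuming a scratch directory and made-up file names (nothing here touches the real suite directories):

```shell
# Build a scratch directory with one normal name and one name containing spaces.
# Both file names are illustrative stand-ins for the listing above.
scratch=$(mktemp -d)
touch "$scratch/KPPout85"
touch "$scratch/KPP.restart.00515   .1"

# find matches -name against the base name only, so '* *' selects
# every file whose name contains at least one space.
find "$scratch" -type f -name '* *'

# Count the suspect names (tr strips the padding some wc versions add).
suspect_count=$(find "$scratch" -type f -name '* *' | wc -l | tr -d ' ')
echo "files with spaces in their names: $suspect_count"
```

In a real run, `$scratch` would be replaced by the directory being inspected.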
Willie
comment:3 Changed 2 years ago by willie
Hi Nick,
This is related to the variable KPP_RESTART_CYCLE,
./suite.rc: KPP_RESTART_FILE = '${ROSE_DATA}/{{KPP_RST}}/KPP.restart.${KPP_START_CYCLE}'
./suite.rc: ln -sf ${ROSE_DATA}'/'${KPP_RST}'/KPP.restart.'${KPP_START_CYCLE_PAD}.1 ${ROSE_DATA}'/'${KPP_RST}'/KPP.restart.'${KPP_START_CYCLE}.1 ;
./suite.rc: ln -sf ${ROSE_DATA}'/'${KPP_RST}'/KPP.restart.'${KPP_START_CYCLE_PAD}.2 ${ROSE_DATA}'/'${KPP_RST}'/KPP.restart.'${KPP_START_CYCLE}.2"""
I notice that prior to the error above you are getting
readlink: invalid option -- 'N'
which is a result of ln -sf on the strange name.
Willie
comment:4 Changed 2 years ago by grenville
Nick
The model fails to open history_archive/temp_hist.0058 because it's trying to do so on unit 100. By default, units 100-102 are reserved for stdin, stdout and stderr.
It's odd that we've not seen this before. We're looking for a simple fix.
Grenville
comment:5 Changed 2 years ago by willie
Hi Nick,
In your suite.rc file, you have the commands
ln -sf ${ROSE_DATA}'/'${KPP_RST}'/KPP.restart.'${KPP_START_CYCLE_PAD}.1 ${ROSE_DATA}'/'${KPP_RST}'/KPP.restart.'${KPP_START_CYCLE}.1
ln -sf ${ROSE_DATA}'/'${KPP_RST}'/KPP.restart.'${KPP_START_CYCLE_PAD}.2 ${ROSE_DATA}'/'${KPP_RST}'/KPP.restart.'${KPP_START_CYCLE}.2"
but the link command is
ln -sf target link_name
so the arguments in suite.rc are round the wrong way. If successful, you should have four files beginning KPP.restart, two of them being links. Currently you have only two and neither is a link.
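For reference, a minimal sketch of the argument order of ln -sf, run in a scratch directory. The names real_file.dat and alias.dat are hypothetical and deliberately unrelated to the suite:

```shell
# Work in a throwaway directory so nothing real is modified.
workdir=$(mktemp -d)

# The existing file that the link should point to.
echo "restart data" > "$workdir/real_file.dat"

# First argument: the target (existing file).
# Second argument: the name of the link to create.
ln -sf "$workdir/real_file.dat" "$workdir/alias.dat"

# readlink prints the path stored in the link.
link_target=$(readlink "$workdir/alias.dat")
echo "alias.dat -> $link_target"
```

After this runs, alias.dat is the symbolic link and real_file.dat remains a regular file, which is what `ls -l` would show with an arrow from the link to its target.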
Also in the bin/set_restart_five.ksh
ln -sf ${directory}/KPP.restart.${my_startt} ${directory}/' fort.30'
puts a space in the directory name which can cause problems later if you do not backslash the space.
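A minimal sketch of the effect of the quoted ' fort.30', assuming a scratch directory and a made-up restart file name. The link is created with a leading space in its name, so a lookup of the plain name fort.30 does not find it:

```shell
# Scratch directory and a stand-in restart file (names are illustrative).
directory=$(mktemp -d)
echo "restart data" > "$directory/KPP.restart.100"

# Same shape as the command above: the quotes keep the leading space,
# so the link is literally called " fort.30".
ln -sf "$directory/KPP.restart.100" "$directory"/' fort.30'

# Check which of the two names exists.
if [ -L "$directory/ fort.30" ]; then with_space=yes; else with_space=no; fi
if [ -e "$directory/fort.30" ]; then without_space=yes; else without_space=no; fi

echo "' fort.30' exists: $with_space"
echo "'fort.30' exists:  $without_space"
```

Any later command that refers to the link has to quote the name or backslash the space in the same way.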
Willie
comment:6 Changed 2 years ago by swr05npk
Hi Willie,
The symbolic links work fine. If you're looking at the restart files 'KPP.restart.51*', they're left over from some failed runs during development. The rest of the links work well, e.g.
lrwxrwxrwx 1 pappas n02 89 Nov 26 01:11 KPP.restart.1860.2 -> /work/n02/n02/pappas/cylc-run/u-at580/share/data/History_Data/KPPhist/KPP.restart.01860.2
lrwxrwxrwx 1 pappas n02 89 Nov 26 01:11 KPP.restart.1860.1 -> /work/n02/n02/pappas/cylc-run/u-at580/share/data/History_Data/KPPhist/KPP.restart.01860.1
I made those links because I couldn't work out how to get Rose to zero-pad the initial time in the namelist for my ocean model. The model expects to read a file called 'KPP.restart.[start]', where [start] is the same as the start time in the model. However, it writes out restart files with zero-padded time codes, like '01860' in the example above. The links are just my workaround for the zero-padding issue.
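A minimal sketch of this kind of workaround, assuming a five-digit pad width (inferred from '01860') and a scratch directory in place of the real data directory. The variable names start, padded and datadir are illustrative, not the ones used in the suite:

```shell
# Scratch directory standing in for the restart directory.
datadir=$(mktemp -d)

start=1860                          # unpadded start time, as the model reads it
padded=$(printf '%05d' "$start")    # zero-padded to five digits: 01860

# Stand-ins for the restart files the model wrote with padded names.
echo "restart 1" > "$datadir/KPP.restart.$padded.1"
echo "restart 2" > "$datadir/KPP.restart.$padded.2"

# Link each unpadded name to the padded file: target first, link name second.
for n in 1 2; do
    ln -sf "$datadir/KPP.restart.$padded.$n" "$datadir/KPP.restart.$start.$n"
done

ls -l "$datadir"
```

One caveat for printf: the input should be the unpadded number, because some shells treat a value with leading zeros as octal.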
This suite does not use 'set_restart_five.ksh'. The space is actually in the filename.
Cheers,
Nick
comment:7 Changed 2 years ago by willie
- Owner changed from um_support to grenville
- Status changed from new to assigned
comment:8 Changed 16 months ago by ros
- Resolution set to answered
- Status changed from assigned to closed
More output from job.out: