Opened 4 months ago

Last modified 4 weeks ago

#2682 assigned error

Model fails to write temp_hist.0058

Reported by: swr05npk Owned by: grenville
Priority: normal Component: UM Model
Keywords: temporary,history,file Cc:
Platform: ARCHER UM Version: 10.3

Description

With suite u-at580, I tried to run a MetUM-GOML simulation for 1.5 years at N216 in the ARCHER "long" queue. Twice in a row, the atmosphere has crashed after exactly nine months and 20 days (290 days) because the UM fails to write the temporary history file "temp_hist.0058":

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 4013
?  Error from routine: U_MODEL_4A
?  Error message: Temphist: Failed in OPEN of history file
?  Error from processor: 0
?  Error number: 20561
????????????????????????????????????????????????????????????????????????????????

[0] exceptions: An non-exception application exit occured.
[0] exceptions: whilst in a serial region
[0] exceptions: Task had pid=45749 on host nid01939
[0] exceptions: Program is "toyatm"

This cannot be a coincidence. I have plenty of disk space, so it's not a quota problem. The model writes the first 57 temporary history files correctly, so why not the 58th file?

In the meantime, I guess I'm restricted to running simulations in nine-month steps.

Full log output:
/home/n02/n02/pappas/work/cylc-run/u-at580/log.20181121T120658Z/job/20020601T0000Z/coupled/01/job.err

Thanks,
Nick

Change History (7)

comment:1 Changed 4 months ago by swr05npk

More output from job.out:

FILE_MANAGER: Assigned : history_archive/temp_hist.0058
FILE_MANAGER:          : Unit : 100 (fortran)
U_MODEL_4A:Failure writing temporary restart file
Check for problems and restart from main file

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 4013
?  Error from routine: U_MODEL_4A
?  Error message: Temphist: Failed in OPEN of history file
?  Error from processor: 0
?  Error number: 20561
????????????????????????????????????????????????????????????????????????????????

Could not find NEMO output file: ocean.output
Could not find NEMO solver file: solver.stat
--------------------------------------------------------------------------------

Resources requested: ncpus=744,place=free,walltime=48:00:00
Resources allocated: cpupercent=1,cput=00:00:46,mem=230124kb,ncpus=744,vmem=926420kb,walltime=20:38:37

comment:2 Changed 4 months ago by willie

Hi Nick,

There is something very wrong with some of the file names. If you do ls -lthrb in the coupled directory you get

:
-rw-r--r-- 1 pappas n02 556M Nov 20 20:56 KPP.restart.00515\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ .1
-rw-r--r-- 1 pappas n02 685M Nov 20 20:56 KPP.restart.00515\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ .2
-rw-r--r-- 1 pappas n02 2.8K Nov 21 11:54 KPPout85
:
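A minimal sketch of how such names could be flagged automatically, rather than spotted by eye in ls output. The helper name and the demonstration file names are illustrative assumptions, not part of the suite.

```python
import os
import tempfile


def names_with_whitespace(directory):
    """Return the entries in `directory` whose names contain any whitespace."""
    return sorted(
        name for name in os.listdir(directory)
        if any(ch.isspace() for ch in name)
    )


# Demonstration in a throwaway directory (file names are illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("KPP.restart.00515   .1", "KPPout85"):
        open(os.path.join(tmp, name), "w").close()
    suspicious = names_with_whitespace(tmp)
    for name in suspicious:
        # repr() makes embedded spaces visible, much like ls -b does.
        print(repr(name))
```

Printing with repr() serves the same purpose as the -b flag to ls: the embedded spaces become visible.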

Willie

comment:3 Changed 4 months ago by willie

Hi Nick,

This is related to the variable KPP_RESTART_CYCLE:

./suite.rc:            KPP_RESTART_FILE = '${ROSE_DATA}/{{KPP_RST}}/KPP.restart.${KPP_START_CYCLE}'
./suite.rc:                        ln -sf ${ROSE_DATA}'/'${KPP_RST}'/KPP.restart.'${KPP_START_CYCLE_PAD}.1 ${ROSE_DATA}'/'${KPP_RST}'/KPP.restart.'${KPP_START_CYCLE}.1 ; 
./suite.rc:                        ln -sf ${ROSE_DATA}'/'${KPP_RST}'/KPP.restart.'${KPP_START_CYCLE_PAD}.2 ${ROSE_DATA}'/'${KPP_RST}'/KPP.restart.'${KPP_START_CYCLE}.2"""

I notice that prior to the error above you are getting

readlink: invalid option -- 'N'

which is a result of ln -sf on the strange name.

Willie

comment:4 Changed 4 months ago by grenville

Nick

The model fails to open history_archive/temp_hist.0058 because it's trying to do so on unit 100. By default, units 100-102 are reserved for stdin, stdout and stderr.

It's odd that we've not seen this before. We're looking for a simple fix.
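As an illustration only, one shape such a fix could take is a unit allocator that skips the reserved range. This is a hypothetical Python sketch, not the UM's FILE_MANAGER (which is Fortran); the function name and the start and stop bounds are assumptions.

```python
# Units 100-102 are treated as reserved, following the comment above.
RESERVED_UNITS = frozenset({100, 101, 102})


def next_free_unit(in_use, start=10, stop=300):
    """Return the lowest unit in [start, stop) that is neither reserved
    nor already in use. The bounds are assumed values for illustration."""
    for unit in range(start, stop):
        if unit not in RESERVED_UNITS and unit not in in_use:
            return unit
    raise RuntimeError("no free unit number available")


# Pretend units 10-99 are already taken, so the next candidate would be 100.
in_use = set(range(10, 100))
unit = next_free_unit(in_use)
print(unit)  # 103: the reserved units 100-102 are skipped
```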

Grenville

comment:5 Changed 4 months ago by willie

Hi Nick,

In your suite.rc file, you have the commands

ln -sf ${ROSE_DATA}'/'${KPP_RST}'/KPP.restart.'${KPP_START_CYCLE_PAD}.1 ${ROSE_DATA}'/'${KPP_RST}'/KPP.restart.'${KPP_START_CYCLE}.1
 
ln -sf ${ROSE_DATA}'/'${KPP_RST}'/KPP.restart.'${KPP_START_CYCLE_PAD}.2 ${ROSE_DATA}'/'${KPP_RST}'/KPP.restart.'${KPP_START_CYCLE}.2

but the link command is

ln -sf target link_name

so the arguments in suite.rc are round the wrong way. If successful, you should have four files beginning KPP.restart, two being links. Currently you have only two and neither is a link.
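The argument order, and a way to check which entries are really links, can be sketched as follows. Python's os.symlink takes its arguments in the same order as ln -s; the file names here are placeholders chosen for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "existing_file")   # must already exist to be useful
    link_name = os.path.join(tmp, "new_link")     # the name that will be created
    open(target, "w").close()

    # Same order as: ln -s target link_name
    os.symlink(target, link_name)

    is_link = os.path.islink(link_name)         # the new name is a symlink
    points_to = os.readlink(link_name)          # where the symlink points
    target_is_link = os.path.islink(target)     # the original is a regular file
    print(is_link, points_to == target, target_is_link)
```

Running the same islink check over the real directory would show directly whether the KPP.restart entries are links or regular files.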

Also in the bin/set_restart_five.ksh

ln -sf ${directory}/KPP.restart.${my_startt} ${directory}/' fort.30'

puts a space in the directory name, which can cause problems later if you do not escape the space with a backslash.

Willie

comment:6 Changed 4 months ago by swr05npk

Hi Willie,

The symbolic links work fine. If you're looking at the restart files 'KPP.restart.51*', they're left over from some failed runs during development. The rest of the links work well, e.g.

lrwxrwxrwx 1 pappas n02        89 Nov 26 01:11 KPP.restart.1860.2 -> /work/n02/n02/pappas/cylc-run/u-at580/share/data/History_Data/KPPhist/KPP.restart.01860.2
lrwxrwxrwx 1 pappas n02        89 Nov 26 01:11 KPP.restart.1860.1 -> /work/n02/n02/pappas/cylc-run/u-at580/share/data/History_Data/KPPhist/KPP.restart.01860.1

I made those links because I couldn't work out how to get Rose to zero-pad the initial time in the namelist for my ocean model. The model expects to read a file called 'KPP.restart.[start]', where [start] is the same as the start time in the model. However, it writes out restart files with zero-padded time codes, like '01860' in the example above. The links are just my workaround for the zero-padding issue.
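The workaround can be sketched in Python as follows. The width of five digits is inferred from the '01860' example, and the function names and the temporary directory are assumptions made for illustration; the suite itself does this with ln -sf.

```python
import os
import tempfile


def padded_name(start, width=5, suffix=""):
    """Name the model writes, e.g. KPP.restart.01860.1 (width is assumed)."""
    return f"KPP.restart.{start:0{width}d}{suffix}"


def unpadded_name(start, suffix=""):
    """Name the model expects to read, e.g. KPP.restart.1860.1."""
    return f"KPP.restart.{start}{suffix}"


def link_unpadded_to_padded(directory, start, suffixes=(".1", ".2")):
    """Create unpadded symlinks pointing at the zero-padded restart files."""
    created = []
    for suffix in suffixes:
        target = os.path.join(directory, padded_name(start, suffix=suffix))
        link = os.path.join(directory, unpadded_name(start, suffix=suffix))
        if os.path.lexists(link):
            os.remove(link)  # mimic the -f in ln -sf
        os.symlink(target, link)
        created.append(link)
    return created


# Demonstration with empty stand-in restart files.
with tempfile.TemporaryDirectory() as tmp:
    for sfx in (".1", ".2"):
        open(os.path.join(tmp, padded_name(1860, suffix=sfx)), "w").close()
    created = link_unpadded_to_padded(tmp, 1860)
    link_names = [os.path.basename(p) for p in created]
    all_links = all(os.path.islink(p) for p in created)
    print(link_names, all_links)
```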

This suite does not use 'set_restart_five.ksh'. The space is actually in the filename.

Cheers,
Nick

comment:7 Changed 4 weeks ago by willie

  • Owner changed from um_support to grenville
  • Status changed from new to assigned