Opened 3 years ago
Closed 3 years ago
#2409 closed help (fixed)
Model crashing with no obvious error
Reported by: | ajd | Owned by: | um_support |
---|---|---|---|
Component: | UM Model | Keywords: | |
Cc: | | Platform: | Monsoon2 |
UM Version: | 10.7 | | |
Description
Hi CMS,
A few of my suites have crashed with no obvious error. In most cases, restarting the run with an NRUN seems to work, so I'm doing that. However, in at least one suite (u-at673) that hasn't worked. In that suite I changed the cycling frequency mid-run via a reload, so something may have gone wrong then, but that was more than 20 model years earlier.
I have tried switching on extra diagnostics and saving output from all processors (under um→env→Runtime Controls→Atmosphere only), but that doesn't seem to work: I still only get output from pe0. Any idea why that wouldn't work?
Many thanks,
Andrea
Change History (7)
comment:1 Changed 3 years ago by ajd
comment:2 Changed 3 years ago by grenville
Just set the failed task to succeeded. Do you need to keep validating the suite?
Grenville
comment:3 Changed 3 years ago by ajd
Thanks, Grenville. I was just wondering why something that worked before no longer does, but setting the task to succeeded works.
Do you know why the switch to save output from all processors doesn't work though?
Thanks, Andrea
comment:4 Changed 3 years ago by grenville
Andrea
The suite validation retrieves data from an external source, so it depends on the network and on that source being available.
I'm still looking into the PE output.
Grenville
comment:5 Changed 3 years ago by ajd
Makes sense, thanks!
comment:6 Changed 3 years ago by grenville
Hi Andrea
The error is with NEMO (see /home/d02/andit/cylc-run/u-at673/work/19131001T0000Z/coupled/ocean.output).
===>>> : E R R O R
===========
stpctl: the zonal velocity is larger than 20 m/s
======
kt= 22180 max abs(U): 4.6520E+05, i j k: 118 289 10
output of last fields in numwso
===>>> : E R R O R
===========
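The check that stops the run can be illustrated as below. This is a minimal Python sketch, not NEMO's Fortran `stpctl` routine: it assumes only what the log shows, namely a 20 m/s limit on the zonal velocity and a `kt= ... max abs(U): ..., i j k: ...` report with 1-based grid indices. `parse_stpctl_line` extracts the time step, the maximum |U| and its grid point from that line, and the hypothetical helper `max_abs_u` shows the kind of search that produces those numbers, here on a small nested-list field indexed `u[k][j][i]`.

```python
import re

# Limit quoted in the stpctl message (m/s).
ZONAL_VELOCITY_LIMIT = 20.0

# Matches the diagnostic line printed after the stpctl message, e.g.
# "kt= 22180 max abs(U): 4.6520E+05, i j k: 118 289 10"
KT_LINE = re.compile(
    r"kt=\s*(\d+)\s+max abs\(U\):\s*([-+0-9.Ee]+),\s*i j k:\s*(\d+)\s+(\d+)\s+(\d+)"
)


def parse_stpctl_line(line):
    """Return (kt, max_abs_u, (i, j, k)), or None if the line does not match."""
    m = KT_LINE.search(line)
    if m is None:
        return None
    kt = int(m.group(1))
    umax = float(m.group(2))
    ijk = tuple(int(g) for g in m.group(3, 4, 5))
    return kt, umax, ijk


def max_abs_u(u):
    """Hypothetical helper: max |u| and its 1-based (i, j, k) in a u[k][j][i] field.

    Only illustrates the kind of search stpctl performs; the real NEMO code
    works on its own (domain-decomposed) arrays.
    """
    best, where = -1.0, None
    for k, plane in enumerate(u, start=1):
        for j, row in enumerate(plane, start=1):
            for i, value in enumerate(row, start=1):
                if abs(value) > best:
                    best, where = abs(value), (i, j, k)
    return best, where


if __name__ == "__main__":
    line = "kt= 22180 max abs(U): 4.6520E+05, i j k: 118 289 10"
    kt, umax, ijk = parse_stpctl_line(line)
    # prints: 22180 True (118, 289, 10)
    print(kt, umax > ZONAL_VELOCITY_LIMIT, ijk)
```

A maximum of about 4.65 × 10^5 m/s against a 20 m/s limit suggests the ocean state had already blown up numerically at that point, rather than a current only slightly exceeding the threshold.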
There is little documentation on why this happens, other than a note that it is not sensitive to the compiler or the time step.
The developers of your model may know more - are you in contact with them?
Grenville
comment:7 Changed 3 years ago by willie
- Resolution set to fixed
- Status changed from new to closed
Hello again,
Further to this, the suites I am trying to restart now fail in the validate_suite_info task, before the model runs:
Exception: Retrieval of controlled vocabulary failed. Please check carefully that the section ("experiment_id") and revision ("3.2.5") are both valid.
command: curl -s -f https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/3.2.5/CMIP6_experiment_id.json
return code: 22
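Because validate_suite_info fetches this file over the network, one way to separate a network or upstream problem from a suite misconfiguration is to repeat the request outside the suite. With `-f`, curl exits with code 22 when the server replies with an HTTP error (status 400 or above), so this return code points at the server's response rather than at the model setup. The sketch below is a minimal, standard-library-only check under that assumption: it rebuilds the URL from the section and revision in the error message, assuming the `CMIP6_<section>.json` pattern seen in the command above, and reports either the HTTP status or the network error. `cv_url` and `check_cv` are hypothetical helper names, not part of the suite.

```python
import urllib.error
import urllib.request

# Base URL copied from the failing curl command in the error message.
BASE = "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs"


def cv_url(section, revision):
    """Build the controlled-vocabulary URL for a section and revision (tag)."""
    return f"{BASE}/{revision}/CMIP6_{section}.json"


def check_cv(section, revision, timeout=30):
    """Fetch the CV file and return (ok, detail) instead of raising.

    HTTPError is the case where `curl -f` would exit with code 22 (server
    answered with an error status); other OSErrors (URLError, timeouts,
    connection resets) mean no usable HTTP reply was received.
    """
    url = cv_url(section, revision)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}, {len(resp.read())} bytes"
    except urllib.error.HTTPError as err:  # must precede OSError (subclass)
        return False, f"HTTP error {err.code} for {url}"
    except OSError as err:
        return False, f"network error for {url}: {err}"


if __name__ == "__main__":
    print(check_cv("experiment_id", "3.2.5"))
```

If this succeeds from the same machine while the task still fails, the problem is more likely local to the task environment (for example proxy or network access from the node running it); if it fails with an HTTP error, the source itself is the issue.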
I don't understand why this is happening now (it never has before with the same setup), or why I don't see any useful error messages for the failed coupled tasks (before the restarts).
Have you seen this before?
Many thanks,
Andrea