Opened 4 years ago
Closed 4 years ago
#1743 closed help (answered)
Persistent seg fault with MetUM-GOML
| Reported by: | l.j.wilcox | Owned by: | um_support |
|---|---|---|---|
| Component: | UM Model | Keywords: | KPP GOML |
| Cc: | swr05npk | Platform: | ARCHER |
| UM Version: | 7.8 | | |
Description
Hi CMS,
I'm running GOML on Archer, and my simulation has started to fail. I've tried going back to a point in the simulation before the failure, reconfiguring, and running again, but the model keeps reaching the same point and failing. I'm not sure what else to try, or what the error means, other than it looks quite bad:
_pmiu_daemon(SIGCHLD): [NID 02954] [c7-1c1s2n2] [Wed Nov 25 19:22:36 2015] PE RANK 7 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 02956] [c7-1c1s3n0] [Wed Nov 25 19:22:36 2015] PE RANK 25 exit signal Segmentation fault
[NID 02954] 2015-11-25 19:22:36 Apid 18833072: initiated application termination
Any suggestions you might have for how to proceed would be much appreciated!
The jobid is xkukj.
The most recent .leave file and the .leave file for the associated reconfiguration are:
/home/n02/n02/laura/um/umui_out/xkukj000.xkukj.d15329.t103034.leave
/home/n02/n02/laura/um/umui_out/xkukj000.xkukj.d15329.t103034.rcf.leave
Thanks for your help,
Laura
Change History (5)
comment:1 Changed 4 years ago by grenville
comment:2 Changed 4 years ago by l.j.wilcox
Hi Grenville,
I've rerun xkukj and xkuki. The new .leave files are:
/home/n02/n02/laura/um/umui_out/xkukj000.xkukj.d15330.t165616.leave
/home/n02/n02/laura/um/umui_out/xkuki000.xkuki.d15330.t165817.leave
There's definitely more information in there now, but I'm not sure how to interpret it.
Thanks,
Laura
comment:3 Changed 4 years ago by grenville
- Cc swr05npk added
- Keywords KPP GOML added
Laura
It looks like two different problems:
xkuki
Dan has answered (added here for our records)
Looking at /work/n02/n02/laura/um/xkuki/dataw , I noticed
(using ~lrdlrh/bin/kpp_check.pl) that you are missing
1000m_temps.nc - could this be the issue? The .leave file
does say "No such file or directory" just before the crash, and
read_bottom_temp_@…:1859
appears in the .leave file - which suggests it might indeed be the missing 1000m_temps.nc file.
Hi Dan,
You're right, it is missing!
And it's in my directory here with all the output files… Time to write
a new fetch script!
Awesome - thanks Dan. Mind if I pinch your script for the next time I
accidentally delete something?
xkukj
The problem here is different - the leave file says
ATP Stack walkback for Rank 7 starting:
bi_linear_h_@…:585
bi_linear_h is a trap for lots of errors and is indicative of a model failure - your model has blown up at timestep 1689:
Atm_Step: Timestep 1689
==============================================
initial Absolute Norm : 7088.1740880339757
GCR( 2 ) failed to converge in 100 iterations.
Final Absolute Norm : 0.71472370097877336
==============================================
Atm_Step: Timestep 1690
==============================================
initial Absolute Norm : 4834.8946001526556
GCR( 2 ) converged in 7 iterations.
Final Absolute Norm : NaN
and now it's got NaNs.
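(A quick, hedged way to find where a run first goes bad is to scan the .leave file for the solver messages quoted above. The line patterns below are taken from this excerpt; real .leave files may format them differently.)

```python
import re

# Sketch: scan .leave output lines for the first timestep whose GCR solve
# either fails to converge or reports a NaN norm, as in the excerpt above.
def first_bad_timestep(lines):
    """Return the timestep at which the solver first misbehaves, else None."""
    step = None
    for line in lines:
        m = re.search(r"Atm_Step:\s*Timestep\s+(\d+)", line)
        if m:
            step = int(m.group(1))
        elif "failed to converge" in line or "Norm : NaN" in line:
            return step
    return None
```

On the excerpt above this reports timestep 1689, which is where the dumps suggested below would be worth inspecting.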
I suggest telling it to output dumps at time steps 1685, 1686, 1687, and 1688, and having a look at some prognostics.
I'm not sure how to do that in your coupled model, though.
Grenville
comment:4 Changed 4 years ago by grenville
Laura
This may simply be a case of switching off climate meaning, ticking the irregular dump times button, and then, in the dumping and meaning panel, adding the list of time steps at which you want dumps.
Grenville
comment:5 Changed 4 years ago by ros
- Resolution set to answered
- Status changed from new to closed
Closed due to lack of activity.
Laura
Please switch on ATP: go to model selection→input/output…→script insert… and set
ATP_ENABLED to 1 in the table.
Re-run - this should give us some clues.
There appears to be a big problem with xkuki too - you could try the same in that job.
Grenville