#1743 closed help (answered)

Persistent seg fault with MetUM-GOML

Reported by: l.j.wilcox Owned by: um_support
Priority: normal Component: UM Model
Keywords: KPP GOML Cc: swr05npk
Platform: ARCHER UM Version: 7.8



I'm running GOML on ARCHER, and my simulation has started to fail. I've tried going back to a point in the simulation before the failure, reconfiguring, and running again, but the model keeps reaching the same point and failing. I'm not sure what else to try, or what the error means, other than that it looks quite bad:

_pmiu_daemon(SIGCHLD): [NID 02954] [c7-1c1s2n2] [Wed Nov 25 19:22:36 2015] PE RANK 7 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 02956] [c7-1c1s3n0] [Wed Nov 25 19:22:36 2015] PE RANK 25 exit signal Segmentation fault
[NID 02954] 2015-11-25 19:22:36 Apid 18833072: initiated application termination

Any suggestions you might have for how to proceed would be much appreciated!

The jobid is xkukj.
The most recent .leave file, and the .leave for the associated reconfiguration are:

Thanks for your help,

Change History (5)

comment:1 Changed 17 months ago by grenville


Please switch on ATP: go to model selection → input/output… → script insert… and add ATP_ENABLED, set to 1, in the table.

Re-run - this should give us some clues.
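For reference, the UMUI script insert amounts to exporting one environment variable into the job before the model executable launches. A minimal shell sketch (assuming the insert is executed in the job's shell environment; the exact mechanism is handled by the UMUI):

```shell
# Sketch: enable Cray ATP (Abnormal Termination Processing) so that a
# crashing rank produces a stack walkback instead of a bare SIGSEGV.
export ATP_ENABLED=1
echo "ATP_ENABLED=${ATP_ENABLED}"
```

With ATP enabled, a subsequent segfault should print an "ATP Stack walkback" section in the .leave file, which is what makes the failure diagnosable.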

There appears to be a big problem with xkuki too - you could try the same in that job.


comment:2 Changed 17 months ago by l.j.wilcox

Hi Grenville,

I've rerun xkukj and xkuki. The new .leave files are:

There's definitely more information in there now, but I'm not sure how to interpret it.


comment:3 Changed 17 months ago by grenville

  • Cc swr05npk added
  • Keywords KPP GOML added


It looks like two different problems.


Dan has answered (added here for our records)

Looking at /work/n02/n02/laura/um/xkuki/dataw (using ~lrdlrh/bin/kpp_check.pl), I noticed that you are missing 1000m_temps.nc - could this be the issue? The .leave file does say "No such file or directory" just before the crash, which suggests it might indeed be the missing 1000m_temps.nc file.

Hi Dan,

You're right, it is missing!

And it's in my directory here with all the output files… Time to write
a new fetch script!

Awesome - thanks Dan. Mind if I pinch your script for the next time I
accidentally delete something?


The problem here is different - the .leave file says:

ATP Stack walkback for Rank 7 starting:


bi_linear_h is a trap for lots of errors and is indicative of a model failure - your model has blown up at timestep 1689:

Atm_Step: Timestep 1689

initial Absolute Norm : 7088.1740880339757
GCR( 2 ) failed to converge in 100 iterations.
Final Absolute Norm : 0.71472370097877336

Atm_Step: Timestep 1690

initial Absolute Norm : 4834.8946001526556
GCR( 2 ) converged in 7 iterations.
Final Absolute Norm : NaN

and now it's got NaNs?
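Spotting where the solver first went NaN can be done directly from the .leave file. A small sketch (the filename model.leave is illustrative) that tracks the current timestep and reports the first NaN norm:

```shell
# Sketch: scan a .leave file for the first timestep whose GCR solver
# norm became NaN. Remembers the most recent "Atm_Step: Timestep" line,
# then prints it when a NaN norm appears.
awk '/Atm_Step: Timestep/ { ts = $NF }
     /Norm/ && /NaN/      { print "first NaN at timestep " ts; exit }' model.leave
```

For the solver output quoted above, this would report timestep 1690, so the dumps worth inspecting are the ones just before it.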

I suggest telling it to output dumps at timesteps 1685, 1686, 1687, and 1688, and having a look at some prognostics.

I'm not sure how to do that in your coupled model, though.


comment:4 Changed 17 months ago by grenville


This may simply be a case of switching off climate meaning, checking the irregular dump times button, and, in the dumping and meaning panel, adding the list of timesteps at which you want dumps.


comment:5 Changed 15 months ago by ros

  • Resolution set to answered
  • Status changed from new to closed

Closed due to lack of activity.

Note: See TracTickets for help on using tickets.