#1743 closed help (answered)

Persistent seg fault with MetUM-GOML

Reported by: l.j.wilcox Owned by: um_support
Priority: normal Component: UM Model
Keywords: KPP GOML Cc: swr05npk
Platform: ARCHER UM Version: 7.8

Description

Hi CMS,

I'm running GOML on Archer, and my simulation has started to fail. I've tried going back to a point in the simulation before the failure, reconfiguring, and running again, but the model keeps reaching the same point and failing. I'm not sure what else to try, or what the error means, other than that it looks quite bad:

_pmiu_daemon(SIGCHLD): [NID 02954] [c7-1c1s2n2] [Wed Nov 25 19:22:36 2015] PE RANK 7 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 02956] [c7-1c1s3n0] [Wed Nov 25 19:22:36 2015] PE RANK 25 exit signal Segmentation fault
[NID 02954] 2015-11-25 19:22:36 Apid 18833072: initiated application termination

Any suggestions you might have for how to proceed would be much appreciated!

The jobid is xkukj.
The most recent .leave file, and the .leave for the associated reconfiguration are:
/home/n02/n02/laura/um/umui_out/xkukj000.xkukj.d15329.t103034.leave
/home/n02/n02/laura/um/umui_out/xkukj000.xkukj.d15329.t103034.rcf.leave

Thanks for your help,
Laura

Change History (5)

comment:1 Changed 18 months ago by grenville

Laura

Please switch on ATP - go to model selection → input/output… → script insert… and set

ATP_ENABLED to 1 in the table.

Rerun - this should give us some clues.
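For anyone finding this ticket later: the script insert amounts to exporting one environment variable before the model launches. A minimal sketch (assuming ATP here is the standard Cray Abnormal Termination Processing tool on ARCHER):

```shell
# Enable Cray ATP so that a crashing MPI rank emits a stack walkback
# into the job's .leave file instead of a bare "Segmentation fault".
export ATP_ENABLED=1
```

With this set, a segfault is followed by an "ATP Stack walkback for Rank N" section, which is what the later comments rely on.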

There appears to be a big problem with xkuki too - you could try the same in that job.

Grenville

comment:2 Changed 18 months ago by l.j.wilcox

Hi Grenville,

I've rerun xkukj and xkuki. The new .leave files are:
/home/n02/n02/laura/um/umui_out/xkukj000.xkukj.d15330.t165616.leave
/home/n02/n02/laura/um/umui_out/xkuki000.xkuki.d15330.t165817.leave

There's definitely more information in there now, but I'm not sure how to interpret it.

Thanks,
Laura

comment:3 Changed 18 months ago by grenville

  • Cc swr05npk added
  • Keywords KPP GOML added

Laura

It looks like two different problems:

xkuki

Dan has answered (added here for our records)

Looking at /work/n02/n02/laura/um/xkuki/dataw, I noticed (using ~lrdlrh/bin/kpp_check.pl) that you are missing 1000m_temps.nc - could this be the issue? The .leave file does say "No such file or directory" before the crash, and it also contains

read_bottom_temp_@…:1859

which suggests it might indeed be the 1000m_temps.nc file.

Hi Dan,

You're right, it is missing!

And it's in my directory here with all the output files… Time to write
a new fetch script!

Awesome - thanks Dan. Mind if I pinch your script for the next time I
accidentally delete something?
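For the record, the spirit of a check like kpp_check.pl can be sketched as follows. This is a hypothetical illustration, not the actual script; the function name and the file names passed to it are made up for the example:

```shell
# Report whether a required ancillary file is present in a run directory,
# so a missing input is caught before the model segfaults at read time.
check_ancil() {
  dir=$1
  file=$2
  if [ -f "$dir/$file" ]; then
    echo "present: $file"
  else
    echo "MISSING: $file"
  fi
}
```

Running it over the list of ancillaries a job expects (e.g. `check_ancil /work/n02/n02/laura/um/xkuki/dataw 1000m_temps.nc`) would have flagged the missing file directly.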

xkukj

The problem here is different - the leave file says

ATP Stack walkback for Rank 7 starting:

bi_linear_h_@…:585

bi_linear_h is a trap for lots of errors and is indicative of a model failure - your model has blown up at timestep 1689:

Atm_Step: Timestep 1689

==============================================
initial Absolute Norm : 7088.1740880339757
GCR( 2 ) failed to converge in 100 iterations.
Final Absolute Norm : 0.71472370097877336
==============================================

Atm_Step: Timestep 1690

==============================================
initial Absolute Norm : 4834.8946001526556
GCR( 2 ) converged in 7 iterations.
Final Absolute Norm : NaN

and now it's got NaNs.
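A quick way to spot where a run goes bad is to pull the timestep markers and GCR solver diagnostics out of the .leave file, so the first non-convergence or NaN norm stands out. A hedged sketch (the grep patterns match the messages quoted above; the function name is made up):

```shell
# Print only the timestep markers and GCR solver lines from a UM .leave
# file, so the first failed convergence or NaN norm is easy to locate.
solver_history() {
  grep -E 'Atm_Step: Timestep|failed to converge|Final Absolute Norm' "$1"
}
```

For example, `solver_history xkukj000.xkukj.d15330.t165616.leave` would show the norms degrading from timestep 1689 onwards.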

I suggest telling it to output dumps at timesteps 1685, 1686, 1687 and 1688, and having a look at some prognostics.

I'm not sure how to do that in your coupled model, though.

Grenville

comment:4 Changed 18 months ago by grenville

Laura

This may simply be a case of switching off climate meaning, checking the irregular dump times button, and, in the dumping and meaning panel, adding the list of timesteps at which you want dumps.

Grenville

comment:5 Changed 16 months ago by ros

  • Resolution set to answered
  • Status changed from new to closed

Closed due to lack of activity.
