#2325 closed help (fixed)

HadGEM2-AMIP crashing

Reported by: peterh Owned by: um_support
Component: UM Model Keywords:
Cc: Platform: ARCHER
UM Version: 6.6.3

Description

Dear cms-support,

I have a set of AMIP HadGEM2 runs that are crashing after 2-4 months of run time with a segmentation fault. I cannot find the cause of the problem. I have tried continuing from the last complete model dump, changing the dump frequency and changing to alternative SST/ice ancils.

I noticed on this page: http://cms.ncas.ac.uk/wiki/Archer/cce8.3.7 (which is currently down) the following note related to the choice of compiler:

6.6.3 HadGEM2-AMIP xlnic xjgcc

runs OK built with -g; blows up somewhere with 8.2.1 settings. large_scale_cloud::ls_arcld is the culprit; build this with -O0 and the model runs OK

Should I be setting the -O0 option for large_scale_cloud::ls_arcld? If so, can you let me know how I do this?

Many thanks,
Peter


Change History (21)

comment:1 Changed 21 months ago by grenville

Started this one through email - here's the gist

It's worth adding ATP_ENABLED 1 (in Script Inserts and Modifications) to get a stack trace when the model fails, and switching on increased logging (Output Choices → Extra diagnostic ...).
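For example, a minimal sketch of the script insert (assuming a bash job script on ARCHER; the exact UMUI panel wording may differ):

# enable Cray Abnormal Termination Processing so a stack trace is written when the model segfaults
export ATP_ENABLED=1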

Quick and dirty approach: you could make a change on ARCHER in /work/n02/n02/ggpoh/xnteb/ummodel/cfg/bld.cfg by adding the line

tool::fflags::UM::atmosphere::large_scale_cloud::ls_arcld -e m -h noomp -s real64 -s integer64 -O1,fp1 -hflex_mp=intolerant -I/work/n02/n02/hum/gcom/cce/gcom3.8/archer_cce_mpp/inc -I/work/n02/n02/hum/gcom/cce/gcom3.8/archer_cce_mpp/obj -O0 -Ofp0

Then type fcm build.

This will build a new exe in /work/n02/n02/ggpoh/xnteb/ummodel/bin, so make sure you point to this in your UMUI job, or move it.

The final -O0 -Ofp0 takes precedence.
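Putting it together, the quick-and-dirty rebuild looks roughly like this (a sketch only, using the xnteb extract directory above; the line to add to bld.cfg is the one given earlier):

cd /work/n02/n02/ggpoh/xnteb/ummodel
# append -O0 -Ofp0 to the ls_arcld fflags entry in cfg/bld.cfg, then rebuild
fcm build
# the new executable lands in bin/ - point the UMUI job at it, or copy it into place
ls -l /work/n02/n02/ggpoh/xnteb/ummodel/bin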

comment:2 Changed 21 months ago by peterh

I have now spent far too long on this!

Increasing the max number of iterations in the dynamics solver section didn't help.

Reducing the timestep to 15 minutes appeared to help, but this slows down the model and wasn't required last time I ran HadGEM2-A.

Instead I tried changing both tool::fflags from -O1 to -O0 and tool::cflags from -O3 to -O2 or -O1 in ummodel/cfg/bld.cfg on ARCHER, but neither has worked.
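For the record, the edits were of this form (a rough sketch only; the real bld.cfg entries carry the full flag lists like the one shown in comment:1):

# Fortran flags: optimisation lowered from -O1 to -O0 (other flags unchanged)
tool::fflags   <existing flags> -O0
# C flags: optimisation lowered from -O3 to -O2 (or -O1)
tool::cflags   <existing flags> -O2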

In the end I traced this to some changes in the STASH, but I haven't found the exact cause of the error. The only sign of anything wrong is that the surface-level u-wind speed in the .da files shows unexpected rectangular shapes in the plot.

For future reference the failing jobs are xntea and xnteb. Jobs xnten and xnteq are working fine.

I think we can close the ticket now. Thanks for your help.

comment:3 Changed 21 months ago by peterh

I'm still having problems with HadGEM2-A runs.

Could you take a look at xntet compared to xnteq?

xntet is crashing, but the only difference from xnteq is that it has one fcm update included: https://puma.nerc.ac.uk/trac/UM/changeset?reponame=&new=22397%40UM%2Fbranches%2Fdev%2Fpeterh%2Fhg6.6.3_dust_divs_1_3_only&old=2580%40UM%2Fbranches%2Fpkg%2FConfig%2FHadGEM2-ES

This only adds a few commented lines to one subroutine (dust_srce.F90).

The output is showing blocky fields in the surface-level winds in the .da file and is failing to converge in 50 iterations before crashing.

I have no idea why this is happening. I've also found that adding almost any other fcm update causes similar problems resulting in seg faults.

Many thanks.

comment:4 Changed 21 months ago by grenville

Peter

Where are the leave files for xntet?

Grenville

comment:5 Changed 21 months ago by grenville

Never mind - found them

comment:6 Changed 21 months ago by grenville

Peter

I can't get your configuration to run even with fcm:um_br/dev/peterh/hg6.6.3_dust_divs_1_3_only/src switched off.

I can only suggest that the problem lies in one of the other branches - without some domain knowledge, I can't decide which to include or not.

Have you included new ancillary data in these runs?

Grenville

comment:7 Changed 21 months ago by peterh

OK, I've only changed SSTs/sea-ice, but in xntex I changed these back to the AMIP ones distributed with HadGEM2-ES on puma/ARCHER. This results in the same error.

If I'm looking at the correct job, you're getting the same error as me. It's running for a bit, then failing to converge and crashing in bi_linear_h.

I'll try re-compiling with each of my two other fcm updates removed one at a time, and let you know how these jobs go.
Thanks again.

Last edited 21 months ago by peterh

comment:8 Changed 21 months ago by peterh

The latest job is xntez. In this I've removed all of my own fcm updates except the last one (which doesn't change the code).

The model uses the distributed AMIP SSTs and sea-ice, and is otherwise identical to atmosphere-only jobs I successfully ran about 1 year ago and before.

It crashes and shows odd-looking blocky fields in the surface-level u-winds in the .da files.

Do you have any idea what is going wrong?
Thanks

comment:9 Changed 21 months ago by grenville

Peter

What is the id of the successful counterpart — are there leave files to look at?

Grenville

comment:10 Changed 21 months ago by peterh

The successful job is xnteq, and the leave file is ~ggpoh/um/umui_out/xnteq000.xnteq.d17334.t152540.leave on ARCHER.

Thanks.

comment:11 Changed 21 months ago by peterh

Hi, I've rebuilt my job from another one that worked (for a different time period). Now I'm not getting any of these strange crashes or errors in the wind fields, so please don't worry about looking into this any further.

When I get to the bottom of the problems with these jobs, I'll post it here, in case it's useful for anyone else.

Thanks again.

comment:12 Changed 21 months ago by grenville

This is bizarre - what is the id for your latest job?

comment:13 Changed 21 months ago by peterh

My new working job is xnuzg.

However, this still has problems when I add 2 fcm updates (xnuzh and xnuzj), again with blocky u-wind fields in the .da files. Perhaps the blocky fields are the model capping the wind speeds?

I thought about trying to run these crashing jobs with a halved timestep for a year to stabilise them, and then continuing with the normal timestep.

comment:14 Changed 21 months ago by grenville

Peter

We think there is some bad memory management in here - the blocks match processor boundaries, which might indicate indexing problems. Odd that we've not seen this before, but you may be the only user of the AMIP run (since HECToR).

Grenville

comment:15 Changed 21 months ago by peterh

Might this explain why I didn't have any problems with a different time period (and hence land sea mask)?

If memory management is the problem, what would the solution be?

comment:16 Changed 21 months ago by grenville

Peter

Simon pointed out that your successful job ran with an executable built with <2A> Interactive vegetation distribution. I built with this setting and the model did not exhibit the corrupted u-wind. Please try that; I don't know your reasons for switching vegetation schemes, though.

Grenville

comment:17 Changed 21 months ago by peterh

I have tried activating <2A> Interactive vegetation and still get problems (xnuzw). I've also tried 1 node instead of 4 (xnuzv), and get a different pattern in the u-wind, but it is still blocky.

I thought I could now try the coupled model, as it was run before with some of the settings I'm aiming for and worked fine.

comment:18 Changed 21 months ago by grenville

You need to rebuild - did you do that?

comment:19 Changed 21 months ago by peterh

I didn't realise that a rebuild was required.

I'm trying this now. A first test shows the u-wind fields looking ok and the 'failed to converge' errors no longer occurring. I will post back when I've tried with the different fcm updates that I need to include.

Thanks!

comment:20 Changed 21 months ago by peterh

All of these jobs are working fine now. Instead of de-activating the interactive vegetation, I recompiled and included the hand edit triff_never.sh.

Thanks again for your help!

comment:21 Changed 21 months ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed