Opened 4 years ago

Closed 4 years ago

#2325 closed help (fixed)

HadGEM2-AMIP crashing

Reported by: peterh Owned by: um_support
Component: UM Model Keywords:
Cc: Platform: ARCHER
UM Version: 6.6.3


Dear cms-support,

I have a set of AMIP HadGEM2 runs that are crashing after 2-4 months of run time with a segmentation fault. I cannot find the cause of the problem. I have tried continuing from the last complete model dump, changing the dump frequency and changing to alternative SST/ice ancils.

I noticed on this page (which is currently down) the following, related to the choice of compiler:

6.6.3 HadGEM2-AMIP xlnic xjgcc

runs OK built with -g; blows up somewhere with 8.2.1 settings. large_scale_cloud::ls_arcld is the culprit; build this with -O0 and the model runs OK

Should I be setting the -O0 option for large_scale_cloud::ls_arcld? If so, can you let me know how I do this?

Many thanks,

Change History (21)

comment:1 Changed 4 years ago by grenville

Started this one through email; here's the gist:

It's worth adding ATP_ENABLED=1 (in script inserts and modifications) to get a stack trace when the model fails, and switching on increased logging (Output Choices → Extra diagnostic..).
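In a job script this amounts to exporting the variable before the model launches (a minimal sketch; ATP is Cray's Abnormal Termination Processing, and the commented aprun line is illustrative only, not taken from your job):

```shell
# Enable Cray ATP so that a crashing run leaves a merged stack trace
# instead of dying silently with just a segfault message.
export ATP_ENABLED=1
echo "ATP_ENABLED=${ATP_ENABLED}"

# aprun -n 96 $UM_EXE   # illustrative launch line; ATP now traps the crash
```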

Quick and dirty approach: on ARCHER, edit /work/n02/n02/ggpoh/xnteb/ummodel/cfg/bld.cfg and add

tool::fflags::UM::atmosphere::large_scale_cloud::ls_arcld -e m -h noomp -s real64 -s integer64 -O1,fp1 -hflex_mp=intolerant -I/work/n02/n02/hum/gcom/cce/gcom3.8/archer_cce_mpp/inc -I/work/n02/n02/hum/gcom/cce/gcom3.8/archer_cce_mpp/obj -O0 -Ofp0

Then run fcm build.

This will build a new exe in /work/n02/n02/ggpoh/xnteb/ummodel/bin — so make sure you point to this in your umui job, or move it.

The final -O0 -Ofp0 takes precedence.

comment:2 Changed 4 years ago by peterh

I have now spent far too long on this!

Increasing the maximum number of iterations in the dynamics solver section didn't help.

Reducing the timestep to 15 minutes appeared to help, but this slows down the model
and wasn't required last time I ran HadGEM2-A.

Instead I tried changing tool::fflags from -O1 to -O0, and tool::cflags from -O3 to -O2 or -O1, in ummodel/cfg/bld.cfg on ARCHER, but neither worked.

In the end I traced this to some changes in the STASH, but I haven't found the exact cause of the error. The only sign of anything wrong is that the surface-level u-wind speed in the .da files shows unexpected rectangular shapes when plotted.

For future reference the failing jobs are xntea and xnteb. Jobs xnten and xnteq are working fine.

I think we can close the ticket now. Thanks for your help.

comment:3 Changed 4 years ago by peterh

I'm still having problems with HadGEM2-A runs.

Could you take a look at xntet compared to xnteq?

xntet is crashing, but the only difference from xnteq is that it has one fcm update included:

This only adds a few commented lines to one subroutine (dust_srce.F90).

The output shows blocky fields in the surface-level winds in the .da file, and the run fails to converge in 50 iterations before crashing.

I have no idea why this is happening. I've also found that adding almost any other fcm update causes similar problems resulting in seg faults.

Many thanks.

comment:4 Changed 4 years ago by grenville


Where are the leave files for xntet?


comment:5 Changed 4 years ago by grenville

Never mind - found them

comment:6 Changed 4 years ago by grenville


I can't get your configuration to run even with fcm:um_br/dev/peterh/hg6.6.3_dust_divs_1_3_only/src switched off.

I can only suggest that the problem lies in one of the other branches; without some domain knowledge, I can't decide which to include or not.

Have you included new ancillary data in these runs?


comment:7 Changed 4 years ago by peterh

OK, I've only changed SSTs/sea-ice, but in xntex I changed these back to the AMIP ones distributed with HadGEM2-ES on puma/ARCHER. This results in the same error.

If I'm looking at the correct job, you're getting the same error as me. It's running for a bit, then failing to converge and crashing in bi_linear_h.

I'll try re-compiling with each of my 2 other fcm updates removed one at a time, and let you know how these jobs go.
Thanks again.

Last edited 4 years ago by peterh

comment:8 Changed 4 years ago by peterh

The latest job is xntez. In this I've removed all of my own fcm updates except the last one (which doesn't change the code).

The model uses the distributed AMIP SSTs and sea-ice, and is otherwise identical to atmosphere-only jobs I successfully ran about 1 year ago and before.

It crashes and shows odd-looking blocky fields in the surface-level u-winds in the .da files.

Do you have any idea what is going wrong?

comment:9 Changed 4 years ago by grenville


What is the id of the successful counterpart — are there leave files to look at?


comment:10 Changed 4 years ago by peterh

The successful job is xnteq, and the leave file is ~ggpoh/um/umui_out/xnteq000.xnteq.d17334.t152540.leave on ARCHER.


comment:11 Changed 4 years ago by peterh

Hi, I've rebuilt my job from another one that worked (for a different time period). Now I'm not getting any of these strange crashes or errors in the wind fields, so please don't worry about looking into this any further.

When I get to the bottom of the problems with these jobs, I'll post the cause here in case it's useful for anyone else.

Thanks again.

comment:12 Changed 4 years ago by grenville

This is bizarre. What is the id for your latest job?

comment:13 Changed 4 years ago by peterh

My new working job is xnuzg.

However, this still has problems when I add 2 fcm updates (xnuzh and xnuzj), again with blocky u-wind fields in the .da files. Perhaps the blocky fields are the model capping the wind speeds?

I thought about trying to run these crashing jobs with a halved timestep for a year to stabilise them, and then continuing with the normal timestep?

comment:14 Changed 4 years ago by grenville


We think there is some bad memory management in here: the blocks match processor boundaries, which might indicate indexing problems. Odd that we've not seen this before, but you may be the only user of the AMIP run (since HECToR).


comment:15 Changed 4 years ago by peterh

Might this explain why I didn't have any problems with a different time period (and hence land-sea mask)?

If memory management is the problem, what would the solution be?

comment:16 Changed 4 years ago by grenville


Simon pointed out that your successful job ran with an executable built with <2A> Interactive vegetation distribution. I built with this setting and the model did not exhibit the corrupted u-wind. Please try that; I don't know your reasons for switching vegetation schemes, though?


comment:17 Changed 4 years ago by peterh

I have tried activating <2A> Interactive vegetation and still get problems (xnuzw). I've also tried 1 node instead of 4 (xnuzv), which gives a different pattern in the u-wind, but it is still blocky.

I thought now I could try the coupled model, as this was run before with some of the settings I'm aiming for and worked fine.

comment:18 Changed 4 years ago by grenville

You need to rebuild - did you do that?

comment:19 Changed 4 years ago by peterh

I didn't realise that a rebuild was required.

I'm trying this now. A first test shows the u-wind fields looking OK and the 'failed to converge' errors no longer occurring. I will post back when I've tried with the different fcm updates that I need to include.


comment:20 Changed 4 years ago by peterh

All of these jobs are working fine now. Instead of de-activating the interactive vegetation, I recompiled and included the hand edit

Thanks again for your help!

comment:21 Changed 4 years ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed