Opened 9 years ago

Closed 9 years ago

#669 closed help (fixed)

Aquaplanet HadGSM1 crash - removing final land points

Reported by: sws06djb Owned by: um_support
Component: UM Model Keywords:
Cc: d.j.brayshaw@… Platform:
UM Version: 6.1

Description

Hello,

I'm currently setting up some idealised slab-ocean experiments in HadGAM1. As part of this I wish to remove all land-points from the model, leaving a pure aquaplanet.

I currently have a configuration working that has a single row of land-points at the south pole (xghwa) but when I remove this last row (job xghwb) the model just stops - presumably crashes - without any obvious explanation of the problem. Both these runs are "calibration" runs for the slab (rather than "control" runs). As far as I can tell, the model ran a few timesteps before crashing (the output pp files are larger than just their headers).

Removing this last row of points is quite important for the study I am planning as it messes up the energy transport/balance calculations I need to perform. Please can you suggest any ways to fix or debug this problem - the contents of the umui_out file are quite uninformative (to me at least).

Thanks,

David

Change History (18)

comment:1 Changed 9 years ago by willie

Hi David,

The difference between jobs xghwa and xghwb is that the latter has zero land points (instead of 192) and you have changed some of the ancillary files. You have not changed the start dump. Looking at the snow amount field, there appear to be 192 points of snow in the Antarctic region.

Things you can do to help with debugging:

  • In sub model independent > output choices, turn on the subroutine and timer IO diagnostics and select "extra diagnostics messages".
  • In reconfiguration > reconfiguration output, select extra diagnostic messages.
  • In compilation and modifications > compile options for the model, select the debug mode.

You can also try to run the debugger on the core file:

gdb xghwb.recon core

if the reconfiguration is suspect.
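
For a traceback without an interactive session, gdb can also be run in batch mode. The following is only a minimal sketch, assuming gdb is on the PATH; the two helper functions are hypothetical names introduced here, and the executable and core file names are the ones from the command above.

```python
import subprocess


def gdb_backtrace_command(executable, core_file):
    """Build a gdb command line that prints a backtrace and then exits."""
    # -batch makes gdb exit once the commands given with -ex have run.
    return ["gdb", "-batch", "-ex", "bt", executable, core_file]


def print_backtrace(executable, core_file):
    """Run gdb on a core file and return its text output (needs gdb installed)."""
    result = subprocess.run(
        gdb_backtrace_command(executable, core_file),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout + result.stderr
```

Calling print_backtrace("xghwb.recon", "core") in the directory holding the core file would return the same kind of frame listing that the interactive "bt" command prints.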

Regards

Willie


comment:2 Changed 9 years ago by sws06djb

Hi Willie,

Thanks for the advice on debugging - I've switched on those options and sent the job back in to HECTOR.

I realise that I'm working from a dump with a row of land-points. However, I thought that the point of the reconfiguration step (I think I have this switched on!) was to modify the input dump to produce a new start dump by applying the specified ancillaries. I have designed/constructed these ancillaries to remove the land-points - but perhaps I've missed the point here? I've double-checked the ape_npr.smow and ape_npr.mask ancillaries and these have no land-points as far as I can tell.

If there's something else I need to do (apart from applying the ancillaries via reconfiguration and setting landpoints=0 in the UMUI) in order to change the land-mass configuration in the model, please let me know.

Thanks,

David

comment:3 Changed 9 years ago by sws06djb

Hi Willie,

I've now rerun the job on HECTOR (xghwb) with the extra diagnostics switches on (I also set BUILDSECT=true in the SUBMIT script as instructed in the UMUI). I've looked in the umui_out file and $DATAM/$DATAW directory but I cannot find any more information to help me.

Regarding the core-file - I don't really know where to begin with this. I have looked in core files before (nearly 10 years ago!) but even then I sort of "knew" what I was looking for in terms of the structure of the core. I've no idea what this is in this case!

Thanks,

David

comment:4 Changed 9 years ago by sws06djb

Another thought on core files. The core files in the $DATAW/$DATAM directory are older than the executable - so presumably they must be from a previous crash rather than the most recent one.
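
The age check described above can be automated. This is a rough sketch under the assumption that core files are named "core" or "core.<something>"; the function name and that naming pattern are assumptions, not something taken from the UM scripts.

```python
import os


def stale_core_files(run_dir, executable):
    """Return core files in run_dir that are older than the executable."""
    exe_mtime = os.path.getmtime(executable)
    stale = []
    for name in sorted(os.listdir(run_dir)):
        path = os.path.join(run_dir, name)
        # Assumption: core files are called "core" or "core.<something>".
        is_core = name == "core" or name.startswith("core.")
        if os.path.isfile(path) and is_core:
            # A core file older than the executable cannot come from that build.
            if os.path.getmtime(path) < exe_mtime:
                stale.append(path)
    return stale
```

Any path returned here predates the current executable, so it would be left over from an earlier crash rather than the latest run.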

Looking at the umui_out files would suggest that the run stopped after reconfiguration but never started to "run". However, looking at the output, it's definitely touched xghwba.p?k0jan etc - so it must have run at least part of a timestep.

Thanks,

David

comment:5 Changed 9 years ago by willie

Hi David,

The flush modset I suggested in #670 should work here too. It goes in the "modifications for the model" page.

Regards,

Willie

comment:6 Changed 9 years ago by sws06djb

Hi Willie,

I've done as requested with the flush script.

xghwc has no land points at all
xghwd has polar rows at both poles

There seems to be quite a bit more diagnostic output this time in umui_out: it does quite a few atmosphere timesteps before exploding (in the second slab timestep - presumably day 2). Please can you have a look at this.

I'll let you know when the re-run with the flush script is complete on the other problem (#670).

Thanks,

David

comment:7 Changed 9 years ago by willie

David, could you chmod a+w on your core files in xghwc, please?

Willie

comment:8 Changed 9 years ago by sws06djb

Done - added r & w permissions. Thanks,

Dave.

comment:9 Changed 9 years ago by willie

Hi David,

It has a segmentation fault in time step 97, just when it is about to report the GCR(2) convergence results. Since the time step is 30 minutes, we can be reasonably sure that the issues to do with land points have been resolved.

The debugger shows the following traceback:

Program terminated with signal 11, Segmentation fault.
#0  0x0000000000ce7ad6 in bi_linear (dim_i_out=48, dim_j_out=10, dim_k_out=14, dim_i_in=48, dim_j_in=10,
    dim_k_in=38, halo_i=4, halo_j=5, data_in=(( ( 6371229, 6371229) ( 6371229, 6371229) ) ),
    i_out=(( ( 1) ) ), j_out=(( ( 1) ) ), weight_lambda=(( ( 0.57178993092150421) ) ),
    weight_phi=(( ( 0.1149183591335543) ) ), data_out=(( ( 6371229) ) ))
    at /esfs1/n02/n02/lrdjb/fellow/xghwc/rebuild_xghwc/src/a12_2a/Gq/bilin2a.f:87
87                     Data_out (i,j,k) = (1.-weight_lambda(i,j,k)) *
(gdb) up
#1  0x0000000000c46c73 in bi_linear_h (
    data_in=(( ( 4.9406564584124654e-324, 4.9406564584124654e-324) ( 3.9512956058187426e-62, 3.3162809731351507e-76) ) ), lambda_out=Cannot access memory at address 0x4060e00000000000
)
    at /esfs1/n02/n02/lrdjb/fellow/xghwc/rebuild_xghwc/src/a12_2a/Gq/bilinh2a.f:566
566           call bi_linear (dim_i_out, dim_j_out, dim_k_out,
Current language:  auto; currently fortran
(gdb) up
#2  0x0000000000d1f489 in ritchie (type=3, timestep=1800, u_adv=Cannot access memory at address 0x0
)
    at /esfs1/n02/n02/lrdjb/fellow/xghwc/rebuild_xghwc/src/a12_2a/Gq/ritch2a.f:3181
3181              Call Bi_Linear_H (r_theta_levels(1-halo_i,1-halo_j,temp),
(gdb) up
#3  0x0000000000c51f5e in departure_point (type=3, timestep=1800, u_adv=Cannot access memory at address 0x0
)
    at /esfs1/n02/n02/lrdjb/fellow/xghwc/rebuild_xghwc/src/a12_2a/Gq/deppnt2a.f:263
263               Call Ritchie(

and further up the chain,

    at /esfs1/n02/n02/lrdjb/fellow/xghwc/rebuild_xghwc/src/a12_2a/Gq/slthrm2a.f:529
529           Call Departure_Point(
(gdb) up
#5  0x000000000050ff23 in ni_sl_thermo (theta=Cannot access memory at address 0x0
)
    at /esfs1/n02/n02/lrdjb/fellow/xghwc/rebuild_xghwc/src/a12_2a/Gq/ni_sl_thermo.f:526
526             Call SL_Thermo(
(gdb) up
#6  0x00000000004a7026 in atm_step_ () at /esfs1/n02/n02/lrdjb/fellow/xghwc/compile_xghwc/atmstep2.f:7967
7967            Call NI_SL_Thermo(

If this is a genuine model instability, then reducing the time step may get past it.

Regards,

Willie

comment:10 Changed 9 years ago by sws06djb

Hi Willie,

I've tried rerunning a few more experiments. xghwq, as suggested, halved the timestep - it blew up as before.

However, in xghwq I left the timestep alone and turned off the debugging options. This worked! Possibly the reason it failed when I tried this initially was because I'd accidentally "missed" an ancillary input file in the reconfiguration on my first attempt (I think I missed it because there's one UMUI window that asks for 2 ancillary files to be specified whereas all the others only seem to want one).

Both of these experiments had two polar land rows (one at each pole). I'm now trying the same trick with an experiment with both polar land rows removed (xghwo) and I'll see if that works.

Thanks,

David

comment:11 Changed 9 years ago by sws06djb

Sorry - previous post should be:

xghwq (halved timestep, 2 polar rows, debug ON) = Fails
xghwp (normal timestep, 2 polar rows, debug OFF) = Works

The new run (xghwo - normal timestep, no polar rows, debug off) fails (GCR seems to stop converging after ~150 timesteps, exploding at about 157 timesteps). No idea what this means but presumably this is the model instability you were referring to.

I will try to re-run this with the timestep halved (xghwr).

As an aside, I'm not actually sure whether I'm reducing the timestep correctly. All I'm doing is changing:
model⇒atmos⇒sci params⇒timestepping: Ntimesteps=96; SWperday=16; LWperday=16.

Is that correct? I really want to get this (and hopefully the other problem) sorted quickly if at all possible because I need to start doing my "real" runs!
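
As a quick sanity check on those numbers, the timestep length and the radiation call interval can be derived from the per-day counts. This is a sketch only: the function name is hypothetical, and the rule that the radiation calls must divide evenly into the timesteps per day is an assumption made here, not something confirmed in this ticket.

```python
SECONDS_PER_DAY = 86400


def timestep_settings(steps_per_day, sw_per_day, lw_per_day):
    """Derive the timestep length and radiation call intervals from per-day counts."""
    if SECONDS_PER_DAY % steps_per_day != 0:
        raise ValueError("timesteps per day do not divide the day evenly")
    # Assumption: radiation is called on whole timesteps, so each per-day
    # count should divide the number of timesteps per day.
    for label, calls in (("SW", sw_per_day), ("LW", lw_per_day)):
        if steps_per_day % calls != 0:
            raise ValueError(f"{label} calls per day do not divide {steps_per_day}")
    return {
        "timestep_seconds": SECONDS_PER_DAY // steps_per_day,
        "steps_per_sw_call": steps_per_day // sw_per_day,
        "steps_per_lw_call": steps_per_day // lw_per_day,
    }


# Values quoted above: Ntimesteps=96; SWperday=16; LWperday=16.
print(timestep_settings(96, 16, 16))
```

With the quoted values this gives a 900 second (15 minute) timestep, half of the timestep=1800 shown in the xghwc traceback, with radiation called every 6 timesteps.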

Thanks,

David

comment:12 Changed 9 years ago by sws06djb

The halved-timestep run I mentioned in the previous post (xghwr - halved timestep, no polar rows, debug off) still fails (timestep 7). Assuming I've managed to halve the timestep correctly, this doesn't seem to fix this problem.

Thanks,

David

comment:13 Changed 9 years ago by willie

Hi David,
xghwq fails after 192 time steps with no warning messages: this is exactly the same point as xghwc, viz. 192 x 0.25 = 96 x 0.5 = 48 hours.
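
The arithmetic behind that comparison can be written out as a short sketch. It assumes xghwc uses the 30 minute timestep mentioned in comment 9 and that xghwq uses exactly half of that; the function name is hypothetical.

```python
def elapsed_hours(completed_steps, timestep_minutes):
    """Model time covered by a number of completed timesteps, in hours."""
    return completed_steps * timestep_minutes / 60.0


# (completed timesteps, timestep length in minutes) for each job.
# xghwc crashed in step 97, i.e. after 96 completed steps.
runs = {"xghwc": (96, 30), "xghwq": (192, 15)}

for job, (steps, minutes) in runs.items():
    print(job, elapsed_hours(steps, minutes), "hours")
```

Both jobs come out at 48 hours of model time, which is the end of the second slab timestep if the slab timestep is one day.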

In xghwo, the GCR fails at time step 150:

 GCR( 2 ) failed to converge in  50  iterations.

The advice from the Met Office is,

"You have introduced new mods that have altered the model's behaviour to something unrealistic. Set up the model to produce output just before the failure and analyse. Good fields to look at are: W INCR (sec 12), dW solver (sec 10), density, q and W. Find max and min values in these fields and look for regions of unrealistic values.
Case 1 - overwriting
Look along the polar rows as sometimes values can become different (even if they are supposed to be the same). If this is the case then run the model for one timestep with 1x2 and 1x4 PE configurations and cumf the two dumps produced after that timestep. If they are different then there is overwriting going on. Isolate the mod causing the problem (by individually removing mods) and reprogram."
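
The checks in that advice can be sketched for a single 2-D field once it has been extracted from the output. This is an illustration only and not a replacement for cumf: it assumes the field is held as a list of rows with the two polar rows first and last, and all three function names are hypothetical.

```python
def field_extremes(field):
    """Return (min, max) over a 2-D field stored as a list of rows."""
    values = [value for row in field for value in row]
    return min(values), max(values)


def polar_rows_uniform(field, tol=0.0):
    """Check that each polar row (first and last) holds a single value."""
    # Assumption: rows run from one pole to the other.
    return all(max(row) - min(row) <= tol for row in (field[0], field[-1]))


def fields_identical(field_a, field_b):
    """True if two fields match exactly, as expected when only the PE layout differs."""
    return field_a == field_b
```

If polar_rows_uniform returns False, or the same field from the 1x2 and 1x4 runs fails fields_identical after one timestep, that would point towards the overwriting case described in the advice.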

I hope that helps.

Regards,

Willie

comment:14 Changed 9 years ago by sws06djb

Hi Willie,

Thanks for your post above but I think we may be at cross purposes here. This is probably my fault because I've been trying to fix things too and therefore causing confusion by trying to keep you up-to-date on my latest runs/tests. Sorry.

There is no need to fix xghwq (from my point of view) because xghwp works. In particular:
xghwq = debugging on, timestep halved = FAILS
xghwp = debugging off, timestep normal = WORKS

Thus, since xghwp is closer to what I want than xghwq there is no purpose to my investigating xghwq. The pattern of model failures suggests to me, however, that the debugging options we've switched on in xghwq are CAUSING the model to crash at the start of the second timestep in the slab (slab timestep = 1day). This is perhaps something that you or the Met Office might wish to look into?

The problem I am still left with is that xghwp has a row of land-points at both poles. This is better than a row of land-points at just one pole but it would still be better to remove ALL the land points.

As xghwp works ok (see above), I copied that job and removed the land points to create job xghwo (nothing else was changed). This fails as described 3 posts above (i.e., GCR seems to stop converging after ~150 timesteps, exploding at about 157 timesteps). I've tried halving the timestep and repeating this (xghwr); this now fails in timestep 7.

Please can you have a look at xghwo and xghwr and see if those crashes make any sense. It all seems bizarre to me: starting from a working run, I have made only small changes to the ancillaries and I've not added/removed any modsets - what on earth could cause it to crash so mysteriously? I'm reluctant to get into too much faffing about with existing modsets and fancy PE configurations unless I really have to - I have relatively little expertise with restarting from unusual times/positions and setting up fancy diagnostics and PE configurations so I'm likely to make errors in doing so (i.e., we'll end up needing to debug the debugging runs!).

Thanks,

David

comment:15 Changed 9 years ago by willie

Hi David,

I am not familiar with the details of the aqua-planet set up, so can't help much more. I've discussed this within the team and there is a vague recollection that the land points are necessary. Further information on aqua-planet modelling with the UM, albeit a few years old, can be found at http://www.cgam.nerc.ac.uk/~mike/APE/.

Sorry I can't be of more help.

Regards,

Willie

comment:16 Changed 9 years ago by sws06djb

Hi Willie,

Thanks for the pointers. I've already spoken to Mike Blackburn (and others) but the experience here is either with HadAM3 (Mike, Steve) or with HadGAM including polar rows (Pete).

However, I don't think that the polar rows are truly "necessary" in HadGAM. I've spoken with Rachel Stratton at UKMO and she runs HadGAM aquaplanets without polar rows. I'm trying a restart dump from one of her jobs to see if this fixes things but I'm not sure I'm going to get to the bottom of this. It could be the slab ocean (although I've been unable to find any code that makes the south pole particularly "special" and my student previously had problems with polar rows in HadGAM without the slab); it could be HECTOR; it could be something/anything else.

If I find a way to fix this, I will let you know (for future reference).

Thanks for your help,

David

comment:17 Changed 9 years ago by sws06djb

Hi Willie,

I think this will probably be my final update on this for a while…

I've tried running the no-land model from a no-land-point aquaplanet start dump provided by Rachel Stratton without success.

However, the problem with HadGSM (particularly the slab-part) seems to extend further than this. In particular, having slab ocean in the *three* southernmost rows on the planet seems to produce some very strange heat fluxes in the ocean and, ultimately, causes the model to crash. This may be due to the fact that I've removed all sea-ice (by setting the freezing temperature to be very low) but I doubt it (it doesn't cause a problem at the north pole).

There therefore seems to be something very strange about the model when you have slab-ocean rather than land in the area of Antarctica (last three rows). My work-around is simply to have three polar rows in each hemisphere: clearly not ideal but I can't afford any more time debugging this at the moment.

Thank you again for all your help over the last couple of weeks!

David

comment:18 Changed 9 years ago by willie

  • Resolution set to fixed
  • Status changed from new to closed