Opened 9 years ago

Closed 9 years ago

#670 closed help (fixed)

Aquaplanet HadGSM1 crash - TOA insolation mod

Reported by: sws06djb
Owned by: um_support
Component: UM Model
Keywords:
Cc: d.j.brayshaw@…
Platform:
UM Version: 6.1

Description

Hello,

I'm currently setting up some idealised slab-ocean experiments in HadGAM1 to explore energy transport responses in the atmosphere. As part of this I need to impose an unusual insolation profile at the top of the atmosphere.

I already have a working "perpetual equinox" aquaplanet (job xghwa) where the Earth's orbit is set to a circle and the axial tilt removed (modset /home/n02/n02/lrdjb/um/djb_mods/ape_orbit_js.mf77). This works fine.

However, I now wish to impose a profile of insolation similar to the annual mean rather than the equinoctial one (job xghwk). To this end, I have created modsets (/home/n02/n02/lrdjb/um/djb_mods/ann_mean_s.mf77 and .mh) to adjust the solar parameters AND the zenith angle.
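For reference, the target profile can be computed offline and compared with whatever the modset produces. The sketch below is not the code in ann_mean_s.mf77. It is a minimal Python illustration that averages the standard daily-mean insolation formula over a circular orbit. The solar constant, the obliquity, and all function names are assumptions chosen for illustration.

```python
import math

S0 = 1365.0                      # W m-2, assumed solar constant (not from the ticket)
OBLIQUITY = math.radians(23.44)  # assumed axial tilt for the annual-mean case


def daily_mean_insolation(lat, decl, s0=S0):
    """Daily-mean TOA insolation (W m-2) at latitude lat for declination decl (radians)."""
    # Hour angle at sunset; the clamp covers polar day (h0 = pi) and polar night (h0 = 0).
    cos_h0 = -math.tan(lat) * math.tan(decl)
    h0 = math.acos(max(-1.0, min(1.0, cos_h0)))
    return (s0 / math.pi) * (h0 * math.sin(lat) * math.sin(decl)
                             + math.cos(lat) * math.cos(decl) * math.sin(h0))


def annual_mean_insolation(lat, obliquity=OBLIQUITY, n=720, s0=S0):
    """Mean of the daily-mean insolation over one circular orbit."""
    total = 0.0
    for k in range(n):
        # On a circular orbit, equal steps in orbital longitude are equal steps in time.
        lon = 2.0 * math.pi * (k + 0.5) / n
        decl = math.asin(math.sin(obliquity) * math.sin(lon))
        total += daily_mean_insolation(lat, decl, s0)
    return total / n


if __name__ == "__main__":
    for lat_deg in (0, 30, 60, 90):
        print(lat_deg, round(annual_mean_insolation(math.radians(lat_deg)), 1))
```

Setting `obliquity=0.0` recovers the perpetual-equinox profile, `S0 / pi * cos(lat)`, which gives a quick consistency check between the two set-ups.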

The model crashes, but without much in the way of diagnostics. According to the .peXX files, q_pos is reset non-conservatively but, given that the model seems to be starting to write an output pp-file, I assume the crash is perhaps related to STASH. Based on experience with a previous problem, I've tried turning all packing off in STASH but the model still fails (xghwl).

I will be continuing to experiment to try to fix this (it is quite possible that my new TOA insolation profile is "wrong" somehow) but any advice on what could be causing the crash, and on things to try to fix it, would be very much appreciated.

Thanks,

David

Change History (11)

comment:1 Changed 9 years ago by willie

Hi David,

Some nasty MPI errors have appeared in job xghwk:

Rank 1 [Mon Aug  8 19:41:50 2011] [c8-1c0s1n0] Fatal error in PMPI_Barrier: Message truncated, error stack:
PMPI_Barrier(363)...................: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(251)..............: 
MPIR_Bcast_impl(1150)...............:

The model has become unstable in the first time step:

  RHS zero so GCR( 2 ) not needed

The difference with job xghwa, which works, is

<  $MY_MODSETS/ape_orbit_js.mf77 Y
---
>  $MY_MODSETS/ape_orbit_js.mf77 N
34a35,36
>  /home/n02/n02/lrdjb/um/djb_mods/ann_mean_s.mf77 Y
>  /home/n02/n02/lrdjb/um/djb_mods/ann_mean_s.mh Y

So I guess that ape_orbit_js.mf77 is essential.

Regards,

Willie

comment:2 Changed 9 years ago by sws06djb

Hello again Willie,

Thanks for looking at this. I don't think that the ape_orbit_js.mf77 modset is the key here though.

If you look at the ann_mean_s.mf77 modset, the first thing it does is identical to ape_orbit_js.mf77 (setting the orbital parameters in SOLPOS to be equinoctial). I figured, when I wrote my new modset (i.e., ann_mean_s.mf77), that it was easier to include/repeat the code from the first modset rather than have to apply two modsets separately.

I'll rerun the job with the extra diags you suggested in ticket 669 - but the question I'm trying to understand is why the extra code in ann_mean_s.mf77 (compared to ape_orbit_js.mf77) causes the crash. What do the MPI and GCR(2) messages mean? Is there an "efficient" way to find out how it relates to the extra code?

Thanks,

David

comment:3 Changed 9 years ago by sws06djb

Hi Willie,

This run (xghwl) has now been repeated with extra diagnostics on - and the extra stuff (such as it is) seems to be totally uninformative as far as I can tell.

There are, however, 3 core files produced by this one. I ran the gdb programme (as per ticket #669) and it seems to have crashed in the model executable rather than the reconfig. Without really knowing what to do, I typed "up" several times and, from what I can tell, the model crashed during the semi-lagrangian advection phase (in subroutines within SL_Thermo). I tried to see if there was anything particularly strange in the variables reported. There were some *very* small numbers in qcf2_star, qcl etc (either zero or 3.E-320) but I've no idea whether this is normal or not for the model.
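On the very small numbers: 3.E-320 is below the smallest normalised 64-bit real (about 2.2E-308), so it is a subnormal value. Such values can come from genuine underflow, but they also appear when memory holding something other than a real (for example a small integer) is read as a real. That is only one possible reading, not a diagnosis of this crash. A minimal Python sketch, with illustrative function names that are not part of the UM or gdb:

```python
import struct
import sys


def classify_double(x):
    """Return 'zero', 'subnormal' or 'normal' for a finite 64-bit float."""
    if x == 0.0:
        return "zero"
    # sys.float_info.min is the smallest positive normalised double (~2.2e-308).
    if abs(x) < sys.float_info.min:
        return "subnormal"
    return "normal"


def int_bits_as_double(n):
    """Reinterpret the bit pattern of a 64-bit integer as a 64-bit real."""
    return struct.unpack("<d", struct.pack("<q", n))[0]


if __name__ == "__main__":
    print(classify_double(3e-320))
    # A small integer read as a real lands in the same subnormal range.
    print(int_bits_as_double(6072))
```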

Help/advice would be much appreciated.

Thanks,

David

comment:4 Changed 9 years ago by willie

Hi David,

Could you add the modset /home/n02/n02/wmcginty/modsets/flush.mf77 and run the job again? At the moment the reconfiguration works but we are getting no output from the model. The flush should fix that, provided the timers are on. You need to clear the output directory, including the core files, first.

Regards,

Willie

comment:5 Changed 9 years ago by sws06djb

Hi Willie,

I've now run this again (job xghwl). Seems to crash in the first timestep - where it seems to be adding loads of moisture - perhaps I have simply hit the model too hard with the change in insolation? Does that sound plausible? Any thoughts on how to get round this would be helpful.

Thanks,

David

comment:6 Changed 9 years ago by willie

Hi David,

It fails in the first time step and the trace back is almost the same as for #699, except at the lowest level it fails in bilinh2a.f line 487.

There are, I think, two possibilities - faulty initial data or faulty code. To try to go a bit further you could repeat the run with more diagnostics. Go to scientific > section by section > section13 and tick diagnostic prints. Push the diag_prn button at the bottom, tick diagnostic prints again and select every time step; use a vertical velocity value of 10.0.

In control > post process > dumping and meaning, check the time steps button and select restart dumps every time step.

Regards,

Willie

comment:7 Changed 9 years ago by sws06djb

Hi Willie,

I've added the extra diagnostics here and it has crashed as before (job xghwm). I'm also going to try a run with timestep halved (as per #699) but I'm not sure this will help.

As an aside, this is NOT because I've mis-specified ancillary files (see my most recent comment on #699). The ancillary files here are identical to xghwa (one polar row of points and equinoctial insolation) which works fine.

If you can have a look and suggest anything based on the diags from xghwm then that would be great.

David

comment:8 Changed 9 years ago by sws06djb

Hi Willie,

I've completed a few more tests here:

xghwn =
As xghwm (annual mean insolation + second lot of extra diagnostics) but with halved timestep. Still fails.

xghww =
As xghwn (annual mean insolation + halved timestep) but all the "extra" diagnostics removed. Still fails.

xghwv =
As xghwk (original annual mean insolation run, normal timestep, normal diagnostics) but with all STASH removed. Still fails.

All the failures appear to occur in the first timestep. Presumably my new modset (ann_mean_s.mf77) is causing the model to do something it doesn't like… is there any sensible and efficient way to tie down what this is from the diagnostic output above? Otherwise, is the only/best way to proceed to manually add more diagnostic messages via modsets? I think my code should be ok but I haven't managed to directly "check" its output in any meaningful sense.

Thanks,

David

comment:9 Changed 9 years ago by willie

Hi David,

The extra diagnostics have not produced any information and there have been no dumps at each time step. These tests are designed to detect the model becoming unstable and blowing up. But it does not look as if it is getting as far as doing the integration.

I think the best way to proceed is to add print statements to the ann_mean_s.mf77 modset and check that it is doing what it ought.
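Once the modset prints its insolation values, they can be checked offline. The following is a minimal Python sketch of such a check, not part of the UM. The function name, the solar constant and the tolerance are all assumptions. It flags non-finite, negative or implausibly large values, and compares the area-weighted mean with a quarter of the solar constant, which is what a time-mean profile on a circular orbit should give.

```python
import math


def check_insolation_profile(lats_deg, values, s0=1365.0, tol=0.05):
    """Return a list of problems found in a time-mean TOA insolation profile.

    lats_deg: latitudes in degrees; values: insolation (W m-2) at those latitudes.
    s0 and tol are assumed values, not taken from the ticket.
    """
    if len(lats_deg) != len(values):
        return ["latitude and value lists differ in length"]
    if not values:
        return ["empty profile"]
    problems = []
    for lat, v in zip(lats_deg, values):
        if not math.isfinite(v):
            problems.append(f"non-finite value at latitude {lat}")
        elif v < 0.0:
            problems.append(f"negative insolation {v} at latitude {lat}")
        elif v > s0:
            problems.append(f"insolation {v} exceeds s0 at latitude {lat}")
    if not problems:
        # Area-weighted (cos latitude) mean of the profile.
        weights = [math.cos(math.radians(lat)) for lat in lats_deg]
        mean = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(mean - s0 / 4.0) > tol * s0 / 4.0:
            problems.append(
                f"global mean {mean:.1f} W m-2 is far from s0/4 = {s0 / 4.0:.1f}"
            )
    return problems
```

An empty list means none of these basic checks failed. It does not show that the profile is the intended one.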

Regards,

Willie

comment:10 Changed 9 years ago by sws06djb

Hi Willie,

Just thought I'd send an update on this. I think I have this working now - embarrassingly it was my own fault (my modset was not doing what I thought it was!).

Thanks (and sorry for wasting your time on this one). The other problem #669 is still being a real pain though!

David

comment:11 Changed 9 years ago by willie

  • Resolution set to fixed
  • Status changed from new to closed