Opened 10 years ago

Closed 10 years ago

#546 closed help (fixed)

HadGEM2-ES on Monsoon - crash for insufficient space

Reported by: abozzo
Owned by: um_support
Component: UM Model
Keywords: ancillary, header
Cc:
Platform:
UM Version: 6.6.3

Description

Hi,

I'm testing a modified version of HadGEM2-ES on MONSooN. The only changes are to accept solar and volcanic forcing from year 800 onwards.
I'm trying to run a test simulation around year 1258. I successfully compiled the modified version and started the simulation. It was supposed to run for 2 months (which should take no more than 2 hours). After 3h 30min the job (xfkme) crashed with the error message:

UM ERROR (Model aborting) :
Routine generating error: INITIAL
Error code: 14
Error message:
INANCILA: Insufficient space for LOOKUP headers

In particular, the message seems to refer to one ancillary file. From the output file xfkme.fort6.pe10:

Ancillary data file 38 , unit no 154 , Clim biogenic aerosol
No room in LOOKUP table for Ancillary File 38
INANCCTL: Error return from INANCILA 14
INANCILA: Insufficient space for LOOKUP headers
Failure in call to INANCCTL

Any idea if this has to do with the modification I've made to the code to accept longer volcanic aerosol input?

Many thanks,
Alessio

Attachments (1)

xfkmf000.xfkmf.d10338.t123542.leave (10.1 KB) - added by abozzo 10 years ago.


Change History (16)

comment:1 follow-up: Changed 10 years ago by willie

  • Keywords ancillary, header added

Hi Alessio,

You could try increasing the number of Ancil headers. It's in Atmosphere > Ancillary and Input Files > In file options > Header Record Sizes. You can over-specify, so increase it from 4000 to, say, 6000. (A quick way to check the current values is sketched below.)
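If it helps, you can check what the UMUI actually wrote out before resubmitting. This is only a sketch: the processed-job path and the namelist file name (SIZES) are my assumptions, so look in your own job directory for the equivalent.

  # Hedged sketch: show the header-size settings the UMUI wrote for the job.
  # The path and the SIZES file name are assumptions for illustration.
  grep -in "lookup" ~/umui_jobs/xfkme/SIZES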

Regards,

Willie

comment:2 in reply to: ↑ 1 Changed 10 years ago by abozzo

Hi Willie,

Many thanks. I resubmitted the model and it ran smoothly for 2 of the 3 years of my test simulation. Then it crashed with this error message from processor 31:

%PE31 OUTPUT%

na_spec nam: HO2S sum user_d: 0.0000E+00 sum zfnatr: 0.0000E+00

l2norms printing too often, limited to 12 occasions
l2norms printing too often, limited to 12 occasions
*
UM ERROR (Model aborting) :
Routine generating error: Interpolation
Error code: 10
Error message:

over-writing due to dim_e_out size

Also, the following warning message was printed throughout the run:

Time varying volcanic forcing has been selected but
the check that the run is global has been disabled.
Only use this option for global simulations.
*
UM WARNING :
Routine generating warning: NI_rad_ctl3c
Warning code: -1
Warning message:

WARNING: VOLCANIC FORCING

Resubmitting the job doesn't work.
Any idea why the model stops running?

Many thanks
Alessio


comment:3 Changed 10 years ago by willie

Hi Alessio,

This is sometimes due to the semi-Lagrangian advection scheme. Could you try using considerably more North-South processors than East-West? You could go from 8x8 to 16 North-South x 8 East-West. A rough sketch of what this changes is below.
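Just to illustrate the effect on the decomposition (assuming an N96 grid of 192 x 145 points, which is my assumption about your configuration):

  # Hedged sketch: approximate points per PE for the two decompositions,
  # assuming a 192 x 145 (East-West x North-South) grid.
  echo "$((192/8)) x $((145/8))"    # 8 EW x 8 NS  -> about 24 x 18 points per PE
  echo "$((192/8)) x $((145/16))"   # 8 EW x 16 NS -> about 24 x 9 points per PE

Each PE then works on a much shorter North-South strip, which changes how the semi-Lagrangian departure-point work is distributed across processors.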

Regards,

Willie

Changed 10 years ago by abozzo

comment:4 Changed 10 years ago by abozzo

Hi Willie,

Thank you. I've been trying the 16x8 configuration, but without much luck. Please find attached the .leave file: the model hangs for more than 2 hours and then crashes without any clear error message.

I'm struggling to understand what could possibly be going wrong: the control version (the same as the Met Office control for CMIP5) works. The version with my changes to read the extended volcanic and solar forcings either crashes after 2 of the 3 years of the simulation with the 8x8 processor configuration, or doesn't run at all with the 16x8 configuration.

Thanks for your help,
Alessio

comment:5 Changed 10 years ago by willie

Hi Alessio,

There were two tricks I was going to use to solve the dim_e_out problem. One was to increase the number of NS processors and the other was to configure the land-sea mask. You are doing both of these.

I did a check setup and there are some problems with the STASH and user STASH. The STASH problems can easily be corrected by selecting "not included" where the STASH is deemed not available. More serious (I think) is the user STASH problem. Visit the user STASH page and then push the prognostics button. A number of windows appear complaining about broken codes. These all come from ~ros/HadGEM2/HG2L60_local/userstash/epflux606, which appears to be incompatible with your setup. Perhaps this is the cause?

Regards,

Willie

comment:6 Changed 10 years ago by abozzo

Hi Willie,

Thanks. I'm still struggling with this problem. Apparently the STASH problem is not the main cause of the crash: I got rid of the bad STASH codes but the model still crashes with the same error. And I still cannot get it to run with the 16x8 configuration.
I noticed an error message in the .leave file right at the beginning, just after the model run starts:

/projects/lastmil/abozzo/um/xfkmf/bin/qsexecute[814]: assign: not found.

In qsexecute the "assign" command is related to OASIS. I checked, and there is no "assign" command available on MONSooN. Could that be a possible cause? (A possible guard is sketched below.)
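For what it's worth, a guard like the following would at least stop qsexecute tripping over the missing command. This is only a sketch: I'm assuming that skipping the Cray-style assign call is harmless on the IBM.

  # Hedged sketch: only call the Cray-style "assign" command where it exists,
  # so the script does not fail on platforms without it (like MONSooN's IBM nodes).
  if command -v assign >/dev/null 2>&1; then
    assign -R    # reset assign attributes (Cray systems only)
  fi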

Thanks
Alessio

comment:7 Changed 10 years ago by willie

Hi Alessio,

I have tried repeating your run. The output does not mention dim_e_out, but I am getting an error

ERROR: 0031-250  task 17: Trace/BPT trap

and when I look at the core file using dbx, the full message is

Trace BPT trap in glue rad at line 2551 in

..../ummodel/ppsrc/UM/atmosphere/short_wave_radiation/glue_rad-rad_ctl3c.f90

call solvar(PREVIOUS_TIME(1), SCS,

and it looks to me as if previous_time(1) has not been set.
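For reference, this is roughly how I got that traceback; the executable name and path are illustrative, so substitute your own binary and core file.

  # Hedged sketch: use AIX dbx to print the call chain from the core file.
  # The executable name and path are assumptions for illustration.
  cd /projects/lastmil/abozzo/um/xfkme
  dbx bin/xfkme.exe core
  # then at the (dbx) prompt:
  #   where        # print the call chain
  #   quit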

I hope that helps,

Regards

Willie

comment:8 Changed 10 years ago by abozzo

Hi Willie,

Many thanks. I've been running the UM with the runtime array bounds check switched on, to check whether my subroutines had that kind of error. Sorry, I didn't mention that (I added it on Friday); that is the cause of the BPT trap error.
Unfortunately, I found that even the UM code without my changes is unable to run with the array bounds check on. When the code is run without my user override file (arr_bd_ck, sketched below), the model runs and then generates the usual dim_e_out error.
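In case it is useful, the override essentially just adds the IBM XL Fortran runtime-checking flag. The line below is illustrative only, since the exact override syntax depends on the UM build configuration:

  # Hedged sketch of the arr_bd_ck override: append the xlf runtime-check flag
  # to the Fortran flags. The variable name is illustrative; the real override
  # syntax depends on the UM build system.
  FFLAGS="$FFLAGS -qcheck"   # -qcheck enables array bounds and other runtime checks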

Many thanks again for your help,
Regards
Alessio

comment:9 Changed 10 years ago by willie

Hi Alessio,

I removed the override and it seems to run perfectly for 1440 time steps. This is my job xfqic.

Regards,

Willie

comment:10 Changed 10 years ago by abozzo

Hi Willie,

Thanks for your patience. The model ran for the first month for me as well, but in the CRUN it crashes after 12 months. I'm running it right now with your configuration, to see what happens.
I noticed you added a script insert setting UM_SECTOR_SIZE to 4096. Does this make a difference?

Regards,
Alessio

comment:11 Changed 10 years ago by willie

Hi Alessio,

I noticed that in the .leave file there was a warning about UM_SECTOR_SIZE not being set, so I thought I'd set it. But in my run I get the same warning despite setting it. The insert is just the line sketched below.
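For completeness, the insert is just an environment variable export; that it runs before the model executable starts is my assumption about where the UMUI places it.

  # Hedged sketch: export the disk sector size used by the UM's I/O.
  # 4096 is the value mentioned above.
  export UM_SECTOR_SIZE=4096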

Regards,

Willie

comment:12 Changed 10 years ago by abozzo

Hi Willie,

The CRUN of xfkme ended last night. It ran fine for 12 months and then crashed with the usual message in the .archive file:

gc_abort (Processor 31 ): over-writing due to dim_e_out size

Traceback:

Offset 0x00000010 in procedure xltrbk_
Offset 0x000000f8 in procedure gc_abort_, near line 180 in file /projects/um1/gcom/gcom3.5/meto_ibm_pwr6_mpp/ppsrc/gcom/gc/gc_abort.f
Offset 0x000003b0 in procedure ereport_, near line 384 in file /projects/lastmil/abozzo/xfkme/ummodel/ppsrc/UM/control/misc/ereport.f90
Offset 0x00009744 in procedure interpolation_, near line 1149 in file /projects/lastmil/abozzo/xfkme/ummodel/ppsrc/UM/atmosphere/dynamics_advection/interpolation.f90
Offset 0x00007f4c in procedure sl_thermo_, near line 982 in file /projects/lastmil/abozzo/xfkme/ummodel/ppsrc/UM/atmosphere/dynamics_advection/sl_thermo.f90
Offset 0x00000f94 in procedure ni_sl_thermo_, near line 705 in file /projects/lastmil/abozzo/xfkme/ummodel/ppsrc/UM/atmosphere/dynamics_advection/ni_sl_thermo.f90
Offset 0x000165f8 in procedure atm_step_, near line 10341 in file /projects/lastmil/abozzo/xfkme/ummodel/ppsrc/UM/control/top_level/atm_step.f90
Offset 0x0001bc28 in procedure u_model_, near line 5353 in file /projects/lastmil/abozzo/xfkme/ummodel/ppsrc/UM/control/top_level/u_model.f90
Offset 0x0000230c in procedure um_shell, near line 4312 in file /projects/lastmil/abozzo/xfkme/ummodel/ppsrc/UM/control/top_level/um_shell.f90
---- End of call chain ----

I'm going to check whether at least the solar and volcanic inputs have been read in correctly.

Regards,
Alessio

comment:13 Changed 10 years ago by willie

Hi Alessio,

Looking at the last thing in PE31's output:

 Gone wrong in interpolation  961 5 192 1308
 191 10 78 141 0 0 747 141
 *********************************************************************************
 UM ERROR (Model aborting) :
 Routine generating error: Interpolation
 Error code:  10
 Error message: 
over-writing due to dim_e_out size
 *********************************************************************************

So it is complaining about the interpolation.

I hope that helps.

Regards,

Willie

comment:14 Changed 10 years ago by abozzo

Hi Willie,

I'm back on track after the Christmas break. I have apparently found a workaround for the crash, although I'm not sure whether it makes sense.

The model run (xfkme) started in Jun 1257 from a reconfigured dump. The NRUN ended normally and I submitted the CRUN. It ran with no problems until the first half of Dec 1258, when it crashed with the usual error message:

UM ERROR (Model aborting) :

Routine generating error: Interpolation
Error code: 10
Error message:

over-writing due to dim_e_out size

The last dump produced was dated 1st of Dec 1258. I used that dump to start a new run (xfkmi) with the same configuration as the previous one.
It started from where xfkme crashed and ran with no problems for 2 years.

I looked at the time series created by stitching together the output from the two jobs (xfkme from Jun 1257 to Nov 1258 and xfkmi from Dec 1258 to Dec 1260) for a few fields, and it doesn't show any weird values.

I also tried to start the new run from other dumps from xfkme, but it always crashes in Dec 1258 no matter which starting point I choose.
The only way to get past the crash point is to restart the run from the last dump, dated 1st of Dec 1258.

Does this behavior sound strange? Or does it make sense?

I think we can consider this ticket closed for now.
Many thanks,
Alessio

comment:15 Changed 10 years ago by willie

  • Resolution set to fixed
  • Status changed from new to closed

Hi Alessio,

Maybe it is a gradual instability, and restarting from the latest start dump is enough to keep it under control? Anyway, I shall close the ticket.

Regards,

Willie
