Opened 9 years ago

Closed 9 years ago

#710 closed help (fixed)

Over-writing due to dim_e_out size

Reported by: oma Owned by: um_support
Component: UM Model Keywords:
Cc: Platform:
UM Version: 7.3

Description

Hello,

After solving the problems described in #706, I found the following error message at the beginning of the .leave file:

Rank 5 [Tue Oct 11 22:26:24 2011] [c7-0c1s0n3] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 5
[NID 01279] 2011-10-11 22:26:24 Apid 1323486: initiated application termination

That message was followed by the one below (in the 'All PE* output' section):

%PE5 OUTPUT%
 U_MODEL - OBS arrays allocated with sizes  0 2048
  ==============================================
 ********************************************************************
 UM ERROR (Model aborting) :
 Routine generating error: Interpolation
 Error code:  10
 Error message: 
over-writing due to dim_e_out size
 ********************************************************************
 gc_abort (Processor  5 ): over-writing due to dim_e_out size

Thanks in advance for your help.

Oscar

Change History (3)

comment:1 Changed 9 years ago by annette

Hi Oscar,

There's some advice on the Collaboration wiki under Unified Model Errors which might be relevant:

http://collab.metoffice.gov.uk/twiki/bin/view/Support/UnifiedModelErrors

I've just cut and pasted the section in here:

Explanation

The semi-Lagrangian advection scheme looks upwind to find the grid boxes near where a parcel of air would have originated. In some cases it finds too many boxes (due to overlapping halos), and the array holding the data for these boxes fills up and overflows.

Fix

Change your processor configuration so that you are using considerably more North-South processors than East-West processors. This tries to ensure that the halos near the pole (which are squashed in the x direction) are roughly square. On the IBM this might be 2x16 for 1 node and 8x16 for 4 nodes.
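
As a rough illustration of the advice above, the sketch below lists the East-West x North-South decompositions of a given processor count that favour the North-South direction. It is not part of the UM or of the wiki text: the function name `decompositions` and the `min_ratio` threshold (a stand-in for "considerably more") are assumptions made for this example.

```python
def decompositions(total_pes, min_ratio=2.0):
    """Return (ew, ns) pairs with ew * ns == total_pes and ns >= min_ratio * ew.

    min_ratio is a hypothetical threshold for "considerably more"
    North-South than East-West processors; it is not a UM setting.
    """
    pairs = []
    for ew in range(1, total_pes + 1):
        if total_pes % ew:
            continue  # ew does not divide the processor count evenly
        ns = total_pes // ew
        if ns >= min_ratio * ew:
            pairs.append((ew, ns))
    return pairs


if __name__ == "__main__":
    # 32 and 128 match the 2x16 and 8x16 examples quoted above.
    for total in (32, 128):
        print(total, decompositions(total))
```

With the default threshold, the candidates for 32 processors include 2x16 and those for 128 include 8x16, matching the examples quoted above. Which candidate actually avoids the error still has to be checked by running the model.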

Annette

comment:2 Changed 9 years ago by oma

Hi Annette,

Thanks for the piece of advice! I also found a relevant paragraph at 'Errors and solutions on the HECToR phase2b (Cray XE6) service' at

http://cms.ncas.ac.uk/index.php/hpc-faqs/1544-error-messages-and-solution-on-the-hector-phase2b-cray-xe6-service

I have included both suggestions. The model started running but failed about 9 hours into the simulation. I switched to a start dump from 12 hours later, and it worked!

Thanks,

Oscar

comment:3 Changed 9 years ago by annette

  • Resolution set to fixed
  • Status changed from new to closed