Opened 11 years ago

Closed 11 years ago

#304 closed help (fixed)

seg violation with UM 4.5 on Eddie

Reported by: mjm
Owned by: jeff
Component: UM Model
Keywords: segmentation
Cc: morak.simone@…
Platform:
UM Version: 4.5

Description

Hello

This is a request for information and guidance. A report of the present situation with a run comes first, followed by some specific questions:

This run, xdlij, uses a shorter timestep. The history is that an earlier run, xdlii, failed with negative pressure; we restarted as the present run with half the atmosphere timestep, and it is xdlij that hits the segmentation fault.

Extracts of the trace from the leave file (/exports/work/geos_sages_workspace/s0797074/um/umui_out/xdplj000.xdplj.d09205.t175314.leave) follow.

In qsmain, within the leave file, we had:

MPIRUN.eddie107: 27 ranks have not yet exited 60 seconds after rank 22 (node eddie113) exited without reaching MPI_Finalize().

MPIRUN.eddie107: Waiting at most another 60 seconds for the remaining ranks to do a clean shutdown before terminating 27 node processes

1582.36s real 0.08s user 0.07s system

xdplj: Run failed

and in PE0_OUTPUT in the leave file:

ATMOS TIMESTEP 46367

ATMOS TIMESTEP 46368
FINAL TOTAL ENERGY = 0.45125E+27 J/
INITIAL TOTAL ENERGY = 0.45128E+27 J/
CHG IN TOTAL ENERGY OVER DAY = -0.33878E+23 J/
FLUXES INTO ATM OVER DAY = 0.10612E+23 J/
ERROR IN ENERGY BUDGET = 0.44490E+23 J/
TEMP CORRECTION OVER DAY = 0.24697E-01 K
TEMPERATURE CORRECTION RATE = 0.28584E-06 K/S
FLUX CORRECTION (ATM) = 0.28828E+01 W/M2
FINAL ATM MASS = 0.17925E+22 KG
INITIAL ATM MASS = 0.17925E+22 KG
CORRECTION FACTOR FOR PSTAR = 0.10000E+01
im,sm,ngroup,new_im,new_sm 2 2

24 T T

TRANSOUT: Copied into memory LEN_DATA= 226386 submodel=

1

TRANSIN : Copied from memory LEN_DATA= 911792 submodel=

2

TS= 11580 YEAR= 1.34 DAY=122.5 ENERGY= NaN DTEMP= NaN DSALT= NaN SCANS=501
TS= 11592 YEAR= 1.34 DAY=123.0 ENERGY= NaN DTEMP= NaN DSALT= NaN SCANS=501
im,sm,ngroup,new_im,new_sm 1 1

96 T T

TRANSOUT: Copied into memory LEN_DATA= 911792 submodel=

2

TRANSIN : Copied from memory LEN_DATA= 226386 submodel=

1

ATMOS TIMESTEP 46369


The final lines in the xdplj.fort6.pe22 file were:

ATMOS TIMESTEP 46367
ATMOS TIMESTEP 46368
FINAL ATM MASS = 0.17925E+22 KG
INITIAL ATM MASS = 0.17925E+22 KG
CORRECTION FACTOR FOR PSTAR = 0.10000E+01
im,sm,ngroup,new_im,new_sm 2 2

24 T T

TRANSOUT: Copied into memory LEN_DATA= 216607 submodel=

1

TRANSIN : Copied from memory LEN_DATA= 781536 submodel=

2

im,sm,ngroup,new_im,new_sm 1 1

96 T T

TRANSOUT: Copied into memory LEN_DATA= 781536 submodel=

2

TRANSIN : Copied from memory LEN_DATA= 216607 submodel=

1

ATMOS TIMESTEP 46369


My questions:

Please can you give guidance on the NaNs? Should we be able to trap these in anticipation of failure?

What are the numbers being output (DSALT, SCANS, etc.)?

We are currently recompiling with debug, and seeking to catch the core file (which is currently written to a non-existent directory by the UM scripts). Any other suggestions?!

Regards
Mike

Change History (2)

comment:1 Changed 11 years ago by jeff

  • Owner changed from um_support to jeff
  • Status changed from new to accepted

Hi Mike

You can trap floating point errors using the ifort option -fpe0, and the -traceback option could also give you useful information; see the ifort man page for more details. I wouldn't use these options in a standard run as they may affect performance; only use them when the model has problems. Looking at the core dump should tell you which subroutine the problem occurred in; compiling that routine with -g should then give you the line number. Knowing where the model first blows up may not be that useful if the model is unstable.
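Separately from the compiler options, the NaNs already show up in the ocean diagnostic lines of PE0_OUTPUT before the crash, so a quick scan of the output can flag them after the event. The following is only a minimal sketch, assuming the "TS= ... ENERGY= ... DTEMP= ... DSALT= ... SCANS=..." line layout quoted in the description; the function name first_nan_line is hypothetical and is not part of the UM scripts.

```python
import re

# Assumed layout, taken from the PE0_OUTPUT extract in this ticket:
# TS= 11580 YEAR= 1.34 DAY=122.5 ENERGY= NaN DTEMP= NaN DSALT= NaN SCANS=501
FIELD = re.compile(r"(\w+)=\s*(\S+)")


def first_nan_line(lines):
    """Return (line_number, fields) for the first TS= line holding a NaN, else None.

    line_number is 1-based; fields maps each label (TS, YEAR, ...) to its raw text.
    """
    for number, line in enumerate(lines, start=1):
        if not line.lstrip().startswith("TS="):
            continue  # not an ocean diagnostic line
        fields = dict(FIELD.findall(line))
        if any(value.lower() == "nan" for value in fields.values()):
            return number, fields
    return None


# Usage with two lines copied from the output above.
log = [
    "ATMOS TIMESTEP 46368",
    "TS= 11580 YEAR= 1.34 DAY=122.5 ENERGY= NaN DTEMP= NaN DSALT= NaN SCANS=501",
]
print(first_nan_line(log))
```

This only reports where a NaN first appears in the printed diagnostics; it does not stop the model, and the NaN may have entered the fields some timesteps before it reached the output.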

Further comment from mike


After the negative pressure stopped a run, we did a short run with a shorter
timestep from a dump a little time before the problem. It crashed with the
seg error. We repeated the run compiled with -g to trap this, and
it ran ok for 5 years. We then continued with the original executable from
the last dump of that 5 year run. Is this all ok? I know for the 5 years
we had a slightly different model, but assume the crashes are due to
numerical instability and we therefore have to change the model to
continue the run.

If this all seems sensible and a justifiable course of action then this
error can be closed. Is it ok?!


If you are happy with this then I would continue the run; it seems sensible to me.

comment:2 Changed 11 years ago by jeff

  • Resolution set to fixed
  • Status changed from accepted to closed