Opened 5 years ago

Closed 5 years ago

#1244 closed help (fixed)

vn8.4 crash at beginning of 2010

Reported by: dan2012 Owned by: um_support
Component: UM Model Keywords: vn8.4 start 2010 crash
Cc: mohit.dalvi@… Platform: MONSooN
UM Version: 8.4

Description

Hi,
I am experiencing a failure of vn8.4 for my jobs right at the beginning of 2010.
The simulations run fine for several years (from 2005) up to this point and then fail (floating point exception).
This occurs for two runs having different emissions (pre-industrial and present day), JOBID XJCCH, XJCCG.
The .leave file can be seen for one here:
/home/dapart/output/xjccg000.xjccg.d14066.t110235.leave

Others have suggested that past failures in sulphr() were due to transport problems.
However, I have tried the recommended workarounds (changing the convective timestep for the whole simulation, and changing it only for the month that fails), with no luck.

As we need to submit model results for an inter-comparison covering 2005 to the end of 2010, it is crucial to get this year running ASAP, so any help locating the reason for the crash would be greatly appreciated. It is interesting that the crash always occurs right at the beginning of the year.
Many thanks in advance, Daniel

Change History (4)

comment:1 Changed 5 years ago by willie

Hi Daniel,

A floating point error after only two time steps suggests a problem with the initial data or with the algorithms used. I checked the start dump and it is free of NaNs, so that is good. You should identify any non-standard data files and check them with xconv and cumf. The traceback in the leave file can also suggest which branches to look at. You can get line numbers if you select the debug option in the UMUI "Compilation and run options" and recompile the model.
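As a rough illustration of the kind of NaN check described above, here is a minimal Python sketch. It assumes the field values have already been extracted into plain lists of floats by some other tool; the field names and numbers are hypothetical, and the sketch does not read UM dumps or ancillary files itself.

```python
import math


def find_non_finite(fields):
    """Return {field_name: [indices]} for every NaN or infinite value found."""
    bad = {}
    for name, values in fields.items():
        # math.isfinite is False for NaN, +inf and -inf
        indices = [i for i, v in enumerate(values) if not math.isfinite(v)]
        if indices:
            bad[name] = indices
    return bad


# Hypothetical values standing in for data extracted from a non-standard file.
example_fields = {
    "so2_emissions": [0.0, 1.2e-9, float("nan"), 3.4e-9],
    "theta": [280.0, 285.5, 290.1],
}

print(find_non_finite(example_fields))
```

An empty result only rules out NaNs and infinities; values that are finite but physically unreasonable would still need a comparison against a known-good file.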

Regards

Willie

comment:2 Changed 5 years ago by dan2012

Hi Willie,
I have performed a traceback using debugging options:
xjccg000.xjccg.d14071.t122449.leave

This provides a line number; however, it points to a line containing an OpenMP directive, not to the type of floating point error shown in the .leave file.

I am not sure how to debug this further. Any suggestions would be great.

best regards,
Daniel

comment:3 Changed 5 years ago by willie

Hi Daniel,
Chasing the errors is sometimes a fruitless task, since the reported error can occur long after the true cause. That said, switching on the debug code has changed the failure from a wide variety of floating point errors to a consistent divide by zero.

Since the last successful run of xjccg on 4 Mar, I believe you have only changed:

  • the start date
  • the start dump
  • the number of convection calls per physics time step

The start dump generally has fields with a reference date of 2005/09/01, except for a few fields at the end (e.g. field 680 "Analysis Temperature on model levels") which are dated 2001/05/02. I don't know if that is a problem, but it may be worth checking.
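A check like this could be automated once the field headers have been listed. The following Python sketch assumes the headers are already available as (field number, reference date) pairs; apart from field 680 and the two dates quoted above, the entries are hypothetical, and the sketch does not parse a dump file.

```python
from collections import Counter


def fields_with_odd_dates(field_dates):
    """Return the (field, date) pairs whose date differs from the most common date."""
    if not field_dates:
        return []
    # The most frequent reference date is treated as the expected one.
    common_date, _count = Counter(date for _field, date in field_dates).most_common(1)[0]
    return [(field, date) for field, date in field_dates if date != common_date]


# Field numbers 2 to 4 are hypothetical placeholders; 680 is the field noted above.
example_headers = [
    (2, "2005/09/01"),
    (3, "2005/09/01"),
    (4, "2005/09/01"),
    (680, "2001/05/02"),
]

for field, date in fields_with_odd_dates(example_headers):
    print(f"field {field} has reference date {date}")
```

Whether a mismatched date actually matters depends on how the model uses that field, so any field flagged this way is a candidate for inspection rather than a confirmed fault.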

Regards,

Willie

comment:4 Changed 5 years ago by willie

  • Resolution set to fixed
  • Status changed from new to closed