Opened 8 years ago

Closed 8 years ago

#824 closed help (fixed)

Runs failing to restart

Reported by: SimonDriscoll Owned by: willie
Component: UM Model Keywords:
Cc: Platform:
UM Version: 6.6.3

Description

Query opened on behalf of Simon.
The 2 runs xgyte and xhalb are failing to restart.


On one run, the dates have commenced as normal corresponding to 2086:

-rw-r--r-- 1 sdrisc n02  244776960 2012-03-16 14:52 xgytea.pcs6jan
-rw-r--r-- 1 sdrisc n02  148094976 2012-03-16 14:52 xgytea.pjs6jan
-rw-r--r-- 1 sdrisc n02  798916608 2012-03-16 14:52 xgytea.pis6jan
-rw-r--r-- 1 sdrisc n02  112214016 2012-03-16 14:52 xgytea.pgs6jan
-rw-r--r-- 1 sdrisc n02    8241152 2012-03-16 14:52 xgytea.pfs6jan

but the other has dates of '00':

-rw-r--r-- 1 sdrisc n02  236929024 2012-03-16 13:48 xhalba.pc02oct
-rw-r--r-- 1 sdrisc n02  755417088 2012-03-16 13:48 xhalba.pa02oct
-rw-r--r-- 1 sdrisc n02  148094976 2012-03-16 13:48 xhalba.pj02oct
-rw-r--r-- 1 sdrisc n02  798916608 2012-03-16 13:48 xhalba.pi02oct
-rw-r--r-- 1 sdrisc n02  112214016 2012-03-16 13:48 xhalba.pg02oct
-rw-r--r-- 1 sdrisc n02    5177344 2012-03-16 13:48 xhalba.pf02oct
-rw-r--r-- 1 sdrisc n02    2102368 2012-03-16 13:48 xhalba.pf02nov
-rw-r--r-- 1 sdrisc n02   35536896 2012-03-16 13:48 xhalba.pe02oct
-rw-r--r-- 1 sdrisc n02  139400256 2012-03-16 13:50 xhalbo.pb02oct

Is this normal? Should this just be treated as 2088 ('00' = 2086), and if so, why has it not been written like the one above (i.e. why is it not like xhalba.pas8oct)?

The one run with normal dates (xgyte) has once again collapsed. I'm completely confused as to why this is (as it's a copy of a control that had run fine for twenty years before).

I got an email saying:

PBS Job Id: 593951.sdb
Job Name:   xgyte001
Aborted by PBS Server
Job cannot be executed
See job standard error file

and then the 20:00 .archive file (last one) says only:

--------------------------------------------------------------------------------

Resources requested: mpparch=XT,mppnppn=32,mppwidth=96,ncpus=1,place=pack,walltime=05:53:20
Resources allocated:

*** sdrisc   Job: 593951.sdb   ends: 17/03/12 20:00:10   queue: par:4n_6h ***
*** sdrisc   Job: 593951.sdb   ends: 17/03/12 20:00:10   queue: par:4n_6h ***
*** sdrisc   Job: 593951.sdb   ends: 17/03/12 20:00:10   queue: par:4n_6h ***
*** sdrisc   Job: 593951.sdb   ends: 17/03/12 20:00:10   queue: par:4n_6h ***
--------------------------------------------------------------------------------

The one immediately before it (19:55) has a fail message:

  INITMEAN: ***** Called in ATMOSPHERIC mode *****
  H_STEP= 6480
 INITMEAN: ***** Called in OCEAN mode *****
 INITMEAN: No means requested
  INITTIME; analysis_hrs,incr_secs  0.,  0
  INITTIME; model_data_time=  1859,  12,  1,  3*0
  INITTIME; model_basis_time=  1859,  12,  1,  3*0
 INITTIME: Warning- New STEP doesn't match old value                            
 Internal model id 1  Old= 6480  New= 5862240
 INITTIME: Warning- New STEP doesn't match old value                            
 Internal model id 2  Old= 2160  New= 1954080
 INITTIM: Modifying TARGET_END_STEP from  12960
 Modified to  12960
 Failure in call to INITTIME
 *********************************************************************************
 UM WARNING :
 Routine generating warning: INITIAL
 Warning code:  -1
 Warning message:
INITTIME: Warning- New STEP doesn't match old value
 *********************************************************************************
  ANCIL_REFTIME set by User Interface =  1978,  12,  1,  3*0 

and a similar message occurs in previous .archive files.

Is this associated with the start dump, etc. ("New STEP doesn't match old value, Internal model id 1, Old= 6480, New= 5862240")?

Thanks!

Simon

Change History (22)

comment:1 Changed 8 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Simon,

You may have run out of disc space. Job xgyte has done 443 restarts out of a possible 540 and is using 142GB; job xhalb is using 305GB. If you type,

lfs quota -u sdrisc /esfs1 | grep "/esfs1" | awk '{printf("WORK %5.2f %%\n",100*$2/$3)}'

you will get a percentage of your quota.
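
(For reference: in the `lfs quota` output line matched by the grep, field $2 is normally kbytes used and $3 the quota, so the awk computes 100 * used / quota. A minimal sketch of the same arithmetic in Python, with made-up sample numbers:)

```python
# The awk one-liner computes 100 * $2 / $3, i.e. percent of quota used.
# Made-up sample values (kbytes used, kbytes quota):
used_kb, quota_kb = 161_000_000, 318_000_000
print(f"WORK {100 * used_kb / quota_kb:5.2f} %")   # → WORK 50.63 %
```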

Regards,

Willie

comment:2 Changed 8 years ago by SimonDriscoll

Hi Willie,

lfs quota -u sdrisc /esfs1 | grep "/esfs1" | awk '{printf("WORK %5.2f %%\n",100*$2/$3)}'

gives

WORK 50.53 %

Which is consistent with what I see if I log in to the HECToR SAFE website to check my disk usage, and also if I check directly on HECToR through a terminal.

Best regards,

Simon

comment:3 Changed 8 years ago by willie

Hi Simon,

Thanks. I would do a UMUI job difference between xgyte and the job that ran for 20 years. This is sometimes useful.

Regards,

Willie

comment:4 Changed 8 years ago by SimonDriscoll

Hi Willie,

good thinking.

Differences

Job xgvdu Title cp xgvda - control run
Job xgyte Title High Top Control on HECToR - 2080 with 30hPa control, varying VOLC_UP_PRESS (and no volc) cp xgytd
Difference in window subindep_Runlen

→ Model Selection

→ Sub-Model Independent

→ Start Date and Run Length Options

Entry box: Years

Job xgvdu: Entry is set to '20'
Job xgyte: Entry is set to '15'

Entry box: Day

Job xgvdu: Entry is set to '1'
Job xgyte: Entry is set to '0'

Entry box: Month

Job xgvdu: Entry is set to '12'
Job xgyte: Entry is set to '0'

Entry box: Year

Job xgvdu: Entry is set to '2080'
Job xgyte: Entry is set to '0'

Difference in window subindep_JobRes

→ Model Selection

→ Sub-Model Independent

→ Job submission, resources and re-submission pattern

Entry box: Job time limit (seconds)

Job xgvdu: Entry is set to '19300'
Job xgyte: Entry is set to '21500'

Difference in window subindep_JobRes2

→ Model Selection

→ Sub-Model Independent

→ Job submission, resources and re-submission pattern

→ Follow on window

Entry box: Job time limit, for QSUB

Job xgvdu: Entry is set to '19000'
Job xgyte: Entry is set to '21200'

Difference in window subindep_FCM_Mods

→ Model Selection

→ Sub-Model Independent

→ FCM Configuration

→ FCM Configuration Optional Modifications

Differences in Table User Modifications

22c22

< fcm:um_br/dev/SimonDriscoll/HG6.6.3_new_volcanic_aerosol/src 8173 Y
---
> fcm:um_br/dev/SimonDriscoll/HG6.6.3_new_volcanic_aerosol/src 8692 Y

Difference in window atmos_InFiles_Start

→ Model Selection

→ Atmosphere

→ Ancillary and input data files

→ Start dump

Entry box: and file name

Job xgvdu: Entry is set to 'akgiea.das0c10'
Job xgyte: Entry is set to 'xgytea.das5b10'

Entry box: Enter directory or Environment Variable

Job xgvdu: Entry is set to '/work/n02/n02/sdrisc/start_dumps'
Job xgyte: Entry is set to '/work/n02/n02/sdrisc/xgyte'

Difference in window ocean_InFiles_Start

→ Model Selection

→ Ocean GCM

→ Input Files

→ Start dump

Entry box: and file name

Job xgvdu: Entry is set to 'akgieo.das0c10'
Job xgyte: Entry is set to 'xgyteo.das5b10'

Entry box: Enter directory or Environment Variable

Job xgvdu: Entry is set to '/work/n02/n02/sdrisc/start_dumps'
Job xgyte: Entry is set to '/work/n02/n02/sdrisc/xgyte'

The only major thing is the branch. However, xgyte has run for around 5 years without a problem, and I'm certain the code changes made in the branch are fine.

Best regards,

Simon

comment:5 Changed 8 years ago by willie

Hi Simon,

It looks like your ocean dump is corrupt. You need to

export MALLOC_CHECK_=0

and then run xconv on xgyteo.das5b10. If you look at field 107 (OT18: DIATOM-CHLORO; CFC12; EXTRA-C), for example, in the data view you will see lots of NaNs.

Regards,

Willie

comment:6 Changed 8 years ago by SimonDriscoll

Hi Willie,

great, thanks.

I'll go to a previous dump and run from there and see how that works.

Best,

Simon

comment:7 Changed 8 years ago by SimonDriscoll

Hi Willie,

looking back through the five years of runs, the ocean start dumps are all like this.

I have an ocean (and corresponding atmosphere) dump from a MONSooN control run /work/n02/n02/sdrisc/start_dumps/akgieo.das0c10 which appears to be fine.

I used these originally to begin the run on HECToR. At the time I asked what needed changing (such as start date, etc.) for the new dump, so I assume this procedure is fine. Is there anything you can think of that could cause this error? Could you also tell me what I need to change to run from a new start dump, to confirm I am making all the right changes?

Best regards,

Simon

comment:8 Changed 8 years ago by willie

Hi Simon,

I don't know of any reason why you shouldn't use your good ocean start dump to start xgyte. I don't know of any special actions you need to take.

Regards,

Willie

comment:9 Changed 8 years ago by SimonDriscoll

Hi Willie,

ok. My point is that the good dump is the one I used; it then ran for five years, and I have now found these issues with NaNs.

I assume I just have to change the Start Date and Run Length options to coincide with the start dump time, set the atmosphere dump and the ocean dump, and that is all that's needed to start from a dump?

Best regards,

Simon

comment:10 Changed 8 years ago by SimonDriscoll

Hi Willie,

also, are some values NaNs simply because of the run set-up? E.g. due to differences in the sulphur cycles, say, some variables may be unused in one run yet used in another… perhaps? Can a run proceed with ocean NaNs for some chemical species? Or are we sure that the NaNs definitely caused the crash (it didn't crash before, for five whole years, with these values as NaNs for certain chemical species)?

Best regards,

Simon

comment:11 Changed 8 years ago by willie

Hi Simon,

NaNs are absolutely forbidden: they indicate that something has gone wrong. Either the input data is in error, or the code is in error or unstable.
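
(A toy illustration, not UM code, of why a single NaN is fatal: IEEE 754 arithmetic propagates NaN through every operation that touches it, so one corrupt point spreads through the model state within a few timesteps.)

```python
import math

# Toy 1-D "model state" with one corrupt point.
state = [1.0, float("nan"), 3.0]

def step(s):
    # Toy "timestep": average each point with its neighbours (periodic).
    n = len(s)
    return [(s[i - 1] + s[i] + s[(i + 1) % n]) / 3.0 for i in range(n)]

state = step(state)
# After one step every point has touched the NaN and become NaN itself.
print([math.isnan(x) for x in state])   # → [True, True, True]
```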

I am not sure of your set-up, so it is difficult to comment. Could you do a six-month run and look at the output ocean dumps (does it output ocean dumps)? I just looked at your ocean STASH and it is full of errors. This could be the problem. Go to the ocean STASH page and type Ctrl-V.

I hope that helps.

Regards,

Willie

comment:12 Changed 8 years ago by ros

Hi,

Just saw this in one of Willie's replies:

'If you look at field 107 (OT18: DIATOM-CHLORO; CFC12; EXTRA-C), for example, in data view you will see lots of NaNs.'

This DIATOM-CHLORO diagnostic is known to cause problems on HECToR and is switched off in all the standard jobs. Try switching off

Section: 0 Item: 120 OT18: DIATOM-CHLORO; CFC12; EXTRA-C

in the ocean STASH panel and hopefully that will help.

Cheers,
Ros.

comment:13 Changed 8 years ago by SimonDriscoll

Hi Ros,

this is brilliant, thanks. So just to be sure: you recommend I start completely afresh from my good MONSooN start dump (compile, reconfigure, NRUN, then CRUN for twenty years) with the relevant diagnostic turned off in the STASH?

Thanks again.

Simon

comment:14 Changed 8 years ago by SimonDriscoll

Hi Ros,

just to check: you say "This DIATOM-CHLORO diagnostic is known to cause problems on HECToR and is switched off in all the standard jobs." However, my run is a copy of the standard job xgadb (if we are talking about the same thing), which I obtained here: http://cms.ncas.ac.uk/index.php/um-configurations/1546-hadgem2-jobs

Simon

comment:15 Changed 8 years ago by SimonDriscoll

Hi both,

just to inform: a cp of xgyte, xgytf, is running now on CRUN to test this. So far I have three months of data. I'm not completely sure which fields I should look in for NaNs now (and looking in all of them individually would take an age…), but a quick check of randomly selected fields hasn't shown any NaNs.

Simon

comment:16 Changed 8 years ago by willie

Hi Simon,

A good way to check for NaNs is to cumf the output dump with itself: there should be no differences.
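
(The self-compare works because a NaN is the only value that compares unequal to itself under IEEE 754, so any field that differs from itself must contain NaNs. A minimal sketch of the principle in Python, on a synthetic field rather than a real UM dump:)

```python
import math

# Synthetic field with one corrupt value.
field = [1.0, 2.5, float("nan"), 4.0]

# Compare the field with itself: ordinary values match, NaNs do not.
mismatches = [i for i, x in enumerate(field) if x != x]
print(mismatches)                        # → [2]
print(math.isnan(field[mismatches[0]]))  # → True
```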

Willie

comment:17 Changed 8 years ago by SimonDriscoll

Hi Willie,

great, thanks. I've set the new run going with the relevant STASH diagnostic turned off. It started in Dec 2080 and is now at Nov 2081. I chose one of the start dumps, corresponding to Jul 2081, and then ran the cumf command.

I don't know cumf, but I searched for it; hopefully the following command is correct:

cumf -dOUT /work/n02/n02/sdrisc/xgytf xgytfo.das1710 xgytfo.das1710

I then get:

Summary in: ,/work/n02/n02/sdrisc/xgytf/cumf_summ.22132
Full output in ,/work/n02/n02/sdrisc/xgytf/cumf_full.22132
Difference maps (if available) in:,/work/n02/n02/sdrisc/xgytf/cumf_diff.22132

Then opening the difference maps, I get lots of stuff like:

Field 2698 : Stash Code 32301 : Ice Pressure (prss)

Grid type = 41
OK

But I don't know quite what I'm looking for in there (obviously, just seeing one difference would be enough, but I can't explicitly see a reference to any of these actually being a difference, and the word OK generally seems to suggest things are fine).

Could you look in the file/tell me precisely what to look for to confirm things are fine?

Best,

Simon

comment:18 Changed 8 years ago by willie

Hi Simon,

You only need to look at the summary. Unfortunately, there are 40 differences, all in the same field:

Stash Code 120 : OT18: DIATOM-CHLORO; CFC12; EXTRA-C

As Ros has mentioned, this field is problematic and should be switched off in the STASH.

comment:19 Changed 8 years ago by SimonDriscoll

Hi both,

I believe it is turned off in the stash, else I am doing something wrong.

So in the UMUI: Ocean GCM → Stash → Stash. Specification of Diagnostic Requirements. Then I switch the include option of "0. 120 OT18: DIATOM-CHLORO; CFC12; EXTRA-C" from Yes to No.

I assume this is right?

Thanks,

Simon

comment:20 Changed 8 years ago by SimonDriscoll

Hi both,

and just to clarify this is with a completely new run xgytf, not copied from the other run xgyte (so I'm not accidentally considering the wrong files/dumps etc. when the stash was turned on).

All the best,

Simon

comment:21 Changed 8 years ago by SimonDriscoll

Hi guys,

so I've created two runs, xgytf and xgytg. The original intention of xgytg was to turn off the ocean stash in a run completely separate from the original problem run xgyte.

I thought I did this with xgytf from pre-compilation stage:

"I believe it is turned off in the stash, else I am doing something wrong.


and just to clarify this is with a completely new run xgytf"

However, Willie said:

"Hi Simon,

You only need to look at the summary. Unfortunately, there are 40 differences, all the same field,

Stash Code 120 : OT18: DIATOM-CHLORO; CFC12; EXTRA-C

As Ros has mentioned, this field is problematic and should be switched off in the STASH."

I did a copy of this, xgytg and I can say that definitely this is a run with the stash turned off from pre-compilation, reconfiguration and run stage.

The output from cumf is as follows:

sdrisc@hector-xe6-9:/work/n02/n02/sdrisc/start_dumps> cumf -dOUT /work/n02/n02/sdrisc/start_dumps xgytgo.das2c10 xgytgo.das2c10
Summary in: ,/work/n02/n02/sdrisc/start_dumps/cumf_summ.14710
Full output in ,/work/n02/n02/sdrisc/start_dumps/cumf_full.14710
Difference maps (if available) in:,/work/n02/n02/sdrisc/start_dumps/cumf_diff.14710

However, I'm still not quite sure what I should look at in all of the output.

Does this run, as well as xgytf, still apparently complain about the stash despite it being turned off?

Both xgytf and xgytg are running without fault (xgytf for over 8 years, xgytg for over 3 years), so there doesn't seem to be any major issue making them crash.

Warm regards,

Simon

comment:22 Changed 8 years ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed
Note: See TracTickets for help on using tickets.