Opened 4 years ago

Closed 4 years ago

#1664 closed help (worksforme)

UM jobs crashing after one day

Reported by: peatman
Owned by: um_support
Component: UM Model
Keywords:
Cc: s.j.woolnough@…
Platform: ARCHER
UM Version: 8.2

Description

I am having trouble with some UM jobs on ARCHER which keep crashing after 1 day of model time. There seems to be something wrong with the moisture conservation, giving strange values on the 72nd time step (with a time step of 20 minutes) which then cause the model to blow up on the 73rd time step, e.g.:
Atm_Step: Timestep 73 Model time: 2008-12-28 12:20:00
Qcf < 0 fixed by PC2 940
Qcl < 0 fixed by PC2 861

==============================================
initial Absolute Norm : 10139975.439437283
GCR( 2 ) failed to converge in 100 iterations.
Final Absolute Norm : 209889.38249995536
==============================================

Q_POS: unable to conserve in 24529 columns
Q_POS: unable to conserve in 26069 columns
Q_POS: unable to conserve in 22529 columns
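
For anyone triaging similar output, the following is a minimal sketch of how a .leave file could be scanned for the first time step that reports a convergence or conservation failure. It is not part of the UM; the function name and the marker strings are assumptions based only on the excerpt above, and the wording may differ in other UM versions.

```python
import re

# Patterns copied from the log excerpt in this ticket; other UM versions
# may word these messages differently (assumption).
TIMESTEP_RE = re.compile(r"Atm_Step:\s+Timestep\s+(\d+)")
FAILURE_MARKERS = ("failed to converge", "Q_POS: unable to conserve")


def first_failing_timestep(lines):
    """Return the time step under which a failure marker first appears, or None."""
    current = None
    for line in lines:
        match = TIMESTEP_RE.search(line)
        if match:
            # Remember the most recent time step header seen so far.
            current = int(match.group(1))
        elif current is not None and any(m in line for m in FAILURE_MARKERS):
            return current
    return None


sample = """\
Atm_Step: Timestep 73 Model time: 2008-12-28 12:20:00
Qcf < 0 fixed by PC2 940
GCR( 2 ) failed to converge in 100 iterations.
Q_POS: unable to conserve in 24529 columns
""".splitlines()

print(first_failing_timestep(sample))
```

In practice the `lines` argument could be an open file handle on the .leave file, since the function only iterates over it once.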

I have run several different jobs from experiment xlsu (vn8.2, N96, aquaplanet runs) and all have failed except xlsua, xlsub and xlsuc, which all ran successfully for 3.5 years when I gave them the start dump /work/n02/n02/peatman/um/dumps/xiurl.astart. Strangely, xlsua failed on the 73rd time step when I used the start dump /work/n02/n02/peatman/um/dumps/xiurea.da20070112_12 but changing to the xiurl.astart dump seemed to fix the problem. However, xlsud, xlsue, xlsug and xlsuh all failed in the same way when I used xiurl.astart, so the choice of start dump turned out not to be the problem after all. Even more strangely, today I reran xlsua (changing to a 2-day run and switching off all STASH output and dumping, just to speed things up) and it failed! So the problem seems to be intermittent…

My xlsu jobs include my own code branch which changes the humidity seen by the entrainment scheme, but this isn't the problem because I created a job xlwca which does not contain that branch and this also crashes in the same way. I've also found that the model crashing does not depend on whether or not I run the reconfiguration.

Every 24 hours the .leave files contain a list of values, output by eng_mass_diag, relating to things like energy correction and moisture conservation. For the time step before the one which blows up, these values are along the lines of:
Final moisture = 0.11793E+17 KG
Initial moisture = 0.30365E+20 KG
change in moisture = -0.30353E+20 KG
Usually (i.e., in runs which don't blow up) the first two values would both be of the order of 0.1E+17 and the third of the order of ±0.1E+14.
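
The check described above can be sketched as a small script. This is only an illustration: the line format is taken from the three lines quoted above, the function names are hypothetical, and the factor-of-ten threshold is an arbitrary assumption rather than anything defined by eng_mass_diag.

```python
import re

# Matches lines such as "Final moisture = 0.11793E+17 KG"; the
# "change in moisture" line is deliberately not matched.
VALUE_RE = re.compile(
    r"^\s*(Final|Initial) moisture\s*=\s*([-+0-9.Ee]+)\s*KG", re.IGNORECASE
)


def parse_moisture(lines):
    """Collect the 'initial' and 'final' moisture totals (kg) from log lines."""
    values = {}
    for line in lines:
        match = VALUE_RE.match(line)
        if match:
            values[match.group(1).lower()] = float(match.group(2))
    return values


def looks_spurious(initial, final, max_ratio=10.0):
    """Flag totals whose magnitudes differ by more than max_ratio (assumed threshold)."""
    low, high = sorted((abs(initial), abs(final)))
    return low == 0.0 or high / low > max_ratio


sample = [
    "Final moisture = 0.11793E+17 KG",
    "Initial moisture = 0.30365E+20 KG",
    "change in moisture = -0.30353E+20 KG",
]
values = parse_moisture(sample)
change = values["final"] - values["initial"]  # agrees with the logged change
print(change, looks_spurious(values["initial"], values["final"]))
```

For the values quoted above the difference reproduces the logged change in moisture, and the initial total is flagged because it is roughly three orders of magnitude larger than the final one.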

I haven't managed to find anyone else who has had the same problem, even though I've managed to demonstrate that it's not my own code changes which are responsible. Steve (cc'd) and I can't tell where the problem lies; are you able to shed any light on what is going wrong? Please could you cc Steve in on any replies.

Many thanks,
Simon

Change History (23)

comment:1 Changed 4 years ago by willie

Hi Simon,

Job xlsuj has segmentation faults. These may be due to the failure to converge. Try halving the time step and, in science section 13, press DIAG_PRN and set the print frequency to every time step rather than every 72.

Regards

Willie

comment:2 Changed 4 years ago by peatman

Willie,

All of the jobs I have mentioned which don't converge on the 73rd time step crash by seg faulting; it is not unique to xlsuj. However, I shall try changing the settings you have suggested.

Simon

comment:3 Changed 4 years ago by peatman

Willie,

I have now done this (see xlsuj000.xlsuj.d15267.t115251.leave). The model now crashes on the 145th time step.

I have also tried a run in which I change the energy correction (section 14) to every 12 hours rather than every 24 (with a 20-minute time step again) and it crashes on the 37th time step (see xlwca000.xlwca.d15267.t121747.leave). So it seems it is the energy correction which messes things up, causing the crash one time step later. However, it is still confusing that it is always the *initial* moisture value which looks spurious in the energy correction output, not the final moisture value.
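
The pattern across the three runs reported so far (20-minute steps with a 24-hour correction, 10-minute steps with a 24-hour correction, and 20-minute steps with a 12-hour correction) can be written down as a short calculation. This is a sketch of the arithmetic only; the function name is hypothetical and it assumes the crash always falls one time step after an energy correction call.

```python
def crash_step(timestep_minutes, correction_period_hours):
    """Time step immediately after the first energy correction call."""
    steps_per_period, remainder = divmod(
        correction_period_hours * 60, timestep_minutes
    )
    if remainder:
        raise ValueError("correction period must be a whole number of time steps")
    return steps_per_period + 1


# The three configurations described in this ticket.
for minutes, hours in [(20, 24), (10, 24), (20, 12)]:
    print(minutes, hours, crash_step(minutes, hours))
```

The results, 73, 145 and 37, match the time steps at which the three runs failed.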

Simon

comment:4 Changed 4 years ago by willie

Hi Simon,

Thanks. Something nasty is happening at t/s 145: the temperature exceeds 488 K and the vertical wind velocity is unphysically large. Could you repeat the run with dumping at t/s 144 and 145, please? This is done in Control > Post Proc > Dumping and meaning. Select irregular dumps, press next, enter 144 and 145 in the table and check "timesteps". Then we'll be able to look at the dumps and see if we can spot any problems.

Regards

Willie

comment:5 Changed 4 years ago by peatman

Willie,

When I do this, because the file names of the dumps contain time stamps as YYYYMMDD_HH, the second dump will overwrite the first from 10 minutes earlier. Is there a way of changing the file naming convention for the dumps?
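
The collision can be illustrated with a short sketch. The start time below is inferred from the log excerpt in the description (time step 73 at 2008-12-28 12:20 with 20-minute steps), and the helper name is hypothetical; only the YYYYMMDD_HH stamp is modelled, not the full dump file name.

```python
from datetime import datetime, timedelta


def dump_stamp(start, step, timestep_minutes):
    """Return the YYYYMMDD_HH stamp for the model time at the end of a time step."""
    model_time = start + timedelta(minutes=step * timestep_minutes)
    return model_time.strftime("%Y%m%d_%H")


# Inferred start of the run (assumption, see above).
start = datetime(2008, 12, 27, 12)

# With the halved, 10-minute time step, steps 144 and 145 fall in the same hour.
print(dump_stamp(start, 144, 10))  # 20081228_12
print(dump_stamp(start, 145, 10))  # 20081228_12
```

Because the stamp resolves only to the hour, any two dumps written within the same model hour get the same name.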

Simon

comment:6 Changed 4 years ago by willie

Simon,

In Input/Output > Time convention, select relative time in time steps.

Willie

comment:7 Changed 4 years ago by peatman

Willie,

Thanks - I've now run this and the .leave file is xlsuj000.xlsuj.d15267.t143204.leave. The dumps are located in /work/n02/n02/peatman/um/xlsuj

Simon

comment:8 Changed 4 years ago by grenville

Simon

Please point me to leave files for jobs which didn't fail.

Grenville

comment:9 Changed 4 years ago by peatman

Grenville,

If you do an ls -rtl on my /home/n02/n02/peatman/output directory, all the .leave files from Sep 10 18:15 to Sep 13 01:37 were successful runs, and also Sep 14 22:32 to Sep 16 12:21.

Simon

comment:10 Changed 4 years ago by willie

Hi Simon,

The only difference I can see between xlsuc (which works) and xlsuj (which fails) is that

/work/n02/n02/peatman/um/ancillaries/xlsua.q.tMean.ancil

is replaced by

/work/n02/n02/peatman/um/ancillaries/xlsub.q.tMean.ancil

i.e. the specific humidity climatology. I've checked the second for NaNs but it is OK. Perhaps some strange values have crept in?
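
A check for "strange values" beyond NaNs might look like the sketch below. It assumes the humidity field has already been read into a flat sequence of floats by some other tool (reading the ancillary file itself is not shown), and the function name is hypothetical. The range test relies only on specific humidity being a mass fraction in kg/kg, so valid values lie between 0 and 1.

```python
import math


def summarise_humidity(values):
    """Count NaN, infinite and out-of-range entries in specific humidities (kg/kg)."""
    counts = {"nan": 0, "inf": 0, "negative": 0, "above_one": 0}
    for value in values:
        if math.isnan(value):
            counts["nan"] += 1
        elif math.isinf(value):
            counts["inf"] += 1
        elif value < 0.0:
            counts["negative"] += 1
        elif value > 1.0:
            counts["above_one"] += 1
    return counts


# Made-up values purely to exercise each branch.
print(summarise_humidity([0.01, float("nan"), -1e-6, float("inf"), 2.0]))
```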

Regards,

Willie

comment:11 Changed 4 years ago by peatman

Willie,

I know that the different ancillary file is not the problem. The specific humidity data are needed for my entrainment code changes, but as I've explained above the same problem exists whether or not I include those code changes (and when I don't include them, I don't provide the ancillary file either).

Simon

comment:12 Changed 4 years ago by willie

Hi Simon,

Which is the failing job that doesn't have your code changes or ancillary file?

regards

Willie

comment:13 Changed 4 years ago by peatman

Willie,

xlwca doesn't have my code changes in it. Note that the .leave files from it contain lots and lots of rows of numbers which can be ignored - they are coming from a WRITE statement in fcm:UM/branches/dev/swr05npk/vn8.2_aquaplanet_fixedeqx/src

Simon

comment:14 Changed 4 years ago by willie

Hi Simon,

xlwca fails to converge at time step 4 and this causes a segmentation fault. To get further you could make the following changes:

  • On the Change default output file names panel switch this off as you haven't specified any changes
  • In the Input/Output > Script Modifications panel, add the environment variable ATP_ENABLED and set it to one: this will give a detailed traceback
  • In Scientific parameters > Section by Section > Section 13, press DIAG_PRN and enable printing every time step
  • dumping every time step as we did before can also help

and rerun.

Compared with xlsua, which worked, some branches have been removed and others switched on, the energy correction is called more frequently, and the specific humidity ancillary is no longer used. Any of these could cause the problem. I am no expert in aquaplanet models and cannot advise on the details of the code.

I hope that helps.

Regards

Willie

comment:15 Changed 4 years ago by grenville

Simon

I stripped out all the branches bar the ncas branch (this is now the 8.2 model we use for UM training and, as you may recall, it runs a global model with energy correction). It fails in the same way with both my aquaplanet start dump and your dump xiurl.astart (I believe it's not calculating the initial energy correctly).

Running with /work/n02/n02/lboljka/startdump_ancillary/aquapStart.da20090112_12 (and reconfiguring it) works just fine: the initial energy looks OK and the energy correction looks OK too (see xlwhj000.xlwhj.d15274.t160848.leave).

Can you use this start dump?

comment:16 Changed 4 years ago by peatman

Grenville,

When I try the new start dump I get an "Atmosphere basis time mismatch" error so presumably there's something I'm failing to change properly in the UMUI (see xlsuj and xlwca - xlsuj000.xlsuj.d15275.t123222.leave and xlwca000.xlwca.d15275.t115611.leave). Can you see what I've done wrong?

Simon

comment:17 Changed 4 years ago by grenville

Simon

Just set the time in input/output control → start date.

Set it to 2009, 1, 12, 12 (see my job xlwhj).

Grenville

comment:18 Changed 4 years ago by peatman

Grenville,

Yes, I'd already done that but I still get the error. I was wondering if there was any other setting I need to change.

Simon

comment:19 Changed 4 years ago by grenville

Simon

Not sure - try unchecking

Override year in dump with..
Resetting data time to verification …

in the reconfiguration → general reconf options

Grenville

comment:20 Changed 4 years ago by peatman

Grenville,

Thanks for your email - I've now sorted the start dump file names. Using Lina's start dump I've managed to run the model for 2 whole days without it blowing up. I'm now setting it to do my full 3.5 year run and I'll report back whether it works.

Even if it does work, however, I'm still a bit concerned. As I said in my original post, changing the start dump seemed to fix the problem before but then the problem returned. There's no guarantee, therefore, that this new start dump will continue to work indefinitely.

Simon

comment:21 Changed 4 years ago by peatman

Grenville,

xlsuj ran successfully for 3.5 years over the weekend so we can tentatively say that this new start dump has fixed the problem. I'll let you know if it starts to go wrong again.

Thanks for your help,
Simon

comment:22 Changed 4 years ago by grenville

Simon

OK, that's good. We're looking at the reconfiguration as the source of the bug. If the model does misbehave, please try to leave it in a state where we (you) can reproduce both working and failing versions. That way we have a better chance of identifying the problem.

Grenville

comment:23 Changed 4 years ago by grenville

  • Resolution set to worksforme
  • Status changed from new to closed