Opened 10 months ago

Closed 9 months ago

#2794 closed help (fixed)

Increasing the model time step to get over an error?

Reported by: ha392 Owned by: um_support
Component: UM Model Keywords: HadGEM3-GC3.1
Cc: Platform: ARCHER
UM Version: 10.7

Description

Hello,

My model has stopped running due to an error that we believe we can get past by increasing the time step (a change to the model that should not really affect the climate output too much). People in my office have done this for previous versions of the model in the UMUI, but not for the newer models run under Rose.

As it is an atmospheric error, I have tried increasing the atmospheric time step by 12, 24 and so on; however, I then get a new error whereby the model simply will not run when I change the time step in the way others have done.

I have been looking in the work guide but cannot find anything that might help.

Is there a rule about increasing the time step and needing to change anything else in the model (for example, if I increase the atmospheric time step, do I also have to increase the ocean time step? Or change the spatial resolution as well as the temporal one?)? Or is there anything else you have come across that might work?

Note: the model had been running fine for 138 years, so this was not an early blow-up.

Thank you,
Holly

Change History (27)

comment:1 Changed 9 months ago by willie

Hi Holly,

Could you post the before and after errors, the suite id and UM version please?

Willie

comment:2 Changed 9 months ago by ha392

Hello,

The suite id is u-bc185, UM version 10.3.

The error for the blow up is:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 100
?  Error from routine: set_thermodynamic
?  Error message: A total of    58 points had negative mass in set_thermodynamic. This indicates the pressure fields are inconsistent between different levels and the model is about to fail.
?  Error from processor: 60
?  Error number: 11
????????????????????????????????????????????????????????????????????????????????

[65] exceptions: An non-exception application exit occured.
[65] exceptions: whilst in a serial region
[65] exceptions: Task had pid=30905 on host nid01086
[65] exceptions: Program is "./atmos.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
Rank 65 [Wed Feb 20 14:51:43 2019] [c5-0c1s15n2] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 65
Rank 66 [Wed Feb 20 14:51:43 2019] [c5-0c1s15n2] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 66
_pmiu_daemon(SIGCHLD): [NID 01086] [c5-0c1s15n2] [Wed Feb 20 14:51:43 2019] PE RANK 65 exit signal Aborted
[NID 01086] 2019-02-20 14:51:43 Apid 33582694: initiated application termination
[FAIL] run_model # return-code=137
Received signal ERR
cylc (scheduler - 2019-02-20T14:52:01Z): CRITICAL Task job script received signal ERR at 2019-02-20T14:52:01Z
cylc (scheduler - 2019-02-20T14:52:01Z): CRITICAL failed at 2019-02-20T14:52:01Z

I do not have the error output from after trying to change the atmospheric time step, but it was a quick termination where the model simply did not accept the change (possibly something to do with the resolution?). The above error is the one I would like to get past in order to continue the model run, so any suggested changes for getting over a blow-up would be helpful.

Thank you for your help,
Holly

comment:3 Changed 9 months ago by ha392

Correction to the previous comment: the UM version is 10.7, not 10.3.

comment:4 Changed 9 months ago by willie

  • Keywords HadGEM3-GC3.1 added
  • UM Version set to 10.7

Hi Holly,

This type of error is described at https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints:

Why?: r2_set_thermodynamic sets up the thermodynamic fields into appropriate columns for the radiation scheme to act on. At one point it uses hydrostatic balance to make an estimate of the grid-box mass, but this can be negative if the input fields are seriously corrupted. This is a catch-all failure point for bad inputs and is not a problem with the radiation scheme.

How to investigate?:

Firstly, rerun the forecast without any changes. If this runs through OK, then there could be hardware issues with the compute nodes that ran the original forecast. Please report these nodes to the system administrators of your HPC.
If the forecast fails reproducibly, then another cause of this error is when an input file has been corrupted due to a hardware issue writing or copying the file. If you identify that file corruption has occurred then, depending on the source of the file, it may be appropriate to report the issue to the system administrators of your HPC.

I quickly checked a couple of your start dumps and they seemed OK. You can check for NaNs by using mule-cumf to compare a dump with itself: NaN-free dumps show no differences.
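For reference, a minimal sketch of that check, assuming mule-cumf is available on your path on ARCHER and using an illustrative dump path:

# NaN != NaN, so comparing a dump with itself will flag any fields containing NaNs
DUMP=/work/n02/n02/ha392/cylc-run/u-bc185/share/data/History_Data/bc185a.da19890901_00   # illustrative path
mule-cumf "$DUMP" "$DUMP"
# A clean comparison report means the dump is NaN-free.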

Normally, if there has been a grid point storm, this can be solved by reducing the atmosphere model time step. This can be done in Run Initialisation and Cycling → Atmosphere Time Steps per Day: change the 72 to 144 to halve the time step. But I'm not convinced that is the solution.
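For clarity, the steps-per-day setting maps to a time step length of 86400 s divided by that number, so doubling it from 72 to 144 halves the step:

# Time step length implied by the "Atmosphere Time Steps per Day" setting
echo "72 steps/day  -> $((86400 / 72)) s per step"    # 1200 s
echo "144 steps/day -> $((86400 / 144)) s per step"   # 600 s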

Willie

comment:5 Changed 9 months ago by ha392

Hi Willie,

Thank you for this. I have tried to run again from the previous cycle and I get the same error, so I am assuming that it must be a corrupt file. This first happened on the same day that ARCHER went down (the 15th Feb), so I might have been right in my first ticket when I thought it could be something to do with that. I have no idea where to start, but I will try to look into where this could have happened.

Thank you,
Holly

comment:6 Changed 9 months ago by willie

Hi Holly,

Ah! I think you are referring to ticket:2779. I hadn't made the connection.

If the problem was due to an ARCHER failure, then you need to restart the suite at a suitable point before the failure - see the advice at https://code.metoffice.gov.uk/trac/moci/wiki/tips_CRgeneral, especially the section on starting and restarting the suite.

Willie

comment:7 Changed 9 months ago by ha392

Hi Willie,

Okay, I am a little confused about what would be a 'suitable point'. I had restarted the model from the beginning of a previous cycle, which worked fine until it met this error again on the same cycle where it had failed before.

Thank you,
Holly

comment:8 Changed 9 months ago by willie

Hi Holly,

I was really thinking of this statement in the advice at comment:6:

Occasionally, following a failure, the component models of a coupled run can get out of sync, or if using the GC3 drivers, out of sync with the current cycletime, resulting in an error …

The steps outlined there should be considered.

Willie

comment:9 Changed 9 months ago by ha392

Hi Willie,

OK, I have checked through the restart files as mentioned; they all appear to be in sync with the cycle that fails (19890901T0000Z). Restarting the model produces the same error as above, both from this point and when I run from a previous cycle (using a warm run), which itself runs through fine.

Would trying to do the restart from archived restarts make a difference?
Or should I try the above but delete all of the 19890901T0000Z restart files, so that it does a full restart from the previous cycle (19890801T0000Z), which does run fine? (Although I am not sure how this would be better than simply doing a warm start, as all the 19890901T0000Z files should be overwritten.)

I have checked through the most recent start dumps using mule-cumf and have not found any differences. But if the error means there is a corrupt file somewhere, I am unsure where else to look.

Holly

comment:10 Changed 9 months ago by ha392

Hi Willie,

Would it be possible to have a little more guidance on this? After checking the start dumps, I am still unsure what to do next, as I did not find any corruption and it fails in the same spot after running from a previous cycle. Are there any other tickets with this problem that might be worth looking through?

Thank you,
Holly

comment:11 Changed 9 months ago by ha392

Hello,

An update on what I have tried:

After trying to find the problem and restart the model as above (with no luck), I decided to increase the time steps per day from 72 to 144, which worked fine for a couple more cycles, seemingly fixing the problem.

However, when I reduced the time steps per day back to 72, I now get a new error that I have no idea how to solve, and I did not have much luck searching other tickets:

???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 2
?  Error from routine: U_MODEL_4A
?  Error message: ACUMPS1: Partial sum file inconsistent. See Output
?  Error from processor: 117
?  Error number: 12
????????????????????????????????????????????????????????????????????????????????

[123] exceptions: An non-exception application exit occured.
[123] exceptions: whilst in a serial region
[123] exceptions: Task had pid=30740 on host nid03991
[123] exceptions: Program is "./atmos.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
Rank 123 [Tue Mar 12 20:02:46 2019] [c4-2c2s5n3] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 123

This error appears to be to do with the 'partial sum file'?
It appears to fail right at the end of the job, in coupled cycle 19900101T0000Z, and occurs on multiple processes (line 626 onward of the job.err linked below).

http://puma.nerc.ac.uk/rose-bush/view/ha392/u-bc185?&no_fuzzy_time=0&path=log/job/19900101T0000Z/coupled/05/job.err

Thank you,
Holly

comment:12 Changed 9 months ago by ha392

Note: this error occurred after restarting the model with the time steps per day reduced back from 144 to 72. I am wondering if it has anything to do with the restart rather than the model itself.

Holly

comment:13 Changed 9 months ago by ha392

Hi,

Looking at other tickets, there seems to have been some luck with perturbing the model after this error, or perhaps increasing the time steps for a couple more cycles? However, I do not have access to the perturb_theta.py file. I would like to try the perturbation method; is it possible to get access to this?

Thank you,
Holly

comment:14 Changed 9 months ago by ha392

Hi,

I realised that perturbing would have had the same effect as decreasing the time step, and the new error was to do with the partial sum files, so I moved the suiteID_s* and cycleID_suiteID_s* files out of History_Data/ (ticket #2397, comment 13).
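For reference, the move looked roughly like this (the glob patterns are assumptions based on the bc185 run id; moving the files into a backup directory is safer than deleting them):

cd /work/n02/n02/ha392/cylc-run/u-bc185/share/data/History_Data
mkdir -p psum_backup
mv bc185*_s* psum_backup/      # "suiteID_s*" partial-sum files (pattern assumed)
mv *_bc185*_s* psum_backup/    # "cycleID_suiteID_s*" partial-sum files (pattern assumed)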

However, I am now getting a new error.

BUFFIN: Read Failed: No such file or directory
[1]
[1] ????????????????????????????????????????????????????????????????????????????????
[1] ???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
[1] ?  Error code: 22
[1] ?  Error from routine: io:buffin
[1] ?  Error message: Error in buffin errorCode= 0.00 len=0/512
[1] ?  Error from processor: 0
[1] ?  Error number: 0
[1] ????????????????????????????????????????????????????????????????????????????????

I have no idea which file or directory is missing. This error is a little more ominous than the last.

Holly

comment:15 Changed 9 months ago by ha392

I replaced the cycleID_suiteID_s* files for the previous cycle. I am now getting a slightly different error. I believe that this error is occurring in the final step, as I am still getting output files and the model ran for the full length of time. I am wary of trying much else by myself now, as these errors are quite expensive to reach at this point. Any help here would be greatly appreciated.

BUFFIN: Read Failed: Success
[1]
[1] ????????????????????????????????????????????????????????????????????????????????
[1] ???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
[1] ?  Error code: 22
[1] ?  Error from routine: io:buffin
[1] ?  Error message: Error in buffin errorCode= 0.00 len=0/512
[1] ?  Error from processor: 0
[1] ?  Error number: 0
[1] ????????????????????????????????????????????????????????????????????????????????


comment:16 Changed 9 months ago by grenville

Holly

Please set PRINT_STATUS to 'Extra diagnostic messages' and rerun.

Sounds like you found perturb_theta.py?

This error ("ACUMPS1: Partial sum file inconsistent") can be fixed by changing the STASH to not include the diagnostic causing the climate meaning error - you would then need to handle that diagnostic separately.
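One way to see which diagnostic is implicated is to search the failed task's output for the ACUMPS report; the path below is an assumption based on the usual cylc work-directory layout on ARCHER:

# "See Output" in the error message refers to the per-PE output of the failed coupled task
grep -r -B2 -A10 "ACUMPS" /work/n02/n02/ha392/cylc-run/u-bc185/work/19900101T0000Z/coupled/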

Grenville

comment:17 Changed 9 months ago by ha392

Hi Grenville,

Currently running with extra diagnostics, I will update with output.

Not quite, but I don't think it would have had a very different outcome from what I did in increasing the number of time steps per day, so I will leave that idea for now.

Okay, hopefully I will get a useful output now and will work from there.

Thank you,
Holly

comment:18 Changed 9 months ago by ha392

Hi Grenville,

Extra diagnostics have been run. What would you suggest?

u-bc185 19900101T0000Z

Thank you,
Holly

comment:19 Changed 9 months ago by grenville

Holly

I'm not sure that moving the partial sum files was a good idea - it looks like the model expects something in a partial sum file which is empty.

In trying to solve the "…58 points had negative mass in…" error, another problem has been created. I'm tempted to suggest that you restart the model from a known good state (I'd create a new suite for this). What data would you lose if you started a new run from the latest good start data? You may lose the most recent yearly means, the most recent seasonal means, and the most recent monthly means - so in essence just the most recent monthly means, since you can recreate the others (we suggest that you switch off 10-year means). It looks like your model started the 19900101T0000Z cycle OK, so I'd try starting a new suite at that point - that way you are not carrying around the climate meaning problem.

Grenville

comment:20 Changed 9 months ago by ha392

Hi Grenville,

I have a new suite set up for the restart. As I want to make sure I do this right, what sections do I need to change when setting up the start dumps to run from 19900101T0000Z? Also, for turning off 10-year means, is that in Dumping and Meaning?

(Sorry if basic questions, I just know I would do something wrong).

Thank you,
Holly

comment:21 Changed 9 months ago by grenville

Holly

You'll need to set the Model basis time to 1990 01 01

atm start dump to /work/n02/n02/ha392/cylc-run/u-bc185/share/data/History_Data/bc185a.da19900101_00

NEMO start dump to /work/n02/n02/ha392/cylc-run/u-bc185/share/data/History_Data/NEMOhist/bc185o_19900101_restart.nc

NEMO icebergs start dump to /work/n02/n02/ha392/cylc-run/u-bc185/share/data/History_Data/NEMOhist/bc185o_icebergs_19900101_restart.nc

CICE start dump to /work/n02/n02/ha392/cylc-run/u-bc185/share/data/History_Data/CICEhist/bc185i.restart.1990-01-01-00000.nc

Passive tracers restart file to /work/n02/n02/ha392/cylc-run/u-bc185/share/data/History_Data/NEMOhist/bc185o_19900101_restart_trc.nc

I'd switch off climate meaning (set l_meaning_sequence to false) until you are confident the model is going. I'd also set it up for a short run initially (maybe just a few days, with a short wallclock limit to help it through the queues).
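Before launching, it is worth confirming that each of those restart files exists and is readable - a quick sketch (ncdump availability on ARCHER is an assumption):

HIST=/work/n02/n02/ha392/cylc-run/u-bc185/share/data/History_Data
ls -l $HIST/bc185a.da19900101_00 \
      $HIST/NEMOhist/bc185o_19900101_restart.nc \
      $HIST/NEMOhist/bc185o_icebergs_19900101_restart.nc \
      $HIST/NEMOhist/bc185o_19900101_restart_trc.nc \
      $HIST/CICEhist/bc185i.restart.1990-01-01-00000.nc
# Peek at the NEMO restart header to confirm it is not truncated or corrupt
ncdump -h $HIST/NEMOhist/bc185o_19900101_restart.nc | head -n 20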

I'll probably have forgotten something but please try this.

Grenville

comment:22 Changed 9 months ago by ha392

Hi Grenville,

So the good news is that the new model ran successfully for a month. However, upon restarting the model with the run length extended and meaning and dumping turned back on, I am now getting a new error. Suite u-bh072, cycle 19900201T0000Z.

Thank you,
Holly

comment:23 Changed 9 months ago by grenville

Holly

It's failed in meanctl.F90 - I'm not sure why. Please try switching on means and running from 19900101T0000Z.

Grenville

comment:24 Changed 9 months ago by ha392

Hi Grenville,

For some reason the suite will not let me warm start or cold start the model.

[FAIL] ssh -oBatchMode=yes login.archer.ac.uk bash --login -c \'ROSE_VERSION=2016.11.1\ rose\ suite-run\ -v\ -v\ --name=u-bh072\ --run=run\ --remote=uuid=47efbd15-67c0-4212-a335-f0e7b411d51c,root-dir=$DATADIR\' # return-code=2, stderr=
[FAIL] --------------------------------------------------------------------------------
[FAIL] This is a private computing facility. Access to this service is limited to those
[FAIL] who have been granted access by the operating service provider on behalf of the
[FAIL] contracting authority and use is restricted to the purposes for which access was
[FAIL] granted. All access and usage are governed by the terms and conditions of access
[FAIL] agreed to by all registered users and are thus subject to the provisions of the
[FAIL] Computer Misuse Act, 1990 under which unauthorised use is a criminal offence.
[FAIL]
[FAIL] If you are not authorised to use this service you must disconnect immediately.
[FAIL] --------------------------------------------------------------------------------
[FAIL]
[FAIL] [FAIL] 2019-03-21T15:01:16+0000 tar -czf log.20190320T160455Z.tar.gz log.20190320T160455Z # return-code=2, stderr=
[FAIL] [FAIL] 2019-03-21T15:01:16+0000 tar: log.20190320T160455Z.tar.gz: Wrote only 2048 of 10240 bytes
[FAIL] [FAIL] 2019-03-21T15:01:16+0000 tar: Error is not recoverable: exiting now

Thank you,
Holly

comment:25 Changed 9 months ago by grenville

Holly

Just a guess, delete log.20190320T160455Z and try again.
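If it helps, the clean-up would look something like this on the ARCHER login node (the location of the run directory is an assumption; it may be a symlink into $DATADIR depending on how the suite was installed):

ssh login.archer.ac.uk
rm -rf ~/cylc-run/u-bh072/log.20190320T160455Z   # the stale log directory that tar could not pack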

comment:26 Changed 9 months ago by ha392

Hi Grenville,

It has run for a couple of years without problems, so I think it should be safe to close the ticket. Thank you for your help.

Holly

comment:27 Changed 9 months ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed

Hi Holly

I'm glad it's going. We never did get to the cause of the "58 points had negative mass…" error - let's hope it doesn't reappear.

Grenville
