Opened 2 months ago

Last modified 2 months ago

#2495 new help

New suite failing at recon

Reported by: charlie
Owned by: um_support
Priority: normal
Component: UM Model
Keywords:
Cc:
Platform: NEXCS
UM Version: 10.7

Description

Hi again,

Sorry, my new suite, which I started this afternoon, has failed again, this time at the recon stage. I have had a look at the error logs, but again can't see what the problem is.

My new suite is u-ay672, and it is an identical version of u-aw739 (which works fine and has run for 20 years). The only difference between the 2 is that my new suite uses fewer aerosol emissions files, and those that it does use I have modified. The aerosol emissions files are all in netcdf format, which surprised me at first as I thought all ancillary files had to be in UM format, but this is exactly the same in all of my other suites, which work. In all of my suites, the aerosol emissions files are not UM format.

Please can you advise?

Thanks,

Charlie

Change History (15)

comment:1 Changed 2 months ago by grenville

Charlie

This is where the problem occurs - (you had the same error earlier in ticket 2480)

/home/d05/cwilliams/cylc-run/u-ay672/share/fcm_make_um/preprocess-recon/src/um/src/control/dump_io/chk_look.F90 line 94

I can only suggest following the same advice.

Grenville

comment:2 Changed 2 months ago by charlie

Thanks Grenville, I have just turned on the extra diagnostics as before, and resubmitted, so hopefully that will tell us exactly which ancillary file is causing the problem.

Last time, the issue was relatively simple, as it could only be the SST ancillary (because that's all I had modified) and the answer was a schoolboy error on my part: not checking the times correctly and not realising that xconv changes the times when converting from UM format to netcdf.

This time, however, there are 7 new ancillary files, so it could be any one of them. Moreover, it won't be the same problem as last time, because the files were already in netcdf so I didn't need to convert them from (and back to) UM format. I have doublechecked as much meta data as I can see in xconv, comparing my files to the originals, and they look identical. But presumably some other meta data, that I can't see, is wrong? Hopefully the extra diagnostics will tell us which one and why. When the job fails again, where should I look to get these extra diagnostics? And where did you look to get the above error?

Many thanks,

Charlie

comment:3 Changed 2 months ago by charlie

Okay, it has now failed again at the same stage. I have looked at the error output file, as well as the

/home/d05/cwilliams/cylc-run/u-ay672/share/fcm_make_um/preprocess-recon/src/um/src/control/dump_io/chk_look.F90

But nowhere does it seem to say exactly which of the ancillaries is causing the problem, and why. Is this specified somewhere else?

comment:4 Changed 2 months ago by grenville

The last file it was reading was /home/d05/cwilliams/ga71/ancils/vegfrac/qrparm.veg.frac which has inverted latitudes.

Please check all your files
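A check of this kind can be sketched in a few lines of Python. This is only an illustration, assuming the latitude values and the field have already been read into plain Python lists (reading the ancillary file itself is not shown); the function names `latitude_order` and `flip_to_south_first` are made up for this example.

```python
def latitude_order(lats):
    """Classify a 1-D list of latitudes by the direction it runs in."""
    diffs = [b - a for a, b in zip(lats, lats[1:])]
    if all(d > 0 for d in diffs):
        return "south_to_north"
    if all(d < 0 for d in diffs):
        return "north_to_south"
    return "mixed"


def flip_to_south_first(lats, field):
    """Reverse the latitude axis of a [lat][lon] field if it runs north to south."""
    if latitude_order(lats) == "north_to_south":
        return lats[::-1], field[::-1]
    return lats, field


# Hypothetical 3 x 2 field stored north to south.
lats = [60.0, 0.0, -60.0]
field = [[1, 2], [3, 4], [5, 6]]

new_lats, new_field = flip_to_south_first(lats, field)
print(latitude_order(lats), "->", latitude_order(new_lats))
```

Running the same classification over every modified ancillary makes it easy to spot the one file whose latitudes run the opposite way to the rest.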

comment:5 Changed 2 months ago by charlie

Hi Grenville,

Now this is just weird, and is not just a schoolboy error like last time. In fact it is something I have been worried about for some time. The Eocene version of that file, which Will made and which works with the model (I have run it for 20 years using this) is upside down relative to what I always expected i.e. it goes north to south. Every other ancillary file goes south to north. When I questioned him about this, he said that was normal and it should be. Plus, if I check what is coming out of the restart dump, it is the right way up. So he said that the model was able to reverse it internally, and it works.

Therefore, when it came to modifying this file further, I naturally assumed he was right so I kept the latitudes the same i.e. north to south. I was always worried about this, because if I compare this to the modern version of this ancillary, the modern version is the right way up i.e. south to north. And now you are saying that the model is falling over because it's the wrong way up. So how can the Eocene version work, which is upside down, but my version not work?

Either way, I will try reversing it again (so it is consistent with all the other ancillary files) and see if that works.

Charlie

comment:6 Changed 2 months ago by charlie

Hi again,

Right, it has now got past the recon stage and has failed at the atmos_main stage. There are numerous errors in the output log but I'm not sure which is the problem?

Charlie

comment:7 Changed 2 months ago by grenville

I can only see log files for a failed reconfiguration stage for u-ay672 (from June 13 at 08:58)

u-aw739 stopped in time step 0, which is indicative of a problem with input data. How does the start data differ cf the successful run?

comment:8 Changed 2 months ago by charlie

Sorry Grenville, my mistake, please ignore the last comment (comment 6). I mixed up my suites.

I have now resubmitted u-ay672 and it got past the recon stage, but failed again at exactly the same point of the atmos_main stage. I think I have found the error, which points me to one of the known failure points at https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints:

Convergence failure in BiCGstab, omg is too small
Why?: This is a trap within the solver for the case where the solution method is breaking down (stagnating) and is a controlled exit to avoid floating point underflows or divisions by zero. Because of this checking it has become a general failure point where a NaN (or large unphysical increment) has been generated in a physics scheme (or read in from an input file) and has subsequently been passed to the dynamics. It can also be a failure point if the resolution is very anisotropic, e.g. high vertical resolution and low horizontal resolution.

How to investigate?: Run the model with output diagnostics set to high ([env]PRINT_STATUS=PrStatus_Diag) as this switches on the summary information for physics increments. This will identify if a NaN has been generated by a physics scheme and allows you to narrow down where the problem is.

However, I confess I don't understand what this means. I already have my extra output diagnostics turned on, so would you be able to advise whether or not this is saying where the problem lies?

I have doublechecked my aerosol ancillary files (which I modified) and am certain that none of them contain NaNs (which is what the above error suggests), so I don't think that's the problem.
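For reference, a check along these lines could look like the sketch below. It assumes the emissions values have already been extracted into a flat Python list (reading the netcdf file is not shown), and the function name `summarise` is hypothetical. Since the failure-point note also mentions large unphysical values read in from an input file, the sketch reports the finite minimum and maximum as well as the count of NaN or infinite entries.

```python
import math


def summarise(values):
    """Count non-finite entries and report the range of the finite ones."""
    finite = [v for v in values if math.isfinite(v)]
    return {
        "non_finite": len(values) - len(finite),
        "min": min(finite) if finite else None,
        "max": max(finite) if finite else None,
    }


# Hypothetical sample values, including one NaN and one infinity.
sample = [0.0, 1.5e-12, float("nan"), 3.0e-11, float("inf")]
report = summarise(sample)
print(report)
```

A clean result here only rules out bad values in the file itself; it says nothing about values generated later inside the model.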

Charlie

comment:9 Changed 2 months ago by grenville

It's not referring to NaNs in the ancillary files. The NaNs come about in calculations in the model - which, at time step 1, likely occur because of problems with input data. I think you have demonstrated this - the model ran OK with a different set of start data?

You need to set PRINT_STATUS to PrStatus_Diag (you may have previously set reconf. print status.)
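One way to confirm which value is actually set is to scan the app configuration text for the variable. The sketch below is a rough illustration only: it assumes the file uses simple `[section]` headers and `name=value` lines, the function name `find_settings` is made up, and the example text (including `[namelist:example]` and `some_switch`) is hypothetical rather than taken from any real suite.

```python
def find_settings(conf_text, name="PRINT_STATUS"):
    """Return (section, value) pairs for every line that sets `name`."""
    section = None
    found = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif "=" in line and not line.startswith("#"):
            key, value = line.split("=", 1)
            if key.strip() == name:
                found.append((section, value.strip()))
    return found


# Hypothetical configuration text for illustration.
example = """[env]
PRINT_STATUS=PrStatus_Normal
[namelist:example]
some_switch=.true.
"""

settings = find_settings(example)
print(settings)
```

In practice the text would be read from the suite's own rose-app.conf, and the result shows at a glance whether the model's print status has really been changed or only the reconfiguration's.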

Do you have the start file for the job which ran successfully?

comment:10 Changed 2 months ago by charlie

I don't understand. The start data between u-aw739 (which ran for 20 years when using the standard (unmodified) aerosol emissions files) and u-ay672 (which fails at the first time step) are exactly the same. I haven't done anything to the input start files. The only differences between these 2 suites are a slightly modified vegetation fraction ancillary file (which was causing a problem because it was upside down, but that's now been rectified) and the 7 new aerosol emissions ancillary files. These files are based on the existing preindustrial control ancillary files, but modified by me to be Eocene (actually a zonal mean of the original, calculated over land and sea separately). Otherwise, the 2 suites should be identical.
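To make the zonal-mean step concrete, here is a minimal sketch of one possible reading of it, not the actual script used. It assumes a [lat][lon] field and a boolean land mask of the same shape, both as plain Python lists; the function name `zonal_mean_by_surface` and the sample numbers are hypothetical. Each point is replaced by the mean of its latitude row, taken over land points if the point is land and over sea points otherwise.

```python
def zonal_mean_by_surface(field, land_mask):
    """Replace each value with its row mean over points of the same surface type."""
    result = []
    for row, mask_row in zip(field, land_mask):
        land = [v for v, is_land in zip(row, mask_row) if is_land]
        sea = [v for v, is_land in zip(row, mask_row) if not is_land]
        # A row may have no land (or no sea) points; its mean is then undefined.
        land_mean = sum(land) / len(land) if land else None
        sea_mean = sum(sea) / len(sea) if sea else None
        result.append([land_mean if is_land else sea_mean for is_land in mask_row])
    return result


# Hypothetical 2 x 4 field and land mask.
field = [[1.0, 3.0, 10.0, 20.0],
         [2.0, 2.0, 2.0, 6.0]]
land_mask = [[True, True, False, False],
             [False, False, False, True]]

averaged = zonal_mean_by_surface(field, land_mask)
print(averaged)
```

Which land mask the averaging uses, and which mask the result is later applied to, are not stated in the ticket, so the sketch simply uses one mask for both.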

Charlie

comment:11 Changed 2 months ago by grenville

That's a lot of differences - maybe not to the start file, but a whole different set of emissions files.

comment:12 Changed 2 months ago by charlie

Hi Grenville,

Right then, possibly getting somewhere, although I don't know why.

For now, let's forget about my new suite (u-ay672) with modified aerosol emissions. Let's just focus on u-aw739 and u-ay314. The first is my Eocene suite (i.e. with all the modifications to the land mask etc), and the 2nd is a modern suite (i.e. using all standard files for everything). The Eocene suite ran for 20 years initially when it was using the modern aerosol emissions. 3 of these emissions were timeseries ending in 2010, so it failed when it ran out of time. In contrast, the modern suite uses all climatology versions of the aerosol emissions, and it is working fine (currently 8 years in).

I have just tried running u-aw739 again, from the beginning, but using the climatology versions of these files. In other words, all of my emissions files, as specified by ~/roses/u-aw739/app/um/rose-app.conf (specified directly, not using any environment variables), can be found in /projects/um1/ancil/atmos/n96e/ukca_emiss/cmip5/2000/v2/. If I run with this, it again fails at the very first time step of the atmosphere_main stage and gives me the same error about trying to create NaNs (i.e. the same error I got with my modified aerosol emissions in u-ay672).

So this implies, to my mind, that it's not my modifications which are the problem, but rather these 3 aerosol emissions. It ran fine when I used the timeseries versions, but not the climatology version.

To complicate things even further, however, my modern suite (u-ay314) uses exactly the same aerosol emissions files as I have just tried with u-aw739, but works and doesn't give this error. In other words, the Eocene suite works with the timeseries versions but not the climatology versions, whereas the modern suite works fine with the climatology versions.

So this implies, again to my mind at least, that there is something in the Eocene configuration which is conflicting with the climatology versions of one of these 3 aerosol emissions. In the modern suite, this doesn't exist and so it works with the climatology versions.

Does that make sense? What do I need to turn on in terms of diagnostics to investigate this further i.e. to find out exactly why the Eocene suite doesn't like one of these climatology aerosol emissions whereas the modern suite doesn't mind?

Charlie

comment:13 Changed 2 months ago by grenville

Did you set RINT_STATUS to PrStatus_Diag for the models?

comment:14 Changed 2 months ago by grenville

PRINT_STATUS that should say

comment:15 Changed 2 months ago by charlie

I don't seem to have a PrStatus_Diag available. If I search for PRINT_STATUS it gives me the atmosphere only tab, but then the options are Minimal, Normal, Extra. I have just changed this to Extra, I'm assuming this is the same thing? Assuming yes, I have just resubmitted my job and it has failed at the same point. Do these extra diagnostics shed any light?
