Opened 10 months ago

Closed 9 months ago

#3169 closed help (fixed)

Ongoing problems with vegetation/soils ancillary

Reported by: charlie Owned by: um_support
Component: UM Model Keywords:
Cc: Platform: NEXCS
UM Version: 10.7

Description

Hi,

Sorry to bother you yet again with what feels like an endless problem, but I am still having problems running my suite with modified vegetation/soil ancils (specifically 4 of them: vegetation fraction, vegetation function, soil parameters and soil dust). My suite is u-br178, and the ancils are all on NEXCS under /home/d05/cwilliams/pliocene/gc31/ancils/, in the vegetation and soils subdirectories respectively.

Unfortunately I can't try running my suite with each of these in isolation to find which one is the problem (e.g. my modified vegetation fraction and PI everything else), because my ice masks are different from the original versions. In other words, the model would crash straightaway if I did this, because, for example, my modified veg would have ice where the PI soil says there is no ice, or vice versa. So I need to run with all 4 of these simultaneously.

I have double-checked all of the schoolboy errors that were highlighted by previous tickets on this matter (a rough sketch of these checks follows the list below), i.e.

  • The masks do match my LSM
  • The ice masks agree between the veg and the soils, i.e. the soils are appropriate both where I have ice and where I don't
  • All the tiles that need to sum to 1, do (e.g. with the fractions and soil dust)
  • There are no sharp gradients (or at least no sharper than in the original versions)
  • All values are within the range of the original values i.e. there are no extreme spikes
  • All missing data flags (e.g. over ocean) are currently set to 2.0000e+20
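
For reference, checks along these lines can be scripted rather than done by eye - an untested sketch using netCDF4/numpy, in which the file paths, variable names and the assumed (tile, lat, lon) layout are placeholders rather than anything taken from my actual files:

import numpy as np
from netCDF4 import Dataset

MDI = 2.0e20  # the missing-data value written out to the netCDF files

def check_ancil(ancil_path, lsm_path, lsm_var, frac_var):
    # Compare a multi-tile ancillary field against the land-sea mask and basic sanity limits.
    lsm = Dataset(lsm_path).variables[lsm_var][:].squeeze()      # 1 = land, 0 = sea (assumed)
    frac = Dataset(ancil_path).variables[frac_var][:].squeeze()  # assumed (tile, lat, lon)

    # 1. Missing data should sit exactly on the sea points of the LSM
    missing = frac[0] >= 0.5 * MDI
    if not np.array_equal(missing, lsm == 0):
        print("WARNING: missing-data mask does not match the LSM")

    # 2. Tile fractions should sum to 1 over land
    total = frac.sum(axis=0)
    bad = (lsm == 1) & (np.abs(total - 1.0) > 1e-6)
    print("land points where fractions do not sum to 1:", int(bad.sum()))

    # 3. No extreme values over land
    land_vals = frac[:, lsm == 1]
    print("min/max over land:", float(land_vals.min()), float(land_vals.max()))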

However, despite all these checks, it has yet again failed. I get multiple occurrences (i.e. on multiple processes) of the following error:

?  Error from routine: EG_BICGSTAB
?  Error message: Convergence failure in BiCGstab, omg is NaN
?        This is a common point for the model to fail if it
?        has ingested or developed NaNs or infinities
?        elsewhere in the code.

I have seen this error before, and it usually occurs right at the first timestep (i.e. within the first 5 minutes of the coupled stage), and in the past it has pointed to some error in the creation of these ancils, e.g. the mask not matching or similar. This time, though, the error came about 30 minutes into the coupled stage. Does this not imply that my ancils are basically okay, but that the model is blowing up very early on for some other reason?

I did have one thought, which came about when converting my 4 files from netcdf into UM format, using xancil. When writing out to netcdf, I set my missing data flag (i.e. ocean) to 2.000e+20. For vegetation fraction, function and soil parameters, when these are then run through xancil, this flag is converted to -1.0737e+09. The files are still viewable using xconv, as it knows what the missing data are. However, this does NOT happen for soil dust. Here, when I run this through xancil, the missing data stay as 2.000e+20 even once it is UM format, and it is not viewable using xconv (i.e. because of the large values, the colour scale saturates out).

Might this be the problem? Having looked at the original PI version of soil dust, this is not the case i.e. it has -1.0737e+09 as missing data and is viewable. If this is the problem, how do I change xancil so that, just like the other 3, the missing data flag is changed? I have tried selecting "No mask" or "Land mask" or "Sea mask" in xancil, but whatever option I use the same problem occurs.
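
For what it's worth, the missing-data value actually stored against each field of a converted ancillary can be checked with a few lines of Python - an untested sketch, assuming the Met Office mule library is available; the filename is purely illustrative:

import numpy as np
import mule

RMDI = -1073741824.0   # the UM's standard real missing-data indicator (-2**30, i.e. the -1.0737e+09 above)

anc = mule.AncilFile.from_file("soil_dust.anc")      # illustrative path
for field in anc.fields:
    data = field.get_data()
    n_flag = int(np.sum(data == field.bmdi))         # points at the lookup header's own MDI
    n_2e20 = int(np.sum(data >= 1.0e20))             # points still at the netCDF fill value
    print("STASH", field.lbuser4, "| bmdi =", field.bmdi,
          "(standard RMDI)" if field.bmdi == RMDI else "(non-standard)",
          "| points at bmdi:", n_flag, "| points >= 1e20:", n_2e20)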

Or is this just a red herring?

Charlie

Change History (22)

comment:1 in reply to: ↑ description Changed 10 months ago by jeff

Hi Charlie

> I did have one thought, which came about when converting my 4 files from netcdf into UM format, using xancil. When writing out to netcdf, I set my missing data flag (i.e. ocean) to 2.000e+20. For vegetation fraction, function and soil parameters, when these are then run through xancil, this flag is converted to -1.0737e+09. The files are still viewable using xconv, as it knows what the missing data are. However, this does NOT happen for soil dust. Here, when I run this through xancil, the missing data stay as 2.000e+20 even once it is UM format, and it is not viewable using xconv (i.e. because of the large values, the colour scale saturates out).
>
> Might this be the problem? Having looked at the original PI version of soil dust, this is not the case i.e. it has -1.0737e+09 as missing data and is viewable. If this is the problem, how do I change xancil so that, just like the other 3, the missing data flag is changed? I have tried selecting "No mask" or "Land mask" or "Sea mask" in xancil, but whatever option I use the same problem occurs.
>
> Or is this just a red herring?

It may well be a red herring, but when I convert the soil dust files to ancil format and set "Select Mask Type" to "Land Mask", the missing data is converted correctly.

Jeff.

comment:2 Changed 10 months ago by charlie

Hi Jeff,

Sorry for the delay. If I make the soil dust ancillary (using my job.soil_d.job in the above directory) and set the mask to "Land mask" as you did, then only the first field (clay) is converted correctly and is viewable afterwards. All of the other fields (silt, sand and the 6 dust divisions) remain as 2.0000e+20, and are not viewable. Is this what happened to you?

If not, can you send me your job, in case I am doing something different?

Charlie

comment:3 Changed 10 months ago by charlie

Further to this, I have just tried remaking my file, using exactly the same input .nc and selecting exactly the same input fields (starting from scratch i.e. not loading my job file), and now I can't even get the first field, clay, to convert correctly, like I did this morning. So not sure what's going on here! Have I missed something?

comment:4 Changed 10 months ago by charlie

Sorry, no, ignore that last comment. If I load my existing file, change to "Land mask", save it, then run xancil, it does at least convert the first field correctly. But still not the others, which remain 2.0000e+20.

comment:5 Changed 10 months ago by jeff

Hi Charlie

You have only set "Land Mask" for the first field; you need to do it for all 9 fields.

Jeff.

comment:6 Changed 10 months ago by charlie

Ah, yes, sorry - I just realised that literally seconds before your message arrived. Many apologies - I thought that was a global setting which would be applied to all fields. I have now made the ancillary correctly and restarted my suite, so I will let you know in the next 30 minutes or so if it still fails (i.e. if that was indeed a red herring)…

comment:7 Changed 10 months ago by charlie

… Nope, frustratingly that wasn't the problem, as it has just failed at virtually the same point (i.e. ~20 minutes into the coupled stage) with the same error as above. Do you have any ideas what might be causing this?

Charlie

comment:8 Changed 10 months ago by charlie

Hi again Jeff, or indeed anybody!

Now, this IS interesting. This afternoon, whilst hoping to get some further advice, I tried a couple of things: one which I thought might well be the culprit, but it turns out it wasn't (as it again failed with the same error at the same place), and another which I never expected to work, but which so far appears to be working.

The first is an idea I had when looking at the reconfiguration panel, at um > namelist > Reconfiguration and ancillary control > Configure ancils… Unlike my vegetation fraction, soil parameters and soil dust (and indeed my other ancillaries, e.g. orography), which are all set to "Initialise from ancillary file", my vegetation function was set to "Set to missing data". Indeed, when I checked the br178.astart, the fields within this ancillary were also completely missing. Therefore, I changed this in the suite, so that the function was also initialised from the ancillary, and resubmitted. I was fairly hopeful that this would work; however, as I said, yet again I got the same error at the same location. If I check the new br178.astart, it does now reflect my changes, rather than being full of missing data, so am I right in thinking that even if this step is not causing the blowup, it is still necessary? Otherwise, surely, if I am not initialising from the ancillary file, my changes are not being picked up. The only reason I originally had this option set to "Set to missing data" was because that's what the PI uses, but perhaps there were other reasons for doing this? Either way, even with this change, it is still failing.
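
(As an aside, whether a given field in the astart really is all missing data can be checked with a short script - an untested sketch, again assuming the mule library is available; the STASH code below is a placeholder rather than the real item number for the vegetation function.)

import numpy as np
import mule

STASH_CODE = 217        # placeholder only - substitute the real STASH item code
dump = mule.DumpFile.from_file("br178.astart")
for field in dump.fields:
    if field.lbuser4 == STASH_CODE:
        data = field.get_data()
        print("level", field.lblev, "all missing:", bool(np.all(data == field.bmdi)))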

However…

I then tried something else, which was going to be my next step once I got these 4 to work. As I said in my original message above, the biggest change to these 4 ancils is changing the ice mask, which of course involves changing the soil parameters and dust so that they match. I am very aware, however, that there are some other ancils which also depend on the ice mask, namely soil moisture and snow amount (in qrclim.smow) and deep soil temperature (in qrclim.slt). None of these are ancillaries in my suite, or indeed in the PI control - all of them must be brought in via the restart dump. I have already confirmed this with Till, who traced back his various PI suites and found that none of them have these. He therefore concluded that these fields must have been introduced into the very early GC3 versions, and subsequently carried over in the restart dumps.

Therefore, I copied my suite to a new one (br289), found the PI versions of these ancils, modified them so that they match my new ice mask, and then created new ancillary sections within the reconfiguration panel, setting both of these files to "Initialise from ancillary file". Much to my utter amazement, this is now working, and has been running for over an hour. This is despite NOT making the first change above, i.e. it is still not properly reading the vegetation function. But, despite this, it is running.

Does this make any sense to you? Do you think that this implies I need to stop my current suite (br289), make the vegetation function change as above, keep my new qrclim.smow and qrclim.slt, and then resubmit? I think I will try this right now anyway, and see if it still runs.

Many thanks, and apologies for the long message.

Charlie

comment:9 Changed 10 months ago by charlie

I spoke too soon. Literally a minute after I sent the last message, br289 has now also failed, again with the same error. But it still ran for just over an hour, which is much longer than before.

I think I'm going to resubmit it anyway, with the change to the function, and see what happens with all the changes…

comment:10 Changed 10 months ago by jeff

Hi Charlie

The reason veg.func was set to missing data in the start dump is that this field is updated, i.e. update_anc = .true.. This means the model doesn't read this field from the start dump but from the ancillary file, at the update period specified (5 days in this case). Therefore the field can be anything in the dump, but it does need to be there for some reason.

As to the reason your model is crashing I'm not sure, but given that it doesn't crash immediately there is probably not a major problem with your ancillaries. Your model must be going unstable for some reason and blowing up, which can be really hard to pin down. How far is the model getting in model time and in timesteps? You could try to reduce the timestep and see if that gets over the problem.

Jeff.

comment:11 Changed 10 months ago by charlie

Hi Jeff,

Sorry for the delay - okay, the suite I submitted last night failed again with the same error, about 2 hours into the run, so clearly changing the vegetation function wasn't the answer. I have therefore reverted the vegetation function back to "Set to missing data", given that this seems to be the more appropriate setting for my suite (br289, which is the one that contains the added soil moisture/snow amount/soil temperature).

The last restart dump written out is 21 February, so it ran for just over a month and a half. I have had a look at various obvious fields in this restart dump, e.g. surface temperature, and there is nothing obviously wrong.

How can I find out exactly which timestep it failed on? And, if I do need to reduce the timestep, how do I do this? But if I do do this, won't that make it run a lot slower?

Charlie

comment:12 Changed 10 months ago by jeff

Hi Charlie

To find the last timestep look at file work/18830101T0000Z/coupled/pe_output/br289.fort6.pe0000 and search for Atm_Step: Timestep. For the u-br289 run the last timestep was

Atm_Step: Timestep     3976   Model time:   1883-02-26 05:20:00

The run crashed some 5 days after the last restart dump, so any instabilities may not have appeared in the dump file. If you are interested in finding out which fields go unstable, you could specify STASH output which just covers timesteps around the failure point, or get the model to do restart dumps there.

Looking at the fort6 output file, it shows

====================================================================================
Slow physics source terms from atmos_physics1:
r_u      :         -0.5520726505947214E+01          0.5292094232662707E+01
r_v      :         -0.3949644554213141E+01          0.4602793914356429E+01
r_thetav :         -0.9817391028844222E+01          0.1613968833042159E+02                             NaN          0.1000000000000000E+01
r_m_v    :         -0.6772025096360736E-01          0.6774895802839093E-03                             NaN          0.1000000000000000E+01
r_m_cl   :         -0.7932940502358046E-03          0.7269984446983149E-03                             NaN          0.1000000000000000E+01
r_m_cf   :         -0.4166939959233216E-03          0.4544565515597392E-03                             NaN          0.1000000000000000E+01
r_m_rain :         -0.6497841252941329E-03          0.6091613454145190E-03                             NaN          0.1000000000000000E+01
====================================================================================
********************************************************************************************
Fast physics sources for ENDGame from atmos_physics2:
min                      max                      average (non-bit reproducing)   (1=has NaN 0=no NaN)
s_u      :         -0.4490197683719267E+02          0.8333708914961450E+01          0.2535736101247619E-02          0.0000000000000000E+00
s_v      :         -0.1888841086553220E+02          0.1027216816708407E+02          0.6657879124581047E-03          0.0000000000000000E+00
s_w      :          0.0000000000000000E+00          0.0000000000000000E+00          0.0000000000000000E+00          0.0000000000000000E+00
s_thetav :         -0.3629077562007865E+01          0.3842672143021616E+01                             NaN          0.1000000000000000E+01
s_m_v    :         -0.7317328622074307E-02          0.4688009258091733E-02                             NaN          0.1000000000000000E+01
s_m_cl   :         -0.4932251589800525E-03          0.6538282017374688E-03                             NaN          0.1000000000000000E+01
s_m_cf   :         -0.2117070745097046E-03          0.2902829760657792E-03                             NaN          0.1000000000000000E+01
s_m_rain :         -0.4248077065297710E-05          0.4325988428157177E-05                             NaN          0.1000000000000000E+01
********************************************************************************************

I don't know if this means anything to you, but it shows which fields have NaNs in them.
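
(For reference, the last timestep reached and any lines mentioning NaNs can be pulled out of a pe_output file with a few lines of Python - an untested sketch, using the path given above:)

log = "work/18830101T0000Z/coupled/pe_output/br289.fort6.pe0000"

last_step = None
with open(log, errors="replace") as f:
    for line in f:
        if "Atm_Step: Timestep" in line:
            last_step = line.rstrip()          # keep only the most recent timestep line
        elif "NaN" in line:
            print(line.rstrip())               # any diagnostic line reporting NaNs

print("last timestep reached:", last_step)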

Yes, if you reduce the timestep the model will run slower; if you halve the timestep it will take roughly twice as long. It may be possible to just reduce the timestep to get over a crash and then put it back to normal, but of course it could crash again, or there could be some underlying problem with the model setup. Another method/bodge is to put the last safe dump before the crash through the reconfiguration and restart the model using that.

The timestep is set via the variable steps_per_periodim in the panel Top Level Model Control -> Model Domain and Timestep. In your suite this is set to ${ATMOS_TIMESTEPS_PER_DAY}, which is defined in the Run Initialisation and Cycling panel as Atmosphere Timesteps per Day = 72.
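
(To spell out the arithmetic, here is a quick sketch of what those settings mean in model time, assuming the standard 360-day (12 x 30-day month) climate calendar, which is what makes timestep 3976 land on 26 February:)

steps_per_day = 72
print("timestep length (minutes):", 24 * 60 / steps_per_day)   # 20.0

def model_time(step, steps_per_day=72):
    # Convert a timestep number into (month, day, hour, minute) from 1 January 00:00.
    days, rem = divmod(step, steps_per_day)
    minutes = rem * (24 * 60 // steps_per_day)
    month, day = divmod(days, 30)                               # 30-day months
    return month + 1, day + 1, minutes // 60, minutes % 60

print(model_time(720))    # (1, 11, 0, 0)  -> 11 January, i.e. 10 days into the run
print(model_time(3976))   # (2, 26, 5, 20) -> 26 February 05:20, matching the log above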

Jeff.

comment:13 Changed 10 months ago by charlie

Hi Jeff,

Thanks, and sorry for the delay in getting back to you. I am reluctant to reduce the timestep, simply because I don't want to make it any slower. I would much rather focus on WHY the above fields contain NaNs. It has to be something to do with one of my modified ancillaries - veg, soils, soil moisture, etc. Even if there is nothing wrong with these per se (because otherwise it would have failed straight away), clearly something I have done is causing an instability after a given amount of time, i.e. for a scientific, rather than technical, reason.

Interestingly, and I'm not sure if this sheds any light on the problem, when I ran with just my modified veg fraction, veg function, soil parameters and soil dust, it blew up within 20 minutes of real time. However, when I ran with the above PLUS my modified soil moisture, snow amount and soil temperatures, it ran for 2 hours of real time before blowing up. So, in this case, adding in MORE things appeared to stabilise it, at least for a bit. Does this shed any light on anything?

Charlie

comment:14 Changed 10 months ago by jeff

Hi Charlie

No it doesn't shed any light on anything for me. If this is a scientific rather than technical problem then I'm unlikely to be of much help.

I would suggest you try and get a dump near the blow-up point and see if you can spot any problems. In the Dumping and Meaning panel you can specify irregular dump times: change dump_frequency_units to Timesteps and, for dumptimesim, enter 720,1440,2160,2880,3600,3975,3976.

Jeff.

comment:15 Changed 10 months ago by charlie

Hi Jeff

Okay, I have now done as you suggested, but my suite failed almost straightaway with lots of different errors, all of which are along the lines of:

?  Error code: 101
?  Error from routine: UM_SHELL
?  Error message: Field - Section:0, Item:4
?        PRELIM:TOTIMP:Error in time period conversion
?  Error from processor: 578
?  Error number: 10

I'm assuming the problem here is not a blowup, but rather something to do with changing to irregular dump times? I think I followed your instructions correctly, i.e. in the relevant panel I changed to "irregular dumping", then changed the units to timesteps, then entered the ones you said. Is there something else I should have done?

Charlie

comment:16 Changed 10 months ago by jeff

Hi Charlie

It looks like you need to turn off all the climate mean diagnostics; I knew they wouldn't work but wasn't sure whether you needed to turn them off.

Before you do this I would take a backup copy of u-br289/app/um/rose-app.conf to make it easy to reinstate the climate mean diagnostics.

A quick way to disable all the climate mean diagnostics is to go to the STASH requests rose edit panel and click on the use_name column heading; this will sort the diagnostics by use name. Next, find the first UPMEAN use name and click on that, then scroll down to the last UPMEAN use name and left-click that whilst holding down the Shift key. This should highlight all UPMEAN diagnostics. Next, right-click on the STASH panel and select Ignore these sections; this will take a few minutes. Now you should have all UPMEAN diagnostics deselected. Save and try the run again.
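
(If the GUI route proves fiddly, an edit along these lines should do the same job directly on the file - an untested sketch; the umstash_streq section name, the use_name key and the [!!...] "user-ignored" convention are assumptions to check against your own rose-app.conf, and it writes a modified copy rather than editing in place:)

# Sketch: user-ignore every STASH request whose use_name is UPMEAN.
path = "u-br289/app/um/rose-app.conf"
with open(path) as f:
    text = f.read()

pieces = text.split("\n[")                 # crude split into config sections
out = [pieces[0]]
for sec in pieces[1:]:
    body = "[" + sec
    if body.startswith("[namelist:umstash_streq(") and "use_name" in body and "UPMEAN" in body:
        body = "[!!" + body[1:]            # assumed syntax for a user-ignored section
    out.append(body)

with open(path + ".upmean_off", "w") as f: # write a copy, not in place
    f.write("\n".join(out))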

Jeff.

comment:17 Changed 10 months ago by charlie

Okay, many thanks, I have now done that. I'll let you know what I find if it blows up at the same timestep.

Also, just so I know, am I right in thinking that the timestep is 20 minutes, and therefore timestep 720 = 10 days, 1440 = 20 days, and so on?

Charlie

comment:18 Changed 10 months ago by jeff

Yes that's right.

Jeff.

comment:19 Changed 10 months ago by charlie

Hi Jeff,

Right, I don't understand what's going on here, as it has now blown up yet again but this time giving me what I call the "halo error" i.e.

???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 15
?  Error from routine: LOCATE_HDPS
?  Error message: North/South halos too small for advection.
?        See the following URL for more information:
?        https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
?  Error from processor: 655
?  Error number: 135
????????????????????????????????????????????????????????????????????????????????

This time, it only ran for 20 minutes (failing at timestep 1035 according to the fort6 file), so nowhere near the 2 hours I was getting yesterday. It wrote out the first restart dump at 10 days, but never got as far as the 2nd. I don't understand this - the suite is still br289, and absolutely nothing has changed since I last ran it other than switching to irregular restart dumps and turning off the climate meaning diagnostics, as you instructed. Other than that, it is identical. Although I have modified my ancillaries slightly (today, in fact), the new versions are in an entirely separate directory (to avoid confusion) and this suite is still pointing to the original ones, i.e. the same as the last few days.

Would it be completely impractical for it to write out a dump every single timestep, so that no matter where it fails I always have the final and penultimate dumps? If not, how would I do this? Is the only way to literally write 1,000-odd numbers (or actually more than that, given that it failed at 1035) in that same panel? There must be an easier way?

Charlie

comment:20 Changed 10 months ago by jeff

Hi Charlie

Not sure why it crashed at a different time; the error message may not be too important, as once the model data has NaNs it will crash somewhere. There must have been some difference between the setups, but if you didn't save the old pe_output files I can't really check the difference. I guess turning off climate meaning may change the model evolution.

Each output dump is 2 gigabytes, so 1000 of them is 2 terabytes, which is quite a lot of data. Why don't you rerun and add 1034 and 1035 to the output times? If you select "Regular frequency dumps with possible meaning sequence" you can output every timestep, if you really wanted to do that.
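
(If you do want a run of dumps right around the failure point, a couple of lines of Python will build the dumptimesim list rather than typing the numbers out by hand:)

failure_step = 1035
window = 5
print(",".join(str(t) for t in range(failure_step - window, failure_step + 1)))
# -> 1030,1031,1032,1033,1034,1035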

Jeff.

comment:21 Changed 9 months ago by charlie

Hi,

Sorry for the delay. Just as a quick update - I have now run numerous tests, incrementally changing each ancil, and have narrowed the problem down to the vegetation fraction. Everything else runs, and is stable. Given that it fails roughly 1.5 months into the run, rather than at the very first timestep, this is clearly a scientific error rather than a technical one. In other words, something I have done to this field is causing it to become unstable after a certain amount of time.

I will therefore close this ticket, as I need to track down the scientific problem. Many thanks for your help.

Charlie

comment:22 Changed 9 months ago by charlie

  • Resolution set to fixed
  • Status changed from new to closed