#2558 closed help (fixed)

Vegetation fraction ancillary file problem revisited

Reported by: charlie
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: NEXCS
UM Version: 10.7

Description

Hi,

Right then, we have now finally come back to the error that I originally asked about in ticket #2495 (specifically comment 6). In the previous ticket, I mistakenly thought this error was due to the aerosol emissions files being changed but, as Grenville rightly pointed out, I was changing lots of things all at once - specifically all the aerosol emissions files AND the vegetation fraction ancillary.

I have now resolved the aerosol emissions problem (thanks to Luke and ticket #2546), so have returned to the vegetation fraction problem.

My suite is now u-az608, and it runs absolutely fine (I have just finished a test run of 3 years, see ticket #2551). When I then change the vegetation fraction over, however, I again get the error below:

???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 1
?  Error from routine: EG_BICGSTAB
?  Error message: Convergence failure in BiCGstab, omg is NaN
?        This is a common point for the model to fail if it
?        has ingested or developed NaNs or infinities
?        elsewhere in the code.
?        See the following URL for more information:
?        https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
?  Error from processor: 542
?  Error number: 18
????????????????????????????????????????????????????????????????????????????????

Following your previous instructions (in ticket #2495 comment 9) I have already turned PRINT_STATUS to "Extra diagnostics", which is the only option.

As I said above, this time round I have ONLY changed the vegetation fraction ancillary file, so something here must be causing this error. All I have done to the file is replace one of the tiles (typically tile 9, ice) with zeros, and then increase one of the other tiles by the same amount at each location so that the fractions at each grid point still sum to 1. I'm assuming the fact that the model is getting past the recon stage means that it is not a schoolboy error e.g. being upside down or having the wrong attributes. So what is going wrong?

Many thanks,

Charlie

Change History (17)

comment:1 Changed 11 months ago by charlie

Hi again,

Sorry to bother you about this, but has anybody had a chance to look at the above question? Does anybody have any idea what might be causing this problem?

Thanks,

Charlie

comment:2 Changed 10 months ago by ros

Hi Charlie,

Sorry for the delay, we're in holiday season with several people away. I will see if anyone can help you with this in the meantime, but there will likely be a delay.

If you look in the job.out file you will see that 2 NaNs have been generated by the physics scheme in s_u and s_v.

Regards,
Ros.

comment:3 Changed 10 months ago by simon

Hi,

I'm currently unable to log onto NEXCS to look at the ancil, but a couple of things immediately come to mind. One is that the grid point sums are not exactly 1.0 (they're worth checking again), and the other is that somehow the mask has been altered in the new ancil so that missing data is being misinterpreted as real data in the model.
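The two checks suggested here can be sketched in a few lines of Python. This is only an illustration, not a UM tool: it assumes the vegetation fractions have already been read into a nested list indexed [tile][lat][lon] (reading the ancillary file itself is not shown), and the function names are hypothetical. The missing-data value of -9999.0 is the one reported for these files later in the ticket.

```python
import math

MISSING = -9999.0  # missing-data value reported for the files in this ticket


def check_fractions(tiles, missing=MISSING, tol=1e-6):
    """Return (lat, lon) indices whose tile fractions do not sum to 1.

    `tiles` is a nested list indexed [tile][lat][lon]. Points where every
    tile is flagged missing are treated as masked and skipped; points where
    only some tiles are missing are reported as bad.
    """
    bad = []
    nlat, nlon = len(tiles[0]), len(tiles[0][0])
    for j in range(nlat):
        for i in range(nlon):
            column = [tile[j][i] for tile in tiles]
            if all(v == missing for v in column):
                continue  # fully masked point
            partly_missing = any(v == missing for v in column)
            if partly_missing or not math.isclose(sum(column), 1.0, abs_tol=tol):
                bad.append((j, i))
    return bad


def mask_differences(a, b, missing=MISSING):
    """Return (tile, lat, lon) indices where exactly one field is missing."""
    return [
        (t, j, i)
        for t, (tile_a, tile_b) in enumerate(zip(a, b))
        for j, (row_a, row_b) in enumerate(zip(tile_a, tile_b))
        for i, (va, vb) in enumerate(zip(row_a, row_b))
        if (va == missing) != (vb == missing)
    ]
```

An empty list from both functions would suggest that the sums and the mask are consistent, at least at the chosen tolerance.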

Simon.

comment:4 Changed 10 months ago by charlie

Hi Simon,

Many thanks. As to your comments: firstly, I have already double-checked that each grid point adds up to 1 - and indeed it does. This was a problem before, but I realised the error and resolved it (at least I think I did). Secondly, I haven't altered the mask.

When you are able to get onto NEXCS, would you be able to take a look at both of my files? They are at /home/d05/cwilliams/ga71/ancils/vegfrac where qrparm.veg.frac is the one I'm trying to run with (which doesn't work) and qrparm.veg.frac_orig is the original version (which does work, but is wrong).

The first weird thing you will notice about the original is that the latitudes are upside-down - yet somehow it still works. If I try to run with my new file, but with the latitudes also upside-down (i.e. the same as the original), it fails at the recon stage telling me my latitudes are upside-down - as I would expect!

If I try to run with my new file but with the latitudes the right way up, it gets past the recon stage but then gives me the error above. To respond to Ros' remark (comment 2), I don't particularly understand the answer - what are s_u and s_v in the physics scheme, and why should they be playing up with a different vegetation fraction ancillary file?

The ONLY difference between my version and the original is that the original contains values in tile 9 (ice). It shouldn't - everything should be zero here, apart from the missing values which are currently -9999.0 (just as they are in the original). So I have created a new file, with all the ice values decreased to 0, and all the corresponding bare soil values increased by the same amount. As I said above, I have checked that the grid points still balance, and they do. Everything else should be absolutely identical between my file and the original, including the mask, attributes, metadata, etc.
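The edit described above (zero the ice tile and add the same amount to bare soil, leaving missing values alone) can be sketched as follows. This is a minimal illustration rather than the procedure actually used for these files: it assumes the fractions are held in a nested list indexed [tile][lat][lon], the function name is hypothetical, and the tile indices are passed in by the caller. With zero-based indexing, tile 9 (ice) would be index 8; the index of the bare soil tile is not stated in this ticket and must be confirmed for the configuration in use.

```python
MISSING = -9999.0  # missing-data value reported for the files in this ticket


def move_tile(tiles, src, dst, missing=MISSING):
    """Return a copy of `tiles` with the `src` fraction moved into `dst`.

    `tiles` is a nested list indexed [tile][lat][lon]; `src` and `dst` are
    zero-based tile indices. Points flagged missing in either tile are left
    untouched, so the mask is preserved.
    """
    out = [[row[:] for row in tile] for tile in tiles]  # copy, keep input intact
    for j, row in enumerate(tiles[src]):
        for i, value in enumerate(row):
            if value == missing or tiles[dst][j][i] == missing:
                continue
            out[dst][j][i] = tiles[dst][j][i] + value
            out[src][j][i] = 0.0
    return out
```

Because the amount removed from one tile is added to another at the same point, the per-point sum is unchanged apart from floating-point rounding.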

Charlie

comment:5 Changed 10 months ago by simon

Hi,

I've had a look at the ancil. One thing that's immediately obvious is that the missing data indicator in the field doesn't match the header. In the first instance, change all of the -9999 data points to -32768*32768, then look at it with xconv to check whether these are now interpreted as missing data. If it still fails after that (and I'm guessing you're configuring the ancil rather than updating it), the next step is to look at the field in the processed start dump (i.e. the output from the recon) to see if it looks as expected.
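As a rough sketch of the replacement suggested above, assuming the field has been read into a nested list indexed [tile][lat][lon] (the function name is hypothetical and file reading and writing are not shown):

```python
OLD_MISSING = -9999.0
NEW_MISSING = -32768.0 * 32768.0  # the value suggested above, i.e. -1073741824.0


def replace_missing(tiles, old=OLD_MISSING, new=NEW_MISSING):
    """Return a copy of `tiles` with every `old` value replaced by `new`."""
    return [[[new if v == old else v for v in row] for row in tile] for tile in tiles]
```

The exact equality test is deliberate: only points holding exactly -9999.0 are changed, so real fractions are never touched.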

comment:6 Changed 10 months ago by charlie

Is this mismatch also in the original file? If so, how come the original file works and my new one doesn't?

comment:7 Changed 10 months ago by simon

Yes, but did it work _correctly_? Eyeballing both ancils with xconv reveals some worrying banding (mainly hidden, as xconv is treating -9999 as data). Viewing the data values reveals these bands: for instance, in level 2 there is a row of 0.15323 at latitude 64.375 in the original file. Replace all values of -9999 as previously described and inspect.
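Banding of this kind (a whole row holding one repeated value) can also be searched for programmatically. The sketch below is a heuristic only, not an established diagnostic: it assumes one level is held as a list of rows with a matching list of latitudes, the function name is hypothetical, and rows that are uniformly 0 or 1 are ignored by default because they can be legitimate.

```python
def constant_rows(tile, lats, missing=-9999.0, ignore=(0.0, 1.0), min_points=2):
    """Return (latitude, value) for rows whose non-missing values are identical.

    `tile` is a list of rows (one per latitude) and `lats` the matching
    latitudes. Rows with fewer than `min_points` valid values, or whose
    repeated value is in `ignore`, are skipped.
    """
    found = []
    for lat, row in zip(lats, tile):
        data = [v for v in row if v != missing]
        if len(data) >= min_points and len(set(data)) == 1 and data[0] not in ignore:
            found.append((lat, data[0]))
    return found
```

Any row it reports still needs to be inspected by eye, since a uniform row is not necessarily wrong.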

Also, the inverted data is due to the ancil being created as a 4.5 ancil. In these, the data is stored north to south, rather than south to north. Always ensure that you create your ancils with a UM version number greater than or equal to 8.

comment:8 Changed 10 months ago by charlie

Hi Simon,

Right, I think we are getting to the bottom of this - in fact you have highlighted a problem that I was always worried about, but was told didn't matter!

In short, I didn't create the original file. Months ago, I did, however, inspect the file myself, and also noticed the very weird banding. I thought at the time that this can't be right. However, I spoke to the person who created the file, and he assured me it didn't matter. He told me, as you said, to look at the field in the output start dump (in the recon output) to check, and it does indeed look sensible. The advice I was given was that if this looked sensible, which it did, then the model was internally doing something to the ancillary and so the banding in the ancillary doesn't matter. Was this wrong?

Likewise, with the inverted data: yes, I know full well about this issue, and have fallen foul of choosing the wrong version myself in the past so am now very careful. Whenever I create my new ancillaries, I am certain to choose version 10.7. However, if the original file was made with version 4.5 and therefore has upside-down latitudes, why does it work? To answer your question, yes it does work correctly - I did a practice run with this for 20 years, and everything in the output looks sensible. In fact, the only thing in the output which looked wrong was surface air temperature, which was too low in exactly the same locations as where the ice is in the ancillary. This is entirely sensible and what I would expect. So now I need to remove this ice.

Charlie

comment:9 Changed 10 months ago by simon

Right, I think I may know what's going on. The original ancil might be designed to be masked via the reconfiguration, with all of its own dodgy missing data and banding hidden once masked when configured into the dump. However, if you look at the start dump generated using your ancil, there is a large block of bad missing data at the S pole (with evidence of banding). This is being interpreted as data and causing the model to fail. I'm guessing this should be at the N pole, where it would be masked during the reconfiguration.

Does this block of bad missing data appear in the model that worked? If not, try inverting the data in the new ancil. Then configure and look at the resultant field in the start dump, then compare it with the same field in the start dump of the version that worked.

comment:10 Changed 10 months ago by charlie

Hi Simon. Sorry, where are you looking to see the large block of bad missing data?

comment:11 Changed 10 months ago by simon

az608a.da19880901_00

comment:12 Changed 10 months ago by charlie

In ~/cylc-run/u-az608/share/data/History_Data?

comment:13 Changed 10 months ago by charlie

Okay, further to this, I have now checked both restart dumps i.e. the new one and the original. As I said, the original looks fine. The new one, though, contains the same -9999.0 which is making viewing difficult in xconv. But, as you say, there is lots of missing data and banding.

When you say to invert the data in my new version, do you mean just the data or do you mean the latitudes as well? If I invert the latitudes, so that they are upside down relative to all other ancillary files (but the same as the original), it fails at the recon stage telling me my latitudes are upside down.

comment:14 Changed 10 months ago by simon

Hi,

Keep the latitudes, invert the data. What I suspect happened is that the data in the original is inverted wrt the latitudes for a 4.5 ancil file. That is, in a correct 4.5 ancil file the data should go N-S, but in this file it goes S-N. When this is read by the recon, it is processed assuming the S-N order and inserted correctly into the dump. The recon will ignore the ancil version number, and only does a cursory check on the header for the latitudes. It's a case of two negatives making a positive. When you create a non-4.5 file based on this, you need to flip the data again to get it into the correct order.

Hope that made some sort of sense. Anyway, try inverting the data, but keeping the header the same.
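For illustration, "invert the data but keep the header" amounts to reversing the row order of each field while leaving the latitude coordinates alone. A minimal sketch, assuming the data is held as a nested list indexed [tile][lat][lon] with a separate list of latitudes (the function name is hypothetical, and writing the result back to an ancillary file is not shown):

```python
def flip_rows(tiles):
    """Return a copy of `tiles` with the row (latitude) order of each tile reversed.

    Only the data is reversed; any separate list of latitudes describing the
    grid is deliberately left as it is.
    """
    return [[row[:] for row in reversed(tile)] for tile in tiles]


# Usage: the latitudes stay S-N while the data rows swap ends.
lats = [-89.375, 89.375]
tiles = [[[1.0, 2.0], [3.0, 4.0]]]
flipped = flip_rows(tiles)
```

Applying the function twice returns the original ordering, which is the "two negatives making a positive" effect described above.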

comment:15 Changed 10 months ago by charlie

Hi Simon,

Okay, I did what you said and inverted just the data and… it's running. Or at least it has been running for almost 2 hours so far, which is far more than it ever has before with my new vegetation ancillary. As soon as it finishes its first cycle, which will probably be later on this evening, I will take a look at the restart dump and see if the vegetation fraction looks sensible.

Many thanks,

Charlie

comment:16 Changed 10 months ago by charlie

Hi Simon,

Success! My suite ran for 3 years overnight, and successfully produced (and archived) its output. I have checked several restart dumps (both in ~cylc-run on Monsoon and in the actual archived data on JASMIN), and the vegetation fraction looks exactly right. It also looks exactly the same as my original run using Will's original ancillary, with the only difference between the original run and my new one being a lack of ice in the restart dump - which is exactly right and as it should be.

Very many thanks for sorting this out,

Charlie

comment:17 Changed 10 months ago by simon

  • Resolution set to fixed
  • Status changed from new to closed

Excellent. Things weren't helped by the somewhat dodgy original ancil, but we got there eventually.

I'll close the ticket.

Simon
