Opened 4 months ago

Closed 6 weeks ago

#2582 closed help (fixed)

Same error as before but never resolved: North/South halos too small for advection

Reported by: charlie
Owned by: um_support
Priority: normal
Component: UM Model
Keywords:
Cc:
Platform: NEXCS
UM Version: 10.7

Description

Hi,

Sorry to raise yet another ticket, but one of my suites has failed with an error that I have seen before and that we never resolved. The original error, with further details, is in ticket #2491. This time my suite (u-ay314) is a modern suite (so no Eocene changes at all), and I started it from the beginning with all the aerosols as 12-monthly climatologies. So my concern in ticket #2491 comment 2 can't be the problem: I started from the beginning, and it has again failed after approximately 20 years with the same error, below:

???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 15
?  Error from routine: LOCATE_HDPS
?  Error message: North/South halos too small for advection.
?        See the following URL for more information:
?        https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
?  Error from processor: 520
?  Error number: 19
????????????????????????????????????????????????????????????????????????????????

Please can you advise?

Thanks,

Charlie

Change History (11)

comment:1 Changed 4 months ago by charlie

Hi,

Further to this, several of my other suites (started at roughly the same time as the one above) have now fallen over at exactly the same point and are giving me exactly the same error as above. Even though they are all slightly different, something therefore must be common to all of them. The only reason I submitted the ticket above a few days ago is because this suite reached the failure point earlier than the others, presumably because of differences in queueing.

To clarify - last week, I set off 4 suites, all of which are very similar but slightly different. They are all set to run for 50 years, using 3 year cycling, and they all start at the same point (September 1988). They have all failed halfway through the 2012 cycle (roughly 18 hours in). They are as follows:

1) u-ba408: Eocene land sea mask (and related ancillaries), Eocene aerosols modified by me, Eocene veg (modified by me), Eocene SST (modified by me), elevated CO2 (3x preindustrial)
2) u-ba436: identical to above, but with elevated CO2 (6x preindustrial)
3) u-ba437: identical to above, but with elevated CO2 (12x preindustrial)
4) u-ay314: Everything modern (i.e. no ancillaries modified) - mentioned above

All of the above use climatology versions of all ancillaries, or at least all the ones I have been able to track down. I thought at first that perhaps there was an ancillary common to all of these (after all, the first 3 were originally copied from the 4th and then modified) which ended in 2012 and was therefore causing the same error in each. But if there is, I don't know where it is. Besides, if the problem was that one of the ancillaries was simply running out of time (e.g. ended in 2012), it would give me a different error, wouldn't it? I have had that problem before, and the error is fairly obvious. It's not the error above.

I don't see how the error can be coming from any of the Eocene modifications common to the first 3 suites (either my modifications or those of other people), because if that was the case why is the last suite (which is modern i.e. which has not been modified at all and which uses all standard input) also failing at exactly the same point and with exactly the same error?

Please can you advise? To reiterate, none of my suites (even my practice ones several weeks ago) have run beyond this stage, so there must be something happening after ~14 years in all of them.

Please help!

Many thanks

Charlie

comment:2 Changed 4 months ago by charlie

Hi again,

If it helps at all for those of you in Reading, I’ll be on campus most of tomorrow so could easily come to chat. I have meetings 10-12 but will then be around and available until 3pm.

Charlie

comment:3 Changed 4 months ago by grenville

Charlie

The halo error is not very helpful, but the fact that 4 suites fail at 2013-01-01 points to a configuration problem. The reason is that you have l_clmchfcg set to .true., which means the model is using time-varying greenhouse gases (which are only defined until 2013).

It looks like you need to set l_clmchfcg to .false. so that it uses your fixed values for co2_mmr, n2o_mmr, ...

Grenville
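The failure date of 2013-01-01 can be matched to the "2012 cycle" reported earlier with a little date arithmetic. The following is a minimal sketch, assuming the first cycle starts on 1 September 1988 (the thread only says "September 1988") and that ordinary Gregorian dates are close enough for whole-year cycling, whatever calendar the model itself uses. The function name `cycle_containing` is hypothetical.

```python
from datetime import date


def cycle_containing(target, first_start, cycle_years):
    """Return (start, end) of the fixed-length cycle that contains target."""
    if target < first_start:
        raise ValueError("target precedes the first cycle")
    # Whole years elapsed since the first cycle start (month/day aware).
    elapsed = target.year - first_start.year
    if (target.month, target.day) < (first_start.month, first_start.day):
        elapsed -= 1
    index = elapsed // cycle_years
    start = date(first_start.year + index * cycle_years,
                 first_start.month, first_start.day)
    end = date(start.year + cycle_years, start.month, start.day)
    return start, end


# Assumed inputs: 3-year cycling from 1 September 1988, failure at 2013-01-01.
start, end = cycle_containing(date(2013, 1, 1), date(1988, 9, 1), 3)
print(start, end)  # 2012-09-01 2015-09-01
```

Under these assumptions, 2013-01-01 falls inside the cycle that starts in September 2012, which is consistent with all four suites stopping in the same cycle.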

comment:4 Changed 4 months ago by charlie

Thanks very much Grenville, should I just search for that switch within Rose? Unfortunately, and very stupidly, I have left my Monsoon fob at home and am working on campus all day today, so can't do this until this evening.

In the meantime, however: is there any way of finding out whether there are any other files that also end in 2013 and that I am also unaware of (just like the varying greenhouse gases)? That way, if I turn that switch off and resubmit, it doesn't just fail again after 14 years because of a different varying file.

Also, once I have switched this switch off, do I need to run from the beginning or is this something that can be restarted from where it failed?

Charlie
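As an alternative to searching in the Rose GUI, the switch can be looked up by treating the app configuration as plain text. This is a minimal sketch, assuming the file consists of `[section]` headers followed by `key=value` lines. The section name in the sample is a made-up placeholder, not the real location of `l_clmchfcg`, and `find_setting` is a hypothetical helper.

```python
import re


def find_setting(conf_text, name):
    """Return (line_number, section, value) for each 'name=value' line."""
    section = None
    hits = []
    for number, line in enumerate(conf_text.splitlines(), start=1):
        stripped = line.strip()
        header = re.fullmatch(r"\[(.+)\]", stripped)
        if header:
            # Remember which section the following settings belong to.
            section = header.group(1)
            continue
        key, sep, value = stripped.partition("=")
        if sep and key.strip() == name:
            hits.append((number, section, value.strip()))
    return hits


# Hypothetical sample text; a real suite's file will look different.
sample = """\
[namelist:example_section]
l_clmchfcg=.true.
other_switch=.false.
"""
print(find_setting(sample, "l_clmchfcg"))
# [(2, 'namelist:example_section', '.true.')]
```

On a real suite the text would come from reading the configuration file from disk, and the reported section shows where to make the change.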

comment:5 Changed 4 months ago by charlie

Hi Grenville,

Just to say that I think I have answered my last question above: obviously I will need to restart from the beginning, because otherwise it would have used varying greenhouse gases for the first 14 years, whereas I wanted to use my fixed (and elevated) greenhouse gases from the beginning. So I need to start afresh.

But please can you advise on my other question, i.e. is there any way I can find out whether this problem exists in any other files, that is, whether any others also end in 2013?

Thanks,

Charlie

comment:6 Changed 4 months ago by grenville

Charlie

Data for co2_mmr, n2o_mmr, ... is not in external files; it's all in app/um/rose-app.conf.

Short of checking the suite configuration by hand, there is no way to know that it has data appropriate for the length of run desired.

It'd be a good project for a student to implement such checks.

Grenville
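A check of the kind suggested above could start as a plain-text scan of the configuration. The following is only a sketch under stated assumptions: it assumes that time-varying settings list their years on a single `key=value` line and that such keys contain the word "years". The key names in the sample are invented for illustration and are not real UM namelist items, and `settings_ending_before` is a hypothetical helper. The run end year of 2038 comes from the 50-year runs starting in 1988 described earlier in the thread.

```python
import re


def settings_ending_before(conf_text, run_end_year, key_pattern=r"years"):
    """Return {key: last_year} for year lists that stop before run_end_year."""
    short = {}
    for line in conf_text.splitlines():
        key, sep, value = line.strip().partition("=")
        if not sep or not re.search(key_pattern, key):
            continue
        # Pick out plausible 4-digit years (1500-2999) from the value.
        years = [int(y) for y in
                 re.findall(r"\b(?:1[5-9]|2[0-9])\d{2}\b", value)]
        if years and max(years) < run_end_year:
            short[key.strip()] = max(years)
    return short


# Invented sample: one list stops in 2013, one extends to 2100.
sample = """\
example_years_gas_a=1860,1950,2000,2013
example_years_gas_b=1860,2100
example_levels_gas_a=4.3e-4,4.7e-4,5.6e-4,6.0e-4
"""
print(settings_ending_before(sample, run_end_year=2038))
# {'example_years_gas_a': 2013}
```

The sketch does not handle values continued over several lines, and it only flags candidates; each flagged setting still needs checking by hand against the switches that control whether it is used.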

comment:7 Changed 4 months ago by charlie

Okay, I wish I had a student to do that!

However, how did you know that this ended in 2013? I have just searched for co2_mmr and the others within this file, but the only place it appears is where the amount is listed. Where does it say that it ends in 2013? I suppose I was rather hoping that it might link to a file, which would have 2013 in its title, and I could then grep for other similar files?

Charlie

comment:8 Changed 3 months ago by charlie

Hi again,

Further to this, I restarted (from the beginning) my 4 suites yesterday, having changed the above switch to false as suggested. I submitted them all at roughly the same time, but one of them (u-ba436) has failed right at the beginning of the 2nd cycle. The others are still going fine. I have checked the error logs, but there doesn't appear to be any obvious error. What's gone wrong this time?!

Charlie

comment:9 Changed 3 months ago by grenville

Charlie

The error says:

apsched: claim exceeds reservation's node-count
[FAIL] um-atmos # return-code=1
2018-08-29T03:42:12Z CRITICAL - failed/EXIT

which seems to be spurious given it had been running OK. Please try restarting the suite.

Grenville

comment:10 Changed 3 months ago by charlie

Okay, I have restarted the suite and it appears to be running. What might have caused this failure?

comment:11 Changed 6 weeks ago by willie

  • Resolution set to fixed
  • Status changed from new to closed