Opened 4 weeks ago

Last modified 38 hours ago

#3083 new help

Reconfiguration taking longer than usual

Reported by: charlie Owned by: um_support
Component: UM Model Keywords:
Cc: Platform: NEXCS
UM Version: 10.7

Description

Hi,

Sorry to bother you, but one of my suites (bk944) is taking a lot longer to run the recon app than it used to: when I ran the suite before, this app took no more than a few minutes, whereas now it is taking several hours and, indeed, is failing because it runs out of wall clock time.

The only difference between the earlier runs and this one is that I am now restarting from previous restart dumps (both atmosphere and ocean), whereas before I was starting from a previous atmosphere restart dump but a resting ocean (i.e. climatology). Would including the ocean restart dump make such a difference to the reconfiguration?

If so, please can you advise how I can increase the wall clock time limit in my suite to give it longer, and by how much?

Many thanks,

Charlie

Change History (24)

comment:1 Changed 4 weeks ago by dcase

Charlie,

I'm afraid that I don't know how long this will take, but I think you can change the [[RCF_RESOURCE]] [[[job]]] execution time limit in site/meto_cray.rc.

There are other options here too, but bumping up the time limit (to several hours??) would be a sensible first step.
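For illustration, the setting in question normally looks something like this (a sketch only - the family name, current value and surrounding directives depend on the suite's site file):

    [[RCF_RESOURCE]]
        [[[job]]]
            # ISO 8601 duration; raise from the suite's current value
            execution time limit = PT2H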

comment:2 Changed 4 weeks ago by grenville

Charlie

I see the same behaviour - the reconfiguration appears to have created a start file (in a few minutes, as usual), but it appears that the suite isn't being told to stop. I tried running the model with the file it created - it failed at timestep 1 with "A total of 3024 points had negative mass in set_thermodynamic", but again the suite did not stop.

Can you rerun with the earlier configuration?

Grenville

comment:3 Changed 4 weeks ago by charlie

Hi,

I have now tried increasing the time limit as you suggest, to 2 hours and then 4 hours, but both times it ran out of time. I am now trying 8 hours, but is it really meant to take this long, just for the reconfiguration?! As I said, it has never taken anywhere near this long before (usually a matter of minutes), and the only difference is that it is reconfiguring an ocean restart dump as well as the atmosphere dump, plus 4-5 modified ancillaries. But even so, it shouldn't be this slow, should it?

Charlie

comment:4 Changed 4 weeks ago by grenville

Charlie

Reconfiguration is an atmosphere operation - it does nothing with ocean files

comment:5 Changed 4 weeks ago by charlie

Sorry, I think our messages crossed.

When you say "earlier configuration", do you mean trying to run like I did before, i.e. restarting from the atmosphere dump but not the ocean dump (i.e. starting from ocean climatology)?

comment:6 Changed 4 weeks ago by charlie

Further to this, I have now resubmitted the suite, but with all of the original ancillaries and restart dumps. So if this works, there must be a problem with one of my changes. If not, however, there is something else going on, because the suite is now essentially a PI run.

comment:7 Changed 3 weeks ago by charlie

Hi Grenville,

Sorry for the delay. I have now gone through each of my modified ancillaries, swapping them in and out with the PI versions, to find out which one is causing the problem. For your information, the 6 ancillaries I have modified are: orography, vegetation fraction, vegetation function, soil dust, soil parameters, and ozone.

I have now ascertained, by starting with a complete PI run and then swapping each of the above in one by one, that the problem is with the soil parameters. In other words, if I run with my versions of the other 5 and the PI soil parameters, it works fine and gets past the reconfiguration stage. If I use my soil parameters, the reconfiguration appears to get stuck again.

This is not a particularly new problem, and is one I have experienced before - last time it was a problem between the soil parameters and the vegetation fraction, e.g. I was incorrectly putting soil parameters where there was no vegetation, or vice versa. I will look into this myself now.

What IS different, however, is the timing of the error. Last time I had this problem, and indeed any similar problem with any of my ancillaries, the error always occurred at the first timestep of the coupled stage, and was always something like "Convergence failure in BiCGstab, omg is NaN", which usually implies either a problem with the mask or the above vegetation versus soil issue. I have never had this problem during the reconfiguration stage before, and it has never simply failed to stop. Can you advise what's going on here? The error you mention above, "A total of x points had negative mass in set_thermodynamic", I HAVE seen before, but only ever several years into a run, implying a dynamical blow-up - I have never seen it right at the first timestep.

Can you advise further?

Many thanks,

Charlie

comment:8 Changed 2 weeks ago by charlie

Hi again,

Further to this, I have now tried correcting the soil parameters/dust to avoid the above problem, i.e. I have rebuilt these 2 files so that they are consistent (wherever there is ice in the veg, soil parameters/dust = 0, and wherever there is non-ice in the veg, soil parameters/dust > 0). I have made sure that all the fields that need to sum to 1 indeed do (i.e. the vegetation fractions and the dust clay + silt + sand fractions), and have checked the LSM. Everything appears to be fine. However, yet again, when I try to run with my new files, the recon just keeps running and running, producing a start file but not stopping.
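(For reference, the checks described here can be sketched in Python/numpy roughly as below; the array names, shapes and the land-sea mask argument are placeholders rather than the actual script used, and assume the fields have already been read into arrays by whatever tool is convenient.)

    import numpy as np

    def check_ancils(lsm, veg_frac, ice_tile, soil_param, clay, silt, sand, tol=1e-6):
        # lsm: boolean land-sea mask (True = land); veg_frac: (ntile, ny, nx)
        # tile fractions; the remaining fields: (ny, nx) arrays with NaN or a
        # missing-data value over the ocean.
        land = lsm.astype(bool)

        # Vegetation tile fractions should sum to 1 on every land point.
        bad_frac = land & (np.abs(veg_frac.sum(axis=0) - 1.0) > tol)

        # Dust parent-soil fractions (clay + silt + sand) should sum to 1 on land.
        bad_dust = land & (np.abs(clay + silt + sand - 1.0) > tol)

        # Where the land-ice tile is present, soil parameters should be 0;
        # on all other land points they should be > 0.
        ice = land & (ice_tile >= 1.0 - tol)
        bad_ice = ice & (soil_param != 0.0)
        bad_soil = land & ~ice & ~(soil_param > 0.0)

        # NaNs should not appear on land points at all.
        bad_nan = land & ~np.isfinite(soil_param)

        for name, bad in [("veg fractions not summing to 1", bad_frac),
                          ("dust fractions not summing to 1", bad_dust),
                          ("non-zero soil parameters under ice", bad_ice),
                          ("soil parameters not > 0 on ice-free land", bad_soil),
                          ("NaN on land points", bad_nan)]:
            print(f"{name}: {int(bad.sum())} points")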

Can you possibly advise further, as I am running out of ideas?

Many thanks,

Charlie

comment:9 Changed 12 days ago by grenville

Charlie
Your ancil file has bad data - um-pumf shows, for VOL SMC AT WILTING AFTER TIMESTEP (for example; other fields exhibit the same):

FIELD NO 1:         0.938      90.938     180.938     270.938     359.063

  1:  -89.375:     0.0000      0.0000      0.0000      0.0000      0.0000
 29:  -54.375:        NaN         NaN         NaN         NaN         NaN
 57:  -19.375:        NaN         NaN         NaN         NaN         NaN
 85:   15.625: 0.83831E-01       NaN         NaN     0.27231  0.78687E-01
113:   50.625:    0.20164     0.15853        NaN  0.59201E-01     0.20122
144:   89.375:        NaN         NaN         NaN         NaN         NaN

I think the NaNs should be missing data indicators.

Grenville

comment:10 Changed 11 days ago by charlie

Hi Grenville,

Right, okay, I understand. So I have now changed all 4 of my files (vegetation fraction, vegetation function, soil parameters, soil dust) so that instead of having NaN over the ocean, they have a value of 2.0000e+20. I have checked the preindustrial versions of these files, and this is the value they use.

I have just tried running again with these, and exactly the same problem occurs - reconfiguration just doesn't stop.

I noticed that with one of these files (soil parameters, at /home/d05/cwilliams/pliocene/gc31/ancils/soil/parameters on NEXCS) my values of 2.0000e+20 have been changed in the process of converting to UM format (using xancil), so that they are now -1.0737e+09. Might this be the problem, i.e. is it still expecting 2.0000e+20? If so, how do I stop xancil from converting my 2.0000e+20? I have tried the various options, including telling it to use "no mask", but it still gives me these values.
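(Editorial aside: -1.0737e+09 is numerically -2**30, which is the UM's standard real missing-data indicator (RMDI), so xancil is most likely writing the conventional MDI over masked points rather than corrupting the field; whether the reconfiguration accepts 2.0e+20, RMDI or NaN there is the real question. A minimal sketch for checking which "missing" convention a field actually contains, assuming it has already been loaded as a numpy array:)

    import numpy as np

    RMDI = -2.0**30   # UM real missing-data indicator, -1073741824.0 (prints as -1.0737e+09)

    def missing_summary(field):
        # Count how each "missing over ocean" convention appears in a 2D field.
        print("NaN points:          ", int(np.sum(~np.isfinite(field))))
        print("2.0e+20 fill points: ", int(np.sum(field == 2.0e20)))
        print("RMDI points:         ", int(np.sum(field == RMDI)))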

Charlie

comment:11 Changed 11 days ago by grenville

Charlie

Which files allow the reconfiguration to run OK?

Grenville

comment:12 Changed 11 days ago by charlie

Hi Grenville,

The reconfiguration only works if I use the preindustrial versions of vegetation and soils, i.e. the following:

/projects/um1/ancil/atmos/n96e/orca1/vegetation/func_type_modis/v4/qrparm.veg.func
/projects/um1/ancil/atmos/n96e/orca1/vegetation/fractions_igbp/v4/qrparm.veg.frac
/projects/um1/ancil/atmos/n96e/orca1/soil_parameters/hwsd_vg/v4/qrparm.soil
/projects/um1/ancil/atmos/n96e/orca1/soil_dust/hwsd/v4/qrparm.soil.dust

If I use these, alongside my modified topography and ozone, the model does at least run.

Charlie

comment:13 Changed 11 days ago by grenville

You'd mentioned before that only your soil parameters caused the problem - is that not the case?

comment:14 Changed 11 days ago by charlie

Yes, sorry, that's right. If I use my versions of the veg, but the PI versions of the soils, the reconfiguration works. It still fails at the coupled stage, but I know why that is (it's because the PI soils have zeros over Antarctica to correspond to ice regions, but my versions of the veg have some veg over Antarctica).

However, if I use my versions of the veg and my versions of the soil, it doesn't even get this far and fails at the reconfiguration stage.

Charlie

comment:15 Changed 10 days ago by grenville

Charlie

Please point me to the xancil saved job which was used to create /home/d05/cwilliams/pliocene/gc31/ancils/soil/parameters/qrparm.soil.
Do you have other soil parameter files that you have created and successfully run with? If so, please point to them too.

Grenville

comment:16 Changed 10 days ago by charlie

Hi Grenville,

Just to say that I haven't forgotten about this, but due to the maintenance going on right now I can't give you this information. I don't currently have an xancil saved job for the soil parameters, but I can create one as soon as the system comes back.

I do have another soil parameter file, which I created for the Eocene and which I think works, so I can point you to that as well as soon as the system returns.

In the meantime, if it helps, the preindustrial versions are at:

/projects/um1/ancil/atmos/n96e/orca1/soil_dust/hwsd/v4
/projects/um1/ancil/atmos/n96e/orca1/soil_parameters/hwsd_vg/v4

Many thanks, more later (or tomorrow) as soon as the system returns.

Charlie

comment:17 Changed 10 days ago by grenville

Thanks Charlie - this is quite a puzzle!

comment:18 Changed 9 days ago by charlie

Hi Grenville,

Now that the system has returned, I can follow this up. To answer your first question, the job file to make the soil parameters is at /home/d05/cwilliams/pliocene/gc31/ancils/soil/parameters/job_soil_p.job and the one to make the soil dust fields is at /home/d05/cwilliams/pliocene/gc31/ancils/soil/dust/job.soil_d.job

Both of these use the "generalised ancillary" option, because this is what I was told to do by Jeff ages ago in order to make these files. I have just tried to make the parameters file again, and yet again, although I am selecting "no mask" in xancil, it is still applying a mask and is still changing my values of 2.0000e+20 (to -1.0737e+09, as before). This does not seem to happen with the soil dust file, which retains its 2.0000e+20 after being made.

To answer your 2nd question, the soil parameters and soil dust fields that I used previously, for my Eocene suite (which does work, albeit unstable) are at:

/home/d05/cwilliams/gc31/final_ancils/soil_dust/hwsd
/home/d05/cwilliams/gc31/final_ancils/soil_parameters/hwsd_vg

Neither of these presents any problems in either the reconfiguration or the coupled stage, i.e. the model runs with these.

I have just compared the parameters file from these with my new one, and they are identical in terms of the number of fields, field names, etc. The only difference, other than the mask, is that the Eocene version is a global mean everywhere, whereas the Pliocene version is not. What's really weird is that the Eocene version (which, as I said, DOES run) actually contains NaN instead of 2.0000e+20, which is what we thought the problem with the Pliocene version was in the first place!

Charlie

comment:19 Changed 5 days ago by jeff

Hi Charlie

It looks like this problem is caused by a bug in the reconfiguration, see https://code.metoffice.gov.uk/trac/um/ticket/3533 for further details.

I've backported the fix to vn10.7; include this branch in your job:

branches/dev/jecole/um/vn10.7_uniform_smc_stress_in_recon

Hopefully this will fix the problem.

Jeff.

comment:20 Changed 3 days ago by charlie

Hi Jeff,

Sorry for the delay.

I have now tried running with this branch, which I inserted using the GUI (at fcm_make_um > env > Sources > um_sources), but it failed at the fcm_make_um stage, giving me the following error:

 33) GC3-PrgEnv/2.0/24708
[FAIL] file:///home/d04/fcm/srv/svn/um.xm/main/branches/dev/jecole/um/vn10.7_uniform_smc_stress_in_recon: not found
[FAIL] svn: warning: W170000: URL 'file:///home/d04/fcm/srv/svn/um.xm/main/branches/dev/jecole/um/vn10.7_uniform_smc_stress_in_recon' non-existent in revision 78959
[FAIL] 
[FAIL] svn: E200009: Could not display info for all targets because some targets don't exist

[FAIL] fcm make -f /working/d05/cwilliams/cylc-run/u-bk944/work/18500101T0000Z/fcm_make_um/fcm-make.cfg -C /var/spool/jtmp/7953507.xcs00.5QZI4y/fcm_make_um.18500101T0000Z.u-bk944IAYobi -j 6 --archive # return-code=2

Have I inserted this branch in the wrong place?

Charlie

comment:21 Changed 3 days ago by jeff

Sorry Charlie, the branch name should be

branches/dev/jecole/vn10.7_uniform_smc_stress_in_recon

You are putting it in the right place.

Jeff.

comment:22 Changed 3 days ago by charlie

I have just tried that new pathname, and I get the same error:

[FAIL] file:///home/d04/fcm/srv/svn/um.xm/main/branches/dev/jecole/vn10.7_uniform_smc_stress_in_recon: not found
[FAIL] svn: warning: W170000: URL 'file:///home/d04/fcm/srv/svn/um.xm/main/branches/dev/jecole/vn10.7_uniform_smc_stress_in_recon' non-existent in revision 78962
[FAIL] 
[FAIL] svn: E200009: Could not display info for all targets because some targets don't exist

[FAIL] fcm make -f /working/d05/cwilliams/cylc-run/u-bk944/work/18500101T0000Z/fcm_make_um/fcm-make.cfg -C /var/spool/jtmp/7956140.xcs00.0crHBm/fcm_make_um.18500101T0000Z.u-bk944s_WS_9 -j 6 --archive # return-code=2
2019-12-12T17:15:18Z CRITICAL - failed/EXIT

comment:23 Changed 3 days ago by jeff

Sorry again, third time lucky, the branch name should be

branches/dev/jeffcole/vn10.7_uniform_smc_stress_in_recon

Jeff.
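(For reference: with that corrected name, the um_sources entry under fcm_make_um > env > Sources ends up looking something like the line below; the exact syntax for adding extra source branches depends on the suite.)

    um_sources=branches/dev/jeffcole/vn10.7_uniform_smc_stress_in_recon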

comment:24 Changed 38 hours ago by charlie

Hi Jeff,

Right then, that seems to have worked, and the reconfiguration went through fine. Even more to my surprise (and I genuinely am surprised), the coupled stage has now been running for over 20 minutes, which implies that there was nothing wrong with my ancillary files in the first place. Whenever there has been a problem with them, the run has always failed at the first timestep, i.e. about 6 minutes in.

So…

The only thing I don't understand, at least not entirely, is what the problem was. I read through the ticket you pointed me to, but didn't quite understand the problem or what you did to resolve it. Why was it a problem when I used my newly created ancillaries, when it didn't happen with the standard PI versions? Please can somebody explain in more detail, so I know exactly what's going on here and how it was resolved?

Many thanks,

Charlie
