Opened 12 months ago

Closed 9 months ago

#3083 closed help (answered)

Reconfiguration taking longer than usual

Reported by: charlie
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: NEXCS
UM Version: 10.7



Sorry to bother you, but one of my suites (bk944) is taking a lot longer to run the recon app than it used to. When I ran the suite before, this app took a few minutes and no more, whereas now it is taking several hours and is failing because it runs out of wall clock time.

The only difference between doing it before, and now, is that I am restarting from a previous restart dump (both atmosphere and ocean), whereas before I was starting from a previous atmosphere restart dump, but a resting ocean (i.e. climatology). Would the inclusion of the ocean restart dump make such a difference to the reconfiguration?

If so, please can you advise on how I can change the wall clock time in my suite, to give it longer (and if so, how much longer)?

Many thanks,


Change History (26)

comment:1 Changed 12 months ago by dcase


I'm afraid that I don't know how long this will take, but I think that you can change the [[RCF_RESOURCE]] [[[job]]] execution time limit in site/meto_cray.rc

There are other options here too, but bumping up the time limit (to several hours?) would be a sensible first step.

comment:2 Changed 12 months ago by grenville


I see the same behaviour - the reconfiguration appears to have created a start file (in a few mins as usual). It appears that the suite isn't being told to stop. I tried running the model with the file created - it errored on timestep 1 with "A total of 3024 points had negative mass in set_thermodynamic", but again, the suite did not stop.

Can you rerun with the earlier configuration?


comment:3 Changed 12 months ago by charlie


I have now tried increasing the time limit as you suggest, to 2 hours and then 4 hours, but both times it ran out of time. I am now trying 8 hours, but is it really meant to take this long, just for the reconfiguration?! As I said, it has never taken anywhere near this long before (usually a matter of minutes), and the only difference is it is reconfiguring an ocean restart dump as well as the atmosphere dump, and 4-5 modified ancillaries. But even so, it shouldn't be this slow, should it?


comment:4 Changed 12 months ago by grenville


Reconfiguration is an atmosphere operation - it does nothing with ocean files

comment:5 Changed 12 months ago by charlie

Sorry, I think our messages crossed.

When you say "earlier configuration", do you mean to try running like I did before, i.e. restarting from the atmosphere dump, but not the ocean dump (i.e. starting from ocean climatology)?

comment:6 Changed 12 months ago by charlie

Further to this, I have now just resubmitted the suite, but with all of the original ancillaries and restart dumps. So if this works, there must be a problem with one of my changes. If not, however, there is something else going on, because the suite is now essentially a PI run.

comment:7 Changed 12 months ago by charlie

Hi Grenville,

Sorry for the delay, I have now gone through each of my modified ancillaries, swapping them in and out with the PI versions, to find out which one is causing the problem. For your information, the 6 ancillaries I have modified are: orography, vegetation fraction, vegetation function, soil dust, soil parameters, and ozone.

I have now ascertained, by starting with a complete PI run and then slowly swapping each of the above in one by one, that the problem is with the soil parameters. In other words, if I run with my versions of the other 5, and the PI soil parameters, it works fine and gets past the reconfiguration stage. If I use my soil parameters, the reconfiguration appears to get stuck again.

This is not a particularly new problem, and is one I have experienced before - last time it was a problem between the soil parameters and the vegetation fraction, e.g. I was incorrectly putting soil parameters where there was no vegetation, or vice versa. I will look into this myself now.

What IS different, however, is the timing of the error. Last time I had this problem, and indeed any similar problem with any of my ancillaries, the error always occurred at the first timestep of the coupled stage, and was always something like "Convergence failure in BiCGstab, omg is NaN", which usually implies either a problem with the mask or the above vegetation versus soil issue. I have never had this problem during the reconfiguration stage before, and it has never just failed to stop. The error you mention above, "A total of x points had negative mass in set_thermodynamic", I HAVE seen before but only ever several years into a run, implying a dynamical blowup - I have never seen it right at the first timestep.

Can you advise further?

Many thanks,


comment:8 Changed 11 months ago by charlie

Hi again,

Further to this, I have now tried correcting the soils parameters/dust to avoid the above problem i.e. I have rebuilt these 2 files, so that they are correct (i.e. wherever there is ice in the veg, soil parameters/dust = 0 and wherever there is non-ice in the veg, soil parameters/dust >0). I have made sure that all the fields that need to sum to 1 indeed do (i.e. vegetation fraction and dust (clay + silt + sand)), and have checked the LSM. Everything appears to be fine. However, yet again, when I try to run with my new files, recon just keeps running and running, producing a start file but not stopping.

Can you possibly advise further, as I am running out of ideas?

Many thanks,
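
The checks described in this comment can be sketched roughly as below. This is only an illustrative sketch in plain Python: the function names and the sample values are hypothetical, and it works on simple lists rather than on real UM ancillary files.

```python
# Hypothetical consistency checks for land-surface ancillary fields.
# Names and sample data are illustrative; nothing here is part of xancil or the UM.

def check_fractions_sum_to_one(fractions, tol=1e-6):
    """Return indices of points whose fractions do not sum to 1 within tol.

    fractions is a list of per-point lists, e.g. vegetation fractions
    or dust (clay + silt + sand) at each land point.
    """
    return [i for i, point in enumerate(fractions) if abs(sum(point) - 1.0) > tol]


def check_ice_soil_consistency(is_ice, soil_value):
    """Return indices that break the rule: ice -> soil == 0, non-ice -> soil > 0."""
    bad = []
    for i, (ice, soil) in enumerate(zip(is_ice, soil_value)):
        if ice and soil != 0.0:
            bad.append(i)
        elif not ice and not soil > 0.0:  # also catches NaN, since NaN > 0 is False
            bad.append(i)
    return bad


# Made-up sample points: the second fraction set does not sum to 1,
# and the second soil point is non-ice but has a zero soil value.
bad_sums = check_fractions_sum_to_one([[0.5, 0.5], [0.2, 0.2]])
bad_soil = check_ice_soil_consistency([True, False, False], [0.0, 0.0, 0.3])
print(bad_sums, bad_soil)
```

A check like this only confirms that the fields are internally consistent; as the rest of the thread shows, it cannot rule out a problem elsewhere.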


comment:9 Changed 11 months ago by grenville

Your ancil file has bad data - um-pumf says for VOL SMC AT WILTING AFTER TIMESTEP (for example, but other fields exhibit the same):

FIELD NO 1:            0.938       90.938      180.938      270.938      359.063

   1:  -89.375:       0.0000       0.0000       0.0000       0.0000       0.0000
  29:  -54.375:          NaN          NaN          NaN          NaN          NaN
  57:  -19.375:          NaN          NaN          NaN          NaN          NaN
  85:   15.625:  0.83831E-01          NaN          NaN      0.27231  0.78687E-01
 113:   50.625:      0.20164      0.15853          NaN  0.59201E-01      0.20122
 144:   89.375:          NaN          NaN          NaN          NaN          NaN

I think the NaNs should be missing data indicators.
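
As a rough illustration of the suggestion above, the sketch below replaces NaN values with a missing data indicator before a field is written out. It is plain Python on a list of numbers, not a tool for real UM files. The indicator value is an assumption here: it is taken to be -32768.0 * 32768.0, which is consistent with the -1.0737e+09 reported later in this thread, and the function names are hypothetical.

```python
import math

# Assumption: the real missing data indicator is -32768.0 * 32768.0.
RMDI = -32768.0 * 32768.0  # -1073741824.0


def replace_nan_with_mdi(values, mdi=RMDI):
    """Return a copy of values with every NaN replaced by the missing data indicator."""
    return [mdi if math.isnan(v) else v for v in values]


def count_nan(values):
    """Count NaN entries, useful as a sanity check before and after cleaning."""
    return sum(1 for v in values if math.isnan(v))


# One row from the um-pumf output above (latitude 15.625).
row = [0.083831, float("nan"), float("nan"), 0.27231, 0.078687]
cleaned = replace_nan_with_mdi(row)
print(count_nan(row), count_nan(cleaned))
```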


comment:10 Changed 11 months ago by charlie

Hi Grenville,

Right, okay, I understand. So I have now changed all of my 4 files (vegetation fraction, vegetation function, soil parameters, soil dust) so that instead of having NaN over ocean, they instead have a value of 2.0000e+20. I have checked the preindustrial versions of these, and this is the value it uses.

I have just tried running again with these, and exactly the same problem occurs - reconfiguration just doesn't stop.

I noticed that with one of these files (soil parameters, at /home/d05/cwilliams/pliocene/gc31/ancils/soil/parameters on NEXCS) my values above have been changed in the process of converting to UM format (using xancil), so that they are now -1.0737e+09. Might this be the problem, i.e. it is still expecting 2.0000e+20? If so, how do I stop xancil from converting my 2.0000e+20? I have tried the various options, including telling it to use "no mask", but it still gives me these values.


comment:11 Changed 11 months ago by grenville


Which files allow the reconfiguration to run OK?


comment:12 Changed 11 months ago by charlie

Hi Grenville,

The reconfiguration only works if I use the preindustrial versions of vegetation and soils, i.e. the following:


If I use these, alongside my modified topography and ozone, the model does at least run.


comment:13 Changed 11 months ago by grenville

You'd mentioned before that only your soil parameters caused the problem - is that not the case?

comment:14 Changed 11 months ago by charlie

Yes, sorry, that's right. If I use my versions of the veg, but the PI versions of the soils, the reconfiguration works. It still fails at the coupled stage, but I know why that is (it's because the PI soils have zeros over Antarctica to correspond to ice regions, but my versions of the veg have some veg over Antarctica).

However, if I use my versions of the veg and my versions of the soil, it doesn't even get this far and fails at the reconfiguration stage.


comment:15 Changed 11 months ago by grenville


Please point me to the xancil saved job which was used to create /home/d05/cwilliams/pliocene/gc31/ancils/soil/parameters/qrparm.soil
Do you have other soil parameter files that you have created and successfully run with? If so, please point me to them.


comment:16 Changed 11 months ago by charlie

Hi Grenville,

Just to say that I haven't forgotten about this, but due to the maintenance going on right now I can't give you this information. I don't currently have an xancil saved job for the soil parameters, but I can create one as soon as the system comes back.

I do have another soil parameter file, which I created for the Eocene, which I think works, so I can point you to that as well, as soon as the system returns.

In the meantime, if it helps, the preindustrial versions are at:


Many thanks, more later (or tomorrow) as soon as the system returns.


comment:17 Changed 11 months ago by grenville

Thanks Charlie - this is quite a puzzle!

comment:18 Changed 11 months ago by charlie

Hi Grenville,

Now that the system has returned, I can follow this up. To answer your first question, the job file to make the soil parameters is at /home/d05/cwilliams/pliocene/gc31/ancils/soil/parameters/job_soil_p.job and the one to make the soil dust fields is at /home/d05/cwilliams/pliocene/gc31/ancils/soil/dust/job.soil_d.job

Both of these are using the "generalised ancillary" option, because this is what I was told to do ages ago by Jeff, in order to make these files. I have just tried to make the parameters file again, and yet again although I am selecting "no mask" in xancil, it is still applying a mask and is still changing my values of 2.0000e+20 to. This does not seem to happen with the soil dust file, which retains its 2.0000e+20 after being made.

To answer your 2nd question, the soil parameters and soil dust fields that I used previously, for my Eocene suite (which does work, albeit unstable) are at:


Neither of these present any problems in either the reconfiguration or the coupled stage i.e. the model runs with these.

I have just compared the parameters file from these, with my new one, and they are identical in terms of the number of fields, field names, etc. The only difference, other than the mask, is that the Eocene version is a global mean everywhere, whereas the Pliocene version is not. What's really weird is that the Eocene version (which, as I said, DOES run) actually contains NaN instead of 2.0000e+20, which is what we thought the problem with the Pliocene version was in the first place!


comment:19 Changed 11 months ago by jeff

Hi Charlie

It looks like this problem is caused by a bug in the reconfiguration, see for further details.

I've backported the fix to vn10.7. Include this branch in your job:


Hopefully this will fix the problem.


comment:20 Changed 11 months ago by charlie

Hi Jeff,

Sorry for the delay.

I have now tried running with this branch, which I inserted using the GUI (at fcm_make_um > env > Sources > um_sources), but it failed at the fcm_make_um stage, giving me the following error:

 33) GC3-PrgEnv/2.0/24708
[FAIL] file:///home/d04/fcm/srv/svn/um.xm/main/branches/dev/jecole/um/vn10.7_uniform_smc_stress_in_recon: not found
[FAIL] svn: warning: W170000: URL 'file:///home/d04/fcm/srv/svn/um.xm/main/branches/dev/jecole/um/vn10.7_uniform_smc_stress_in_recon' non-existent in revision 78959
[FAIL] svn: E200009: Could not display info for all targets because some targets don't exist

[FAIL] fcm make -f /working/d05/cwilliams/cylc-run/u-bk944/work/18500101T0000Z/fcm_make_um/fcm-make.cfg -C /var/spool/jtmp/7953507.xcs00.5QZI4y/fcm_make_um.18500101T0000Z.u-bk944IAYobi -j 6 --archive # return-code=2

Have I inserted this branch in the wrong place?


comment:21 Changed 11 months ago by jeff

Sorry Charlie the branch name should be


You are putting it in the right place.


comment:22 Changed 11 months ago by charlie

I have just tried that new pathname, and I get the same error:

[FAIL] file:///home/d04/fcm/srv/svn/um.xm/main/branches/dev/jecole/vn10.7_uniform_smc_stress_in_recon: not found
[FAIL] svn: warning: W170000: URL 'file:///home/d04/fcm/srv/svn/um.xm/main/branches/dev/jecole/vn10.7_uniform_smc_stress_in_recon' non-existent in revision 78962
[FAIL] svn: E200009: Could not display info for all targets because some targets don't exist

[FAIL] fcm make -f /working/d05/cwilliams/cylc-run/u-bk944/work/18500101T0000Z/fcm_make_um/fcm-make.cfg -C /var/spool/jtmp/7956140.xcs00.0crHBm/fcm_make_um.18500101T0000Z.u-bk944s_WS_9 -j 6 --archive # return-code=2
2019-12-12T17:15:18Z CRITICAL - failed/EXIT

comment:23 Changed 11 months ago by jeff

Sorry again, third time lucky, the branch name should be



comment:24 Changed 11 months ago by charlie

Hi Jeff,

Right then, that seems to have worked, and the reconfiguration went through fine. Even more to my surprise (and I genuinely am surprised), the coupled stage has now been going for over 20 minutes, which therefore implies that there was nothing wrong with my ancillary files in the first place. Whenever there has been a problem here, it always fails at the first time step i.e. 6 minutes in.


The only thing I don't understand, at least not entirely, is what the problem was. I read through the ticket you pointed me to, but didn't quite understand the problem or what you have done to resolve it. Why was it a problem when I use my newly created ancillaries, when it didn't happen with the standard PI versions? Please can somebody explain in more detail, so I know exactly what's going on here and how it was resolved?

Many thanks,


comment:25 Changed 11 months ago by jeff

Hi Charlie

You have use_smc_stress set to true, which converts smc to smc_stress in routine Rcf_Pre_Interp_Transform and reverses this in Rcf_Post_Process_Atmos. Before it does this it checks whether this actually needs to be done by checking whether the 3 fields, VOL SMC AT …, are the same in the input dump and ancillary file.

It was while doing the check on these fields that the error happened. Each processor checks its local data and returns a flag saying whether the data is the same or not (in routine Rcf_smc_stress). If the fields are identical then it resets use_smc_stress to false and there is no problem; maybe this was the case for your standard PI versions? If the fields are different, as for your run here, then it should keep use_smc_stress set to true and continue on. The bug is that it is possible that on some processors the fields are locally different and on others they are the same (if there is no land on the processor then the fields always come back identical). This means use_smc_stress is true on some processors and false on others, so different processors take different paths through the code. The processors with use_smc_stress true attempt to write out a global smc field, which involves a gather to processor 0, but not all processors are involved in this (those which have use_smc_stress false), so a deadlock occurs.

This problem is fixed by setting use_smc_stress to true on all processors, if it is true on any processor, which is what the branch does. This was fixed in vn11.0 I believe and my branch backports this fix to vn10.7. This is a good reason to keep UM versions up to date.

Hopefully this makes sense and explains things clearly.
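
The logic described above can be mimicked with a small toy model. This is only a sketch in plain Python with no MPI involved: the function names and the sample data are hypothetical, and the call to any() merely stands in for whatever global reduction the real fix uses to agree on a single flag value across processors.

```python
# Toy model of the per-processor flag logic; not the actual reconfiguration code.

def local_flag(dump_field, ancil_field):
    """True if this processor's local land points differ between dump and ancillary."""
    return dump_field != ancil_field


def buggy_flags(dump_parts, ancil_parts):
    """Each processor decides on its own, so the flags can disagree."""
    return [local_flag(d, a) for d, a in zip(dump_parts, ancil_parts)]


def fixed_flags(dump_parts, ancil_parts):
    """Set the flag to true everywhere if it is true on any processor."""
    any_true = any(buggy_flags(dump_parts, ancil_parts))  # stand-in for a global OR
    return [any_true] * len(dump_parts)


def would_deadlock(flags):
    """A gather needs every processor to take part; mixed flags mean some skip it."""
    return len(set(flags)) > 1


# Three hypothetical processors; the last one has no land points at all,
# so its local comparison always reports the fields as identical.
dump_parts = [[0.20, 0.15], [0.08], []]
ancil_parts = [[0.21, 0.15], [0.27], []]

print(buggy_flags(dump_parts, ancil_parts), would_deadlock(buggy_flags(dump_parts, ancil_parts)))
print(fixed_flags(dump_parts, ancil_parts), would_deadlock(fixed_flags(dump_parts, ancil_parts)))
```

In the buggy case the processor with no land ends up with a different flag from the others, which is the situation that leaves the gather waiting forever.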


comment:26 Changed 9 months ago by ros

  • Resolution set to answered
  • Status changed from new to closed