Opened 7 years ago

Closed 7 years ago

#1098 closed help (fixed)

large drift in HadGEM2 coincides with change to phase2b

Reported by: swr04ojb Owned by: ros
Component: UM Model Keywords:
Cc: Platform: MONSooN
UM Version: 6.6.3

Description

Hello,

for the LastMil project we are running HadGEM2 from 850 through to 1850. The relevant job-ids are:

  • 861-949 — xfkmi (alessio)
  • 948-1049 — xfkmk (")
  • 1048-1149 — xfkml (")
  • 1148-1249 — xfkmm (")
  • 1248-1349 — xfkmn (")
  • 1348-1400 — xhina (andrew)
  • — phase2b, switch from ibm00 to ibm02 —
  • 1395-1449 — xhinb/c (andrew)
  • 1448-1499 — xinha (oliver)

Andrew and I have been looking at the results, and we notice that xhinb/c has a substantial drift away from xhina: the global average of 1.5m temperature drifts several degrees downward over a 5-year period.

So far we have only analysed up to 1405. We are pulling back more data, but that is largely beside the point, as the 1395-1400 periods of the two runs should match up directly but don't.
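
For reference, the comparison we are doing is roughly the sketch below (illustrative only: it assumes the monthly-mean 1.5m temperature from each run has been converted to netCDF, and the file patterns and variable name are made up):

# Illustrative comparison of the overlapping 1395-1400 window of two runs.
# File patterns and the variable name "temp_1p5m" are hypothetical.
import numpy as np
import xarray as xr

def global_mean(path_pattern, var="temp_1p5m"):
    ds = xr.open_mfdataset(path_pattern, combine="by_coords")
    # Weight by cos(latitude) so high latitudes don't dominate the mean.
    weights = np.cos(np.deg2rad(ds["latitude"]))
    return ds[var].weighted(weights).mean(dim=("latitude", "longitude"))

diff = global_mean("xhinb_139*.nc") - global_mean("xhina_139*.nc")
print(diff.values)  # should stay near zero if the two runs really do match up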

I can't see any changes in the drivers/dumps between xhina and xhinb, though perhaps I've missed something.

Should we be surprised that the phase2b change has had this effect, or should we have expected it? Could we perhaps have missed something that we should have included or removed as a consequence of the move to phase2b?

Any advice would be greatly welcomed,

kind regards,

oliver

Change History (10)

comment:1 Changed 7 years ago by ros

Hi Oliver,

We weren't aware of there being any differences and I have contacted a colleague at the Met Office who said the following:

"We have found no differences as far as I'm aware in all our UM versions going from the 1C to the 2C.

It is potentially due to a genuine bug that is behaving differently. For example, if a calculation is dependent on uninitialised data it may give different answers because the uninitialised data is different.

We have had problems (failures rather than different results) with some versions of HadCM3 which seem likely to be due to this."
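
As a loose analogy only (nothing to do with actual UM code): numpy's empty() hands back whatever happens to be sitting in memory, so any result computed from it can change between machines, runs or builds, which is the kind of behaviour being described above.

# Loose analogy: a calculation that depends on uninitialised data.
import numpy as np

buf = np.empty(10)   # contents are whatever was already in memory, not zeros
print(buf.sum())     # the "answer" can differ between machines, runs or builds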

Hope this helps,
Regards,
Ros.

comment:2 Changed 7 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Oliver,

I should also have said that you need to make sure you are comparing output from jobs that have been run in exactly the same way: that they are running in exactly the same-sized chunks and, if stopped and restarted, that this was done at exactly the same point. NRUNs and CRUNs will likely not bit-compare, as different initialising routines are run.

Regards,
Ros.

comment:3 Changed 7 years ago by swr04ojb

Hi Ros,

we decided to try to skip straight to a solution, so I restarted the run from an earlier point (1390 rather than 1397). I see exactly the same problem Andrew does: there's a drop in TOA OLR, clear-sky TOA OLR, cloud amount and SAT from the first month. As far as we can see it could be:
(a) a problem with dumps,
(b) a problem with ancillaries,
(c) a problem with the executable.

Having started from different dumps I think we can rule out (a), and probably (b). So that leaves the executable: is it a problem to be using a binary compiled on the ibm00 machine on the ibm02 machine? Should we have recompiled?

kind regards,

oliver

comment:4 Changed 7 years ago by ros

Hi Oliver,

Executables compiled on the 1C should run with no problems on the 2C. I haven't heard of there being any issues with this.

Regards,
Ros

comment:5 Changed 7 years ago by swr04ojb

Hello again Ros,

thanks for getting back to us quickly. What you say about the executables is what I had understood your earlier email to say, but I wanted to confirm that was the case.

The simulation is different from the very first month. Can you make any suggestions as to what might be going wrong? As I figure it, if the executable should be fine, and the dumps are fine, does that just leave the job config and the ancillaries? I have diff'ed the jobs (e.g. xhina and xhinb) and can't spot anything unusual (i.e. just the change of machine name and the start dump/time-of-start). Should I perhaps be doing something different with the ancillary files because I'm starting part of the way through? (For example, we have ozone ancillaries that are a century long.)
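
For reference, the diff of the two jobs was along these lines (a rough sketch only; the directory paths are illustrative):

# Rough sketch of how the two processed job directories were compared.
# Directory paths are illustrative.
import difflib
from pathlib import Path

def diff_jobs(dir_a, dir_b):
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    for file_a in sorted(dir_a.iterdir()):
        if not file_a.is_file():
            continue
        file_b = dir_b / file_a.name
        if not file_b.is_file():
            print("only in", dir_a, ":", file_a.name)
            continue
        diff = difflib.unified_diff(
            file_a.read_text().splitlines(),
            file_b.read_text().splitlines(),
            fromfile=str(file_a), tofile=str(file_b), lineterm="")
        for line in diff:
            print(line)

diff_jobs("umui_jobs/xhina", "umui_jobs/xhinb")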

kind regards,

oliver

comment:6 Changed 7 years ago by ros

Hi Oliver,

I assume the xhina job was started with an NRUN from 1348 and then CRUNs following that to completion?

You say you have now started from 1390: is that with an NRUN? If so, then I wouldn't necessarily expect you to get the same results. As I said above, NRUNs and CRUNs are unlikely to bit-compare.

So, assuming xhina was started with an NRUN from 1348: if you rerun from 1348, starting with an NRUN, do you get the same or different results for the first chunk of the run?

Cheers,
Ros.

comment:7 Changed 7 years ago by swr04ojb

Hi Ros,

all these runs were NRUNs, then CRUNs. xhina was started from xkfmm. It crashed at 1397 (as the ibm00 machine ended). It was restarted as xhinb, from an xhina dump around 1395.

xhinb was spotted to be drifting away from the climatology of xhina, so we started a new job (xinhb) from the dumps in xhina at 1390. That job too is now drifting.

In both cases, by "drifting", I am talking about an 8K drop in the NH mean 1.5m SAT over a ten-year period. These restarted runs (jobs xhinb and xinhb) very clearly have different climates from the original (job xhina).

I understand that CRUNs and NRUNs are not bit-compatible, but we expect them to be scientifically compatible, no? 8K in 10 years, in a hemispheric mean, seems much larger than I would expect.

I haven't yet tried running from 1348, but I can do so. I would create a new job, copying, say, xinhb to xinhd, and then change the start dumps to be the correct ones. To start with an NRUN and no previous CRUN, am I right in thinking I would hit process in the umui, then save-and-quit, manually edit the SUBMIT file, then reopen the umui and hit submit? Will that work without the CRUN step?

kind regards,

oliver

comment:8 Changed 7 years ago by swr04ojb

Hi Ros,

okay, I think we have discovered the root of the problem: a missed hand-edit file that had a hard-coded switch in it. I've just set a run going to test that, so I should know tomorrow whether it has worked.

On a side note, the hand-edit was a shell script that wrapped a Python script. The wrapper seemed unnecessary, so instead I called the Python script directly from the umui, as follows:

python /home/swr04ojb/hand_edits/modify_gas3.py RUNID /home/abozzo/hand_edits/gas_input.dat

(see job xinhc)

this seems to run fine (the output in EXT_SCRIPT_LOG is correct, and the alterations to the files seem to be correct), but it produces an odd error when the job is processed:

"python /home/swr04ojb/hand_edits/modify_gas3.py RUNID /home/abozzo/hand_edits/gas_input.dat does not exist. For more information look at EXT_SCRIPT_LOG file."

Any thoughts on how to avoid this error message popping up?

kind regards,

oliver

comment:9 Changed 7 years ago by ros

Hi Oliver,

Glad to hear that you tracked down the root of the problem. The reason you get that error popping up is that the umui checks whether the hand-edit file you entered in the table exists. You have entered a command (which, I admit, the umui window does invite you to do, though it's not something I have seen done before), so it then runs a file-existence check on the whole string "python /home/swr04ojb/hand_edits/modify_gas3.py RUNID /home/abozzo/hand_edits/gas_input.dat", which obviously doesn't exist as a file.
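
In other words, the check that fails is essentially this (illustrative only):

# Illustrative: the whole table entry is treated as a single file path.
import os

entry = "python /home/swr04ojb/hand_edits/modify_gas3.py RUNID /home/abozzo/hand_edits/gas_input.dat"
print(os.path.exists(entry))                                       # False: no file has this whole command as its name
print(os.path.exists("/home/swr04ojb/hand_edits/modify_gas3.py"))  # True, assuming the script is where you say it is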

Regards,
Ros.

comment:10 Changed 7 years ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed