Opened 4 years ago

Closed 4 years ago

#1651 closed help (fixed)

identical jobs don't bit compare

Reported by: ggxmy
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: ARCHER
UM Version: 8.4

Description

On behalf of Kirsty Pringle

Here at Leeds we are struggling with a modelling conundrum and we wondered if you might be able to advise us?


I think you know that we are planning to run a series of ensemble simulations where we look at the effect of changes to the aerosol parameters on the calculated forcing. The model is to be run nudged double call with all feedbacks from the aerosol to the meteorology turned off (so each ensemble member will have identical meteorology). But we consistently find that different runs give slightly different meteorology.


So we ran a test where we ran exactly the same job twice – same executable, same namelist parameters, same start dump and no parameter perturbations.


And we found that the two *identical* jobs gave different results. The meteorology (wind and temperatures) is subtly but noticeably different. This shouldn’t happen, should it? Would you expect the same job / executable run twice to give binary identical results?


We are running v8.4 CheST+GLOMAP with nudging and with the ACTIVATE CDN parameterisation on ARCHER on the same number of nodes each time.


Is this something that you have ever come across? We are going to do a few tests, e.g. turn UKCA off, turn ACTIVATE off but I thought this might be something that you might have experience with.


Thanks!

Kirsty

Attachments (2)

diff_CCN_tdyup_tdyuo.jpg (80.5 KB) - added by ggxmy 4 years ago.
diff_temp_tdyup_tdyuo_20080602.jpg (106.7 KB) - added by ggxmy 4 years ago.

Change History (30)

comment:1 Changed 4 years ago by grenville

Possible OpenMP problem - awaiting jobids

Grenville

comment:2 Changed 4 years ago by ggxmy

tdyuo and tdyup are copies of tdyon. They were given our median parameters (in CNTLATM as AEROS_*), use copies of the tdyon executable and the same dump file as tdyon, and were run as two members of an ensemble.

Well, that's what I believe I did, but I think we shouldn't exclude any possibility. Of course any problem could have been caused by a human error.

Anyway, these give different results. For example, the attached figure shows the difference in CCN at 2 km elevation. We see relatively large differences in the tropics. We also see very similar patterns in other output fields.

tdyra is also the same as tdyuo and tdyup, but it was created and run earlier as a member of a different ensemble. Its results are different from tdyuo and tdyup, and the differences from tdyuo and tdyup show a very similar (but not identical) pattern.

Thanks,
Masaru

comment:3 Changed 4 years ago by ggxmy

These jobs are not on UMUI. They were created on ARCHER.

Should I create a copy of a job on UMUI and run it from there? Maybe that might give us some clue although our aim is creating copies of a job on ARCHER.

Masaru

comment:4 Changed 4 years ago by ggxmy

Sorry, I misunderstood Grenville's comment made offline. He needed jobs in umui_runs not on UMUI.

OK then they are on ARCHER:/home/n02/n02/masara/umui_runs/tdyu#-239185245 where #=o and p.

These were created by copying and modifying tdyon-239185245.

Thanks,
Masaru

comment:5 follow-up: ↓ 7 Changed 4 years ago by grenville

Masaru

Thanks - I'm running some tests.

Please tell me the UMUI job id for the model build

Grenville

comment:6 follow-up: ↓ 8 Changed 4 years ago by grenville

Masaru

I ran the two jobs in /home/n02/n02/masara/umui_runs/tdyu#-239185245 - each ran for 144 time steps and produced identical results.

How long did you run before the models diverged?

Grenville

comment:7 in reply to: ↑ 5 Changed 4 years ago by ggxmy

Replying to grenville:

> Masaru
>
> Thanks - I'm running some tests.
>
> Please tell me the UMUI job id for the model build
>
> Grenville

Hi Grenville,

tdyon is the base job created on UMUI and submitted from there. It is copied and modified on ARCHER to create tdyu#.

Thank you.
Masaru

comment:8 in reply to: ↑ 6 Changed 4 years ago by ggxmy

Replying to grenville:

> Masaru
>
> I ran the two jobs in /home/n02/n02/masara/umui_runs/tdyu#-239185245 - each ran for 144 time steps and produced identical results.
>
> How long did you run before the models diverged?
>
> Grenville

Grenville,

I ran them for two months and checked the monthly means from the second month (July 2008).

By the way, I tried running two identical un-nudged (free running) jobs. They gave identical results.

On UMUI, I copied tdyoo as tdyos and removed the settings for nudging from there. The submission ID is tdyos-255174112. Then on ARCHER, I made two copies of it as tdzf#-255174112 where #=x,y and ran them as an ensemble.

So nudging may be doing something weird.

Thanks,
Masaru

comment:9 Changed 4 years ago by grenville

Am I correct in thinking that the 2 jobs I ran (copies of tdyuo and tdyup) have nudging on and that running for 144 time steps (2 days), there will have been several nudgings?

Grenville

comment:10 Changed 4 years ago by grenville

  • Reporter changed from grenville to ggxmy

comment:11 Changed 4 years ago by ggxmy

I think U, V and T are nudged every 6 hours.

I checked my daily outputs from tdyuo and tdyup at the end of the second day (20080602). The surface temperatures are already different.

Please check the attached figure. I extracted "temp" from tdyu#a.pa20080602 to NetCDF files using xconv, copied them to JASMIN, took the difference between them using nco, and opened it with ncview.

Masaru

comment:12 Changed 4 years ago by grenville

cumf is the utility to compare UM files - at the bit level (it should be in your PATH);

I ran cumf /work/n02/n02/grenvill/um/tdyuo/tdyuoa.pa20080602 /work/n02/n02/grenvill/um/tdyup/tdyupa.pa20080602

with the result:

COMPARE - SUMMARY MODE

-----------------------

Number of fields in file 1 = 4657
Number of fields in file 2 = 4657
Number of fields compared = 4657

FIXED LENGTH HEADER: Number of differences = 3
INTEGER HEADER: Number of differences = 0
REAL HEADER: Number of differences = 0
LEVEL DEPENDENT CONSTANTS: Number of differences = 0
LOOKUP: Number of differences = 4657

DATA FIELDS: Number of fields with differences = 0

So, I think we aren't considering the same runs — please point me to your tdyu#a.pa20080602 files (they aren't in /work/n02/n02/masara/um/tdyu#)

Grenville

comment:13 Changed 4 years ago by ggxmy

My outputs are in /work/n02/n02/masara/Dumps/tdyu[op]

Masaru

comment:14 Changed 4 years ago by ggxmy

Thanks.

cumf didn't work for me on espp2, but it did work on a login node.

masara@eslogin005:/work/n02/n02/masara/Dumps> cumf tdyuo/tdyuoa.pa20080602 tdyup/tdyupa.pa20080602

Number of fields in file 1 =  4657
Number of fields in file 2 =  4657
Number of fields compared  =  4657

FIXED LENGTH HEADER:        Number of differences =       1
INTEGER HEADER:             Number of differences =       0
REAL HEADER:                Number of differences =       0
LEVEL DEPENDENT CONSTANTS:  Number of differences =       0
LOOKUP:                     Number of differences =    4658
DATA FIELDS:                Number of fields with differences =    3866

So the situation is different between you and me… That might mean the problem is not in the model or data but in my environment?

Thanks.
Masaru

comment:15 Changed 4 years ago by grenville

Masaru

Your file /work/n02/n02/masara/Dumps/tdyuo/tdyuoa.pa20080602

is not the same as /work/n02/n02/grenvill/um/tdyuo/tdyuoa.pa20080602

I'm struggling to believe that we are running the same model — do you have the leave file for tdyuo?

Grenville

comment:16 Changed 4 years ago by ggxmy

That's consistent with the problem I have: identical jobs give different results.

The .leave file for this run is

/home/n02/n02/masara/output/tdyuo000.tdyuo.d15239.t185256.leave.20150907-141045

Lines starting from "uo>" are outputs for tdyuo and those from "up>" are for tdyup.

Masaru

comment:17 follow-up: Changed 4 years ago by grenville

Masaru

Our results differ because I was running with 96 processors (to fit in the short queue). I'd forgotten that UKCA doesn't bit compare over different processor decompositions.
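
The effect mentioned here is not specific to UKCA: floating-point addition is not associative, so a global sum built from a different number of partial sums can differ in its last bits. The sketch below is a generic illustration only, not UM or UKCA code. The helper `decomposed_sum` and the sample values are hypothetical, and the number of parts merely stands in for a processor decomposition.

```python
def sequential_sum(values):
    """Add values strictly left to right in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


def decomposed_sum(values, n_parts):
    """Sum contiguous chunks first, then add the partial sums.

    n_parts stands in for a processor count: each hypothetical "processor"
    reduces its own chunk before the partial results are combined.
    """
    size = -(-len(values) // n_parts)  # ceiling division
    partials = [sequential_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return sequential_sum(partials)


# Made-up values chosen so that the rounding depends on the grouping.
values = [1e16, 1.0, 1.0, 1.0]

one_part = decomposed_sum(values, 1)   # ((1e16 + 1) + 1) + 1
two_parts = decomposed_sum(values, 2)  # (1e16 + 1) + (1 + 1)

print(one_part == two_parts)
```

The explicit loop is used instead of the built-in `sum` so that the order of the additions is exactly what the comments describe. With the same inputs, the two groupings return different values, which is why runs are normally compared only across identical decompositions.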

I have rerun with 144 processors - both of my runs (tdyuo, tdyup) produced identical results and identical to your tdyuo run.

Your tdyup run starts to differ from tdyuo at timestep 87; you can see this from the leave files. Is that a significant time step?

You said earlier:

> On UMUI, I copied tdyoo as tdyos and removed the settings for nudging from there. The submission ID is tdyos-255174112. Then on ARCHER, I made two copies of it as tdzf#-255174112 where #=x,y and ran them as an ensemble.

I'm not clear what "ran them as an ensemble" means. What happens if you switch on nudging in these two jobs and run them for 144 timesteps? Do they produce different results?

Until I can reproduce your problem, I'll struggle to find its source.

Grenville

comment:18 Changed 4 years ago by Leighton_Regayre

Grenville,

My jobs xluc.g and xluc.h are identical jobs submitted through the umui. They also produce different results. Because they don't use Masaru's 'ensemble' submission process they may be easier to work with.

These jobs are a single-call adaptation of Masaru's double-call nudged job. They've been manually compiled.

Leighton.

comment:19 in reply to: ↑ 17 Changed 4 years ago by ggxmy

Replying to grenville:

Hi Grenville,

Thank you for your help. Your runs of tdyuo and tdyup produced identical results to my tdyuo!! So something happened to my tdyup? Strange!

I can see that at timestep 87 the initial Absolute Norm is the same in both of my jobs but the Final Absolute Norm is different. Is this what you mean? I don't know what happens at timestep 87 that does not happen at earlier time steps.
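
The check described here, walking through the per-timestep norms of two runs until they first disagree, can be sketched as follows. This is only an illustration: it assumes the norms have already been extracted from the two .leave files into lists of strings (the extraction is not shown, since it depends on the log layout), and the function name and sample values are hypothetical.

```python
from itertools import zip_longest


def first_divergence(norms_a, norms_b):
    """Return the 1-based timestep at which two sequences first differ.

    Values are compared exactly as printed (as strings), so any change in
    the printed digits counts as a divergence. A run that stops early is
    treated as diverging at the first missing timestep. Returns None when
    the sequences are identical.
    """
    for step, (a, b) in enumerate(zip_longest(norms_a, norms_b), start=1):
        if a != b:
            return step
    return None


# Made-up placeholder values, not taken from any real run.
run_o = ["0.100E+01", "0.200E+01", "0.300E+01"]
run_p = ["0.100E+01", "0.200E+01", "0.301E+01"]

print(first_divergence(run_o, run_p))
```

Comparing the printed strings rather than parsed floats avoids any tolerance: for a bit-comparison question, the runs either print the same digits or they do not.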

By "ran them as an ensemble" I mean I ran these jobs as one big job, using the modified version of the ensemble_ARCHER.sh script that Simon originally wrote and you sent to me earlier.

If I put nudging back into those jobs they will be the same as my other jobs. I'm quite sure that they will give me the same problem: identical jobs, different results.

Thanks,
Masaru

comment:20 Changed 4 years ago by grenville

Masaru

Leighton has provided jobs which also diverge - I'm looking at these.

I think it's worth running your 2 test jobs with nudging on (they take < 20 mins to run).

Grenville

comment:21 Changed 4 years ago by ggxmy

To turn nudging back on, I have to go back to the base job on UMUI because I removed a hand edit and a branch related to nudging.

Then I may create a job by simply turning off nudging without changing anything else, create two copies on ARCHER, run them and then turn nudging back on… That's the same thing as tdyu.o and p after all… I doubt it's worth doing.

OK, maybe I can simply turn on nudging in tdzf[xy] without worrying about the hand edit and the branch I removed. That may run OK.

[Edit] These jobs didn't run in this simple way.

Masaru

comment:22 Changed 4 years ago by grenville

Masaru

I'm trying to determine if the problem is reproducible.

Grenville

comment:23 Changed 4 years ago by ggxmy

Just a short update.

I ran tdyu.p again, by itself this time. The result is different from my o (= Grenville's o and p) and from my original p!!! That simply made a third p. This time the run started to diverge at time step 99.

Masaru

comment:24 Changed 4 years ago by ggxmy

Grenville,

Have you tried running the jobs a bit longer? It looks like the runs diverge at a different time step from one run to another. Isn't it possible that it just didn't happen within the first 144 time steps in your case?

Masaru

comment:25 Changed 4 years ago by grenville

Masaru

I have a case where jobs produced different results - but as yet no idea why.

Grenville

comment:26 Changed 4 years ago by grenville

Masaru, Leighton

I ran Leighton's job a few times - (/work/n02/n02/grenvill/um/xlkva). Comparing two dumps has revealed something interesting - here's the summary from cumf - I think we've been very lucky here in that we have found the first occurrence of a difference in the model runs:

COMPARE - SUMMARY MODE

-----------------------

Number of fields in file 1 = 38651
Number of fields in file 2 = 38651
Number of fields compared = 38651

FIXED LENGTH HEADER: Number of differences = 4
INTEGER HEADER: Number of differences = 0
REAL HEADER: Number of differences = 0
LEVEL DEPENDENT CONSTANTS: Number of differences = 0
LOOKUP: Number of differences = 0
DATA FIELDS: Number of fields with differences = 4

Field 103 : Stash Code 3 : V COMPNT OF WIND AFTER TIMESTEP : Number of differences = 1

Field 38329 : Stash Code 39012 : NON-STANDARD FIELD : Number of differences = 1

Field 38414 : Stash Code 39013 : NON-STANDARD FIELD : Number of differences = 1

Field 38584 : Stash Code 39015 : NON-STANDARD FIELD : Number of differences = 1

files DO NOT compare

Stash codes beginning with 39 are nudging diagnostics.

I have put a plot of the error in v in /work/n02/n02/grenvill/um/xlvka/v-wind-diff (this is a postscript file).

So it appears that nudging is causing the problem - are you in contact with the owners of the nudging code?

Grenville

comment:27 Changed 4 years ago by ggxmy

After off-line discussions, Simon and Grenville created a bug-fix to the nudging code;

fcm:um_br/dev/simon/vn8.4_nudging_tropopause_fix/src

With this branch included, two identical jobs, tdzip and tdziq, produced results that appear to bit compare after running for 60 days.

In the .leave files, the Absolute Norms are identical down to time step 4500. Difference plots show nothing.

Below I ran cumf on the daily output files from the end of the second month of the simulations.

masara@eslogin003:/work/n02/n02/masara/Dumps> cumf tdzip/tdzipa.pa20080130 tdziq/tdziqa.pa20080130
CUMF successful
Summary in:                        /work/n02/n02/masara/tmp/tmp.eslogin003.33839/cumf_summ.masara.d15275.t121053.46283
Full output in:                    /work/n02/n02/masara/tmp/tmp.eslogin003.33839/cumf_full.masara.d15275.t121053.46283
Difference maps (if available) in: /work/n02/n02/masara/tmp/tmp.eslogin003.33839/cumf_diff.masara.d15275.t121053.46283
masara@eslogin003:/work/n02/n02/masara/Dumps> less /work/n02/n02/masara/tmp/tmp.eslogin003.33839/cumf_summ.masara.d15275.t121053.46283

  COMPARE - SUMMARY MODE
 -----------------------

Number of fields in file 1 =  4827
Number of fields in file 2 =  4827
Number of fields compared  =  4827

FIXED LENGTH HEADER:        Number of differences =       2
INTEGER HEADER:             Number of differences =       0
REAL HEADER:                Number of differences =       0
LEVEL DEPENDENT CONSTANTS:  Number of differences =       0
LOOKUP:                     Number of differences =    4827
DATA FIELDS:                Number of fields with differences =       0
 files DO NOT compare

Although it says "files DO NOT compare", the differences are only in FIXED LENGTH HEADER and LOOKUP, and the number of differences in LOOKUP is the same as the number of fields. So this probably doesn't indicate any substantial difference?
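
Reading the summary this way can be automated. The sketch below is a hypothetical helper, not part of cumf: it assumes the summary layout shown above (one "label = count" entry per line) and reports whether any data fields differ, leaving the header and LOOKUP counts for the reader to judge.

```python
SUMMARY = """\
  COMPARE - SUMMARY MODE
 -----------------------

Number of fields in file 1 =  4827
Number of fields in file 2 =  4827
Number of fields compared  =  4827

FIXED LENGTH HEADER:        Number of differences =       2
INTEGER HEADER:             Number of differences =       0
REAL HEADER:                Number of differences =       0
LEVEL DEPENDENT CONSTANTS:  Number of differences =       0
LOOKUP:                     Number of differences =    4827
DATA FIELDS:                Number of fields with differences =       0
 files DO NOT compare
"""


def parse_cumf_summary(text):
    """Map each summary label to its integer count.

    For lines such as "LOOKUP:  Number of differences = 4827" the label is
    the text before the first colon; for lines without a colon it is the
    text before the equals sign.
    """
    counts = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        left, right = line.rsplit("=", 1)
        right = right.strip()
        if not right.isdigit():
            continue
        counts[left.split(":")[0].strip()] = int(right)
    return counts


def data_fields_match(counts):
    """True when the summary reports zero data fields with differences."""
    return counts.get("DATA FIELDS") == 0


counts = parse_cumf_summary(SUMMARY)
print(counts["DATA FIELDS"], counts["LOOKUP"], data_fields_match(counts))
```

Only the DATA FIELDS count is used for the verdict here; whether the header and LOOKUP differences matter is a separate judgement that this sketch does not attempt to make.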

Thank you so much for your help.
Masaru

comment:28 Changed 4 years ago by ggxmy

  • Resolution set to fixed
  • Status changed from new to closed