Opened 3 months ago

Closed 3 months ago

#3036 closed help (fixed)

Not getting Identical result after restarting a job using dump file

Reported by: akpandeyjnu
Owned by: um_support
Component: UKCA
Keywords: bit comparison
Cc:
Platform: ARCHER
UM Version: 11.0

Description

Hello,

After analysing data from the u-bj506 run, I want to explore the boundary layer depth in the model, which is not output by the u-bj506 job. To do so, I copied the u-bj506 job, added boundary layer diagnostics and used a dump file (bj506a.da20040901_00) to initialise a new run.

I have done the following steps:
A) copied the dump file (bj506a.da20040901_00) to my work directory (/work/n02/n02/alok)
B) copied suite u-bj506 -> u-bn439

I have made the following changes in u-bn439:
1) changed the start dump to bj506a.da20040901_00 (um -> namelist -> Model Input and output -> Dumping and Meaning -> astart)
2) changed the model basis time to 20040901T0000Z (Suite conf -> Run Initialisation and Cycling -> Model basis time)
3) added an hourly boundary layer diagnostic
4) switched on the monthly boundary layer diagnostic
5) turned on the reconfiguration and ran the model

The model is running well and archiving results to the RDF. I have checked the surface temperature in the previous job and the current job at a particular date and time (2005/01/01:01.00) and am getting non-identical results (they are similar but not exactly the same).

I have used xconv to visualise the results of the new run (u-bn439) and the old run (u-bj506), and I am attaching a screenshot. The left side is from the new run (u-bn439) and the right is from the old run (u-bj506). Both outputs are similar but the temperature range is different: 205.96 K - 330.72 K in the new job and 202.42 K - 331.60 K in the old job.
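Comparing colour-bar ranges by eye shows that two fields differ, but not whether a restart is bit-reproducible. A stricter check compares the raw bit patterns of every value. The following is only a minimal sketch: it assumes the two fields have already been read into Python lists of floats by some PP reader (not shown), the function names are hypothetical, and the sample values are just the range end points quoted above rather than real field data.

```python
import struct


def bit_identical(a, b):
    """Return True only if both sequences hold the same 64-bit float bit patterns."""
    if len(a) != len(b):
        return False
    return all(struct.pack("<d", x) == struct.pack("<d", y) for x, y in zip(a, b))


def first_difference(a, b):
    """Return (index, a_value, b_value) for the first differing element, or None."""
    for i, (x, y) in enumerate(zip(a, b)):
        if struct.pack("<d", x) != struct.pack("<d", y):
            return i, x, y
    return None


# Illustrative stand-ins for two surface temperature fields (K); not real model data.
new_run = [205.96, 288.15, 330.72]
old_run = [202.42, 288.15, 331.60]

if not bit_identical(new_run, old_run):
    print("fields differ, first difference:", first_difference(new_run, old_run))
```

Packing each value to bytes, rather than using `==` on the floats, also distinguishes cases such as 0.0 and -0.0 that compare equal numerically but are not the same bits.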

A sample file from the new job:
/nerc/n02/n02/alok/archive/u-bn439/20050101T0000Z/bn439a.pb20050101.pp
A sample file from the old job:
/nerc/n02/n02/alok/archive/u-bj506/20050101T0000Z/bj506a.pb20050101.pp

I am expecting exactly identical results. Why are they not the same? Am I doing something wrong?

Regards, Alok

Attachments (3)

CMS_help.PNG (217.2 KB) - added by akpandeyjnu 3 months ago.
UM_inputs.png (4.7 KB) - added by willie 3 months ago.
Reconfiguration and UM inputs
UKCA_non_identical_result_issue.PNG (243.0 KB) - added by akpandeyjnu 3 months ago.


Change History (10)

Changed 3 months ago by akpandeyjnu

comment:1 Changed 3 months ago by willie

Hi Alok,

In u-bj506/rose-suite.conf you have switched the reconfiguration off, so you're not processing identical inputs.

Willie

comment:2 Changed 3 months ago by akpandeyjnu

Hi Willie,

Thanks for the response.

I have submitted a new job (u-bn600) without switching on the reconfiguration and am still getting a non-identical result.

The 'fcm diff' output is as follows:


--- app/um/rose-app.conf (revision 134148)
+++ app/um/rose-app.conf (working copy)
@@ -1183,7 +1183,7 @@
 [namelist:nlcfiles]
 !!alabcin1='unset'
 !!alabcin2='unset'
-astart='/work/n02/n02/alok/bj506a.da20141201_00'
+astart='/work/n02/n02/alok/bj506a.da20040901_00'
 atmanl='unset'
 !!iau_inc='unset'
 !!obs01='unset'
@@ -3799,7 +3799,7 @@
 tim_name='TMPMN00'
 use_name='UPMEAN'

-[namelist:umstash_streq(00025_2c52ca2b)]
+[namelist:umstash_streq(00025_2c52ca2b)]
 dom_name='DIAG'
 isec=0
 item=25
@@ -3807,6 +3807,14 @@
 tim_name='TDMPMN'
 use_name='UPMEAN'

+[namelist:umstash_streq(00025_d25265b9)]
+dom_name='DIAG'
+isec=0
+item=25
+package='Dump Mean Diagnostics'
+tim_name='T1H'
+use_name='UPB'
+
 [namelist:umstash_streq(00028_301cdcd4)]
 dom_name='DIAG'
 isec=0
Index: rose-suite.conf
===================================================================
--- rose-suite.conf (revision 134148)
+++ rose-suite.conf (working copy)
@@ -8,7 +8,7 @@
 !!ACCOUNT_MONSOON=
 !!ACCOUNT_USR=
 ANCIL_OPT_KEYS=
-ARCHER_GROUP='n02-chem'
+ARCHER_GROUP='n02-NES009019'
 ARCHER_QUEUE='standard'
 ARCH_LOG=false
 ARCH_WALL=false
@@ -15,7 +15,7 @@
 ATM_PPN=24
 ATM_PROCX=24
 ATM_PROCY=15
-BASIS='20141201T0000Z'
+BASIS='20040901T0000Z'
 BITCOMP_NRUN=false
 BUILD_UM=true
 CALENDAR='gregorian'


I have deleted the previous dump file and copied it again from the archive. Further, I have used chmod 444 to make it read-only. The model runs well and produces similar results, but they are not exactly identical. Am I still doing something wrong?
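One way to confirm that a working copy of a dump still matches the archived copy (for example before and after a run) is to compare checksums. This is a minimal sketch using only the Python standard library; the function names are hypothetical and the paths in the usage comment are placeholders to be replaced with the real archive and work locations.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file_contents(path_a, path_b):
    """Return True if both files have identical contents (by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)


# Usage (placeholder paths, assumed for illustration only):
# same_file_contents("/path/to/archived/dump", "/path/to/work/dump")
```

Reading in chunks keeps memory use small, which matters because dump files can be large.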

Regards, Alok

Changed 3 months ago by willie

Reconfiguration and UM inputs

comment:3 Changed 3 months ago by willie

Hi Alok,

I think one of the dumps has been overwritten. The situation is this:

u-bj506

BASIS 20141201T0000Z
RCF FALSE
AINITIAL /work/n02/n02/ukca/initial/N96eL85/au917a.da20080901_00
astart /work/n02/n02/alok/bj506a.da20141201_00

u-bn439

BASIS 20040901T0000Z
RCF TRUE
AINITIAL as above
astart /work/n02/n02/alok/bj506a.da20040901_00

In u-bj506 AINITIAL is irrelevant because the reconfiguration is off.

In u-bn439 the reconfiguration is on, so the reconfigured AINITIAL overwrites astart, i.e. it damages /work/n02/n02/alok/bj506a.da20040901_00.

So the u-bj506 UM processed the 20141201_00 dump, whereas u-bn439 processes the reconfigured 20080901_00 dump.

I have attached a little diagram showing the relation between AINITIAL and ASTART. If you switch the reconfiguration off then you need to set ASTART appropriately in this suite.
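The relationship described in this comment can be written down as a small decision function. This is only a simplified model of the explanation given here, not UM code: the function name and the returned keys are hypothetical, and the paths are the ones quoted in this ticket.

```python
def dump_read_by_um(rcf, ainitial, astart):
    """Model which file the UM reads and where its contents come from.

    rcf=True:  the reconfiguration reads AINITIAL and writes its output to
               ASTART, so any existing ASTART file is overwritten.
    rcf=False: AINITIAL is ignored and the UM reads ASTART as it stands.
    """
    if rcf:
        return {"um_reads": astart, "contents_from": ainitial, "astart_overwritten": True}
    return {"um_reads": astart, "contents_from": astart, "astart_overwritten": False}


AINITIAL = "/work/n02/n02/ukca/initial/N96eL85/au917a.da20080901_00"

# u-bj506: reconfiguration off
bj506 = dump_read_by_um(False, AINITIAL, "/work/n02/n02/alok/bj506a.da20141201_00")
# u-bn439: reconfiguration on
bn439 = dump_read_by_um(True, AINITIAL, "/work/n02/n02/alok/bj506a.da20040901_00")

for name, result in (("u-bj506", bj506), ("u-bn439", bn439)):
    print(name, result)
```

In this model the two suites only start from the intended dump when the reconfiguration is off and astart points at that dump.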

Willie

comment:4 Changed 3 months ago by akpandeyjnu

Hi Willie,

Thanks for your reply.

I have copied u-bj506 to create a new job, u-bn906.

I have made the following changes in u-bn906:
1) suite conf -> Host Machine -> Archer -> Account group for HPC tasks (n02-NES009019)
2) um -> namelist -> Model Input and output -> Dumping and Meaning -> astart (/work/n02/n02/alok/bj506a.da20030901_00): changed the dump file
3) suite conf -> Run Initialisation and Cycling -> Model basis time (20030901T0000Z)

The differences between the two jobs are as follows:


akpandeyjnu@puma:/home/akpandeyjnu/roses> rose config-dump -C u-bj506
[INFO] chdir: u-bj506/
akpandeyjnu@puma:/home/akpandeyjnu/roses> rose config-dump -C u-bn906
[INFO] chdir: u-bn906/
akpandeyjnu@puma:/home/akpandeyjnu/roses> diff -r u-bj506 u-bn906
Only in u-bj506/app/um/opt: rose-app-gregorian.conf~
diff -r u-bj506/app/um/rose-app.conf u-bn906/app/um/rose-app.conf
1186c1186
< astart='/work/n02/n02/alok/bj506a.da20141201_00'
---
> astart='/work/n02/n02/alok/bj506a.da20030901_00'
Only in u-bj506/app/um: rose-app.conf~
diff -r u-bj506/rose-suite.conf u-bn906/rose-suite.conf
11c11
< ARCHER_GROUP='n02-chem'
---
> ARCHER_GROUP='n02-NES009019'
18c18
< BASIS='20141201T0000Z'
---
> BASIS='20030901T0000Z'
diff -r u-bj506/rose-suite.info u-bn906/rose-suite.info
7c7
< description=Copy of u-bj309/trunk@119208
---
> description=Copy of u-bj506/trunk@133544
10c10
< project=UKCA11.0_Nudged_cp_309
---
> project=UKCA11.0_Nudged_cpu-bj506
Only in u-bj506/site: archer.rc~
Only in u-bn906/.svn/pristine: 01
Only in u-bj506/.svn/pristine: 43
Only in u-bj506/.svn/pristine: 65
Only in u-bj506/.svn/pristine: d5
Files u-bj506/.svn/wc.db and u-bn906/.svn/wc.db differ
akpandeyjnu@puma:/home/akpandeyjnu/roses>
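A recursive diff also lists backup files and .svn metadata, which are not model settings. A comparison reduced to the settings themselves can be sketched as below. This is a rough illustration, not a Rose tool: the function names are hypothetical, the parser only handles simple `[section]` and `key=value` lines (multi-line values are not supported), and the sample text is limited to a few rose-suite.conf lines quoted in this ticket.

```python
def parse_settings(text):
    """Map (section, key) -> value for simple '[section]' / 'key=value' lines."""
    settings = {}
    section = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif "=" in line:
            key, value = line.split("=", 1)
            settings[(section, key.strip())] = value.strip()
    return settings


def changed_settings(old_text, new_text):
    """Return {(section, key): (old_value, new_value)} for settings that differ."""
    old, new = parse_settings(old_text), parse_settings(new_text)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(set(old) | set(new))
        if old.get(key) != new.get(key)
    }


old_suite = "ARCHER_GROUP='n02-chem'\nBASIS='20141201T0000Z'\nBITCOMP_NRUN=false\n"
new_suite = "ARCHER_GROUP='n02-NES009019'\nBASIS='20030901T0000Z'\nBITCOMP_NRUN=false\n"

for (section, key), (before, after) in changed_settings(old_suite, new_suite).items():
    print(section or "(top level)", key, before, "->", after)
```

Settings that are identical in both suites, such as BITCOMP_NRUN here, do not appear in the result, so a flag that is the same but set unhelpfully in both still has to be checked separately.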


The reconfiguration is switched off in both runs. I have not changed anything except the dump file and the model basis time, but I am still getting different results. I am attaching a screenshot of the results. As I understand it, they should be exactly the same. What else can I do to get exactly the same results?

Regards, Alok

Changed 3 months ago by akpandeyjnu

comment:5 Changed 3 months ago by willie

Hi Alok,

I think I understand a little better what has happened even if I don't know the answer.

The suite u-bj506 has been run for 18 years and 3 months, but this has been done in at least two steps: one beginning at 19970901 and a second beginning at 19880901 and lasting the standard length of 20 years and 6 months. The first month in the archive is 19970901 and the last is 20151101, so there must have been a third step. At any rate, the rose suite u-bj506 does not match the archived data, which makes detailed analysis difficult.
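The date arithmetic behind this can be checked with a few lines of Python. This is a sketch only: the helper names are hypothetical, and it assumes whole calendar months and that the last archived month (20151101) is complete, so the archived period ends on 20151201.

```python
from datetime import date


def months_between(start, end):
    """Whole months from start to end, both taken as the first day of a month."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def add_months(start, n):
    """Return the first day of the month n months after start."""
    total = start.year * 12 + (start.month - 1) + n
    return date(total // 12, total % 12 + 1, 1)


first_month = date(1997, 9, 1)     # first month in the archive
archive_end = date(2015, 12, 1)    # end of the last archived month (assumed complete)
years, months = divmod(months_between(first_month, archive_end), 12)
print(f"archived period: {years} years and {months} months")

# A standard-length step of 20 years and 6 months starting at 19880901.
step_end = add_months(date(1988, 9, 1), 20 * 12 + 6)
print("standard-length step would end on", step_end.isoformat())
```

Under these assumptions a standard-length step starting at 19880901 ends before the last archived month, which is consistent with the inference that a further step was run.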

The suite u-bn906 was then seeded with a start dump from the u-bj506 archive (bj506a.da20030901_00) and run for 15 months. Finally, the screenshot of the non-identical UKCA results was produced by comparing the file pb20030925.pp in the u-bj506 archive with the corresponding file in the u-bn906 archive, and as you say they are slightly different.

In theory, I think they would have been identical had the u-bj506 run been continuous. You may also need to set the bit compare flag on the suite conf → Run initialisation and cycling page to achieve this; see the help on that page.

I hope that helps.

Willie

comment:6 Changed 3 months ago by akpandeyjnu

Hi Willie,

Many thanks for the reply. It really helps.

I have changed only 'BITCOMP_NRUN=true' (suite conf -> Run Initialisation and Cycling) and now I am getting exactly the same results.
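In this ticket the change was made through the suite conf panel. For anyone scripting the same single-setting change across copies of a suite, a minimal sketch might look like the following. The function name is hypothetical, it treats the file as plain `KEY=value` text, and the sample lines are the rose-suite.conf entries quoted earlier in this ticket.

```python
def set_suite_flag(conf_text, key, value):
    """Return conf_text with 'key=value' set, appending the line if key is absent."""
    lines = conf_text.splitlines()
    found = False
    for i, line in enumerate(lines):
        if "=" in line and line.split("=", 1)[0].strip() == key:
            lines[i] = f"{key}={value}"
            found = True
    if not found:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


sample = "BASIS='20030901T0000Z'\nBITCOMP_NRUN=false\nBUILD_UM=true\n"
updated = set_suite_flag(sample, "BITCOMP_NRUN", "true")
print(updated)
```

Only the named key is rewritten; every other line is passed through unchanged, so the result can be diffed against the original to confirm that nothing else moved.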

You may close this ticket. Thank you

Regards, Alok

comment:7 Changed 3 months ago by willie

  • Keywords bit comparison added
  • Resolution set to fixed
  • Status changed from new to closed