Opened 7 weeks ago

Last modified 5 weeks ago

#2949 accepted help

restarting from JULES dump files

Reported by: pmcguire Owned by: pmcguire
Component: JULES Keywords: JULES, spinup
Cc: dhertwig Platform: JASMIN
UM Version:

Description

Hi Patrick,

I'm doing some last tests/analyses with regards to the offline
evaluation paper at the moment.

Maggie had suggested to adjust the initial states of the model runs to
have a look at how this affects the comparison. Naturally there are some
differences in the spun-up states for each configuration (reflecting the
response of the model to the different settings/ancillaries), but she is
right that it would be good to quantify the differences.

So far I have always included a spin-up cycle for the model runs, so I
have dump files for each configuration we used.

I'm now having a go at adjusting the namelist to have the model start
from a dump file. Eventually I will adjust the soil temperature and
wetness to a common state, but for now I just wanted to do some sanity
checks with using the original dump file and comparing the run results
with the previous output.

I found your tutorial on how to do this for a global run
(https://research.reading.ac.uk/landsurfaceprocesses/software-examples/jules-fluxnet2015-jules-global/runningglobal/)
and I believe I have adjusted the namelists (jules_initial & jules_spinup)
accordingly, but I cannot get the model to compile.

I get the error below (at the fcm_make stage), which I can't really
decipher:

[FAIL] preprocess.prop{fpp.defs}[] = SCMA BL_DIAG_HACK COMPILER_INTEL: bad name-space
[FAIL] preprocess.prop{file-ext.h}[] = .inc: bad name-space
[FAIL] fcm make -f /work/scratch/dhertwig/cylc-run/u-ba499/work/1/fcm_make/fcm-make.cfg -C /home/users/dhertwig/cylc-run/u-ba499/share/fcm_make -j 4 # return-code=2
2019-07-05T13:52:21Z CRITICAL - failed/EXIT

This is my rose-app.conf file:
https://code.metoffice.gov.uk/trac/roses-u/browser/b/a/4/9/9/trunk/app/jules/rose-app.conf?rev=122789

And this is the changeset (i.e. from using spinup to starting from a dump):
https://code.metoffice.gov.uk/trac/roses-u/changeset/122789

My run starts at main_run_start='2011-01-01 01:00:00'

And I have used the dump file that was written out at the start of the
previous main run (after 10 spinup cycles), jules.dump.20110101.3600.nc,
but I also tested using the last spinup file
(jules.dump.spin10.20110101.3600.nc), which has very similar values.

I have not adjusted the start time in jules_times as the time for the
dump file (20110101.3600) agrees with the run start date ('2011-01-01
01:00:00').
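
As a sanity check on that last point, the time encoded in a dump file name can be turned back into a timestamp and compared with main_run_start. The sketch below assumes that the last two dot-separated fields before .nc are YYYYMMDD and seconds into that day, with fewer than 86400 seconds. This matches the file names in this ticket, but it is an assumption about the naming in general. dump_time_from_name is a hypothetical helper, not part of JULES or Rose.

```bash
# Hypothetical helper (not part of JULES): print the time encoded in a dump
# file name such as jules.dump.20110101.3600.nc as 'YYYY-MM-DD HH:MM:SS'.
# Assumes the last two fields before .nc are YYYYMMDD and seconds of day.
dump_time_from_name() (
    name=$(basename "$1" .nc)
    secs=${name##*.}          # e.g. 3600
    stem=${name%.*}
    ymd=${stem##*.}           # e.g. 20110101
    case "$ymd$secs" in
        ''|*[!0-9]*) exit 1 ;;    # not purely numeric: unexpected name
    esac
    [ "${#ymd}" -eq 8 ] || exit 1
    secs=$(printf '%s' "$secs" | sed 's/^0*//')   # avoid octal parsing
    secs=${secs:-0}
    day=$(printf '%s' "$ymd" | sed 's/^\(....\)\(..\)\(..\)$/\1-\2-\3/')
    printf '%s %02d:%02d:%02d\n' "$day" \
        $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
)

main_run_start='2011-01-01 01:00:00'
dump_time=$(dump_time_from_name jules.dump.20110101.3600.nc)
if [ "$dump_time" = "$main_run_start" ]; then
    echo "dump time matches main_run_start"
else
    echo "mismatch: dump is at '$dump_time'"
fi
```

The same function also handles the spin-up dump name (jules.dump.spin10.20110101.3600.nc), since only the last two fields are used.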

I'm a bit at a loss why this is not working as it seems more or less
straightforward.

Do you have an idea what might have gone wrong here from your experience
with starting from dump files?

Many thanks,
Denise

Change History (16)

comment:1 Changed 7 weeks ago by pmcguire

  • Status changed from new to accepted

comment:2 Changed 7 weeks ago by pmcguire

Hi Denise:
Can you give me permission on JASMIN to read your roses and cylc-run directories? You might need to give me permission to read your root directory as well. One way to do it is to make your permissions match those of my directory (/home/users/pmcguire). I would also like to have access to the directory with your dump files and output files and Rose/Cylc log files. You can keep private any other directories that you want to remain private.

Also, you might look at this CMS ticket: http://cms.ncas.ac.uk/ticket/2342
It has a similar bad name-space error message, but it uses the FLUXNET JULES suite instead of the global gridded JULES suite.
I'm not sure whether the resolution to your problem is related to what was discussed in that ticket.
Patrick

comment:3 Changed 7 weeks ago by pmcguire

Hi Denise:
Since I don't have access to your JULES source code, the following attempt to see if I can replicate your error message fails:

mkdir tmp_denise
export JULES_SOURCE=/home/users/dhertwig/jules/vn5.2_t610_fix_Sw
fcm make -f  /work/scratch/dhertwig/cylc-run/u-ba499/work/1/fcm_make/fcm-make.cfg  -C tmp_denise/ -j 4

But your suite seems to be derived from Kerry Day's GL4 Loobos suite, which Jessica Brown was working on with me and you, so this works:

mkdir tmp_denise    #if not already done
export JULES_SOURCE=/home/users/kday002/JULES/vn4p7/trunk
fcm make -f  /work/scratch/dhertwig/cylc-run/u-ba499/work/1/fcm_make/fcm-make.cfg  -C tmp_denise/ -j 4

Maybe you can try this with your own JULES source code?
Patrick

comment:4 Changed 7 weeks ago by pmcguire

Hi Patrick:
I have set chmod -R g+rX dhertwig for my home dir; so hopefully you can access the folders now?
The output is in the jasmin2/jules gws, so you should be able to access it.

Paths:
New (not working with start from dump)
Suite with dump-file crash: /home/users/dhertwig/roses/u-ba499
Cylc: /home/users/dhertwig/cylc-run/u-ba499
Output: /group_workspaces/jasmin2/jules/dhertwig/u-ba499/temp

Original (working with spinup)
Copy of the original u-ba499 suite configuration, from before the start-from-dump changes (paper results using spinup; this still runs OK): /home/users/dhertwig/roses/u-bk400
Cylc: /home/users/dhertwig/cylc-run/u-bk400
Original output / dump files used in u-ba499: /group_workspaces/jasmin2/jules/dhertwig/u-ba499/116255
Original error/output logs of original suite/config: /group_workspaces/jasmin2/jules/dhertwig/u-ba499/116255/job_out

Let me know if I forgot something or it's not what you wanted.

I will have a look at the ticket link you've sent.

Many thanks,
Denise

comment:6 Changed 7 weeks ago by pmcguire

Hi Denise
I seem to be able to compile your JULES code interactively on jasmin-sci1 without a problem, using
the commands:

mkdir tmp_denise
export JULES_SOURCE=/home/users/dhertwig/jules/vn5.2_t610_fix_Sw
fcm make -f  /work/scratch/dhertwig/cylc-run/u-ba499/work/1/fcm_make/fcm-make.cfg 

I note that your suite is currently trying to compile your JULES code on the LOTUS batch nodes instead of
with the background option (which is the same as doing it interactively). One suite that I help to manage (u-al752) has these options set in ~pmcguire/roses/u-al752/site/suite.rc.CEDA_JASMIN :

    [[JASMIN_BACKGROUND]]
        inherit = None, JASMIN

        [[[job]]]
            batch system = background

    [[FCM_MAKE_CEDA_JASMIN]]
        inherit = None, JASMIN_BACKGROUND

        [[[environment]]]
            JULES_BUILD=normal
            JULES_OMP=noomp
            JULES_PLATFORM=jasmin-lotus-intel

Maybe switching the batch system from lsf (i.e. LOTUS) to background could fix things? But you should only do this for the compile (fcm_make), not for running JULES.
Patrick

comment:7 Changed 7 weeks ago by pmcguire

Hi Patrick
Great that you were able to compile it!

I've included these changes in my /home/users/dhertwig/roses/u-ba499/suite.rc file now, but I cannot start the suite with "rose suite-run" at the moment. It complains that the suite is still running, but I cannot shut it down. I'll try to test this on a test copy of the suite instead.
Denise

comment:8 Changed 7 weeks ago by pmcguire

Hi Patrick
I cannot get this to work with the adjustments in /home/users/dhertwig/roses/u-bk557/suite.rc

Just to make sure I got this right, after making the changes (fcm-make in background) I still start the suite as usual with "rose suite-run"?

I get the following error:

[dhertwig@jasmin-sci1 u-bk557]$ rose suite-run
[INFO] export CYLC_VERSION=7.8.1
[INFO] export ROSE_ORIG_HOST=jasmin-sci1.ceda.ac.uk
[INFO] export ROSE_SITE=
[INFO] export ROSE_VERSION=2019.01.0
[INFO] create: log.20190709T135333Z
[INFO] delete: log
[INFO] symlink: log.20190709T135333Z <= log
[INFO] log.20190709T135142Z.tar.gz <= log.20190709T135142Z
[INFO] delete: log.20190709T135142Z/
[INFO] create: log/suite
[INFO] create: log/rose-conf
[INFO] symlink: rose-conf/20190709T145333-run.conf <= log/rose-suite-run.conf
[INFO] symlink: rose-conf/20190709T145333-run.version <= log/rose-suite-run.version
[INFO] delete: suite.rc
[INFO] install: suite.rc
[INFO] REGISTERED u-bk557 -> /home/users/dhertwig/cylc-run/u-bk557
[FAIL] cylc validate -o /tmp/tmpaHuHQ6 --strict u-bk557 # return-code=1, stderr=
[FAIL] WARNING - deprecated items were automatically upgraded in 'suite definition':
[FAIL] WARNING -  * (6.11.0) [runtime][JASMIN][submission polling intervals] -> [runtime][JASMIN][job][submission polling intervals] - value unchanged
[FAIL] WARNING -  * (6.11.0) [runtime][JASMIN][execution polling intervals] -> [runtime][JASMIN][job][execution polling intervals] - value unchanged
[FAIL] WARNING - naked dummy tasks detected (no entry under [runtime]):
[FAIL]  +       fcm_make
[FAIL] 'ERROR: strict validation fails naked dummy tasks'

Denise

comment:9 Changed 7 weeks ago by pmcguire

Hi Denise:
I can get your suite to finish the fcm_make step successfully.

I followed these steps:
1) I copied your suite:
cp -pr ~dhertwig/roses/u-bk557 ~pmcguire/roses/u-bk557dhertwig

2) I changed the output directory from yours in the jules gws to mine on scratch:

[namelist:jules_output]
output_dir='/work/scratch/pmcguire/u-bk557/temp'

3) I changed your suite.rc file to have:
[[fcm_make]]
instead of:
[[fcm-make]]

4) I ran rose suite-run from jasmin-cylc instead of jasmin-sci1.
We should be doing this routinely. One issue is that the cylc GUI no longer pops up on jasmin-sci1.
You can see instructions for how to configure your files on your local machine and on JASMIN in order to easily run on jasmin-cylc at:
https://research.reading.ac.uk/landsurfaceprocesses/software-examples/tutorial-rose-cylc-jules-on-jasmin/
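
The "naked dummy tasks" failure in comment:8 came from a task name (fcm_make) that had no matching [[...]] section, because the section was spelled [[fcm-make]]. A quick way to spot this kind of mismatch is to list the second-level section headings in suite.rc and look for the task name. This is only a sketch: list_sections and has_section are hypothetical helpers, they look at all [[...]] headings in the file (not just those under [runtime]), and they ignore headings that carry a trailing comment.

```bash
# Hypothetical helpers, not part of Cylc or Rose.
# Print every second-level heading ([[name]]) found in a suite.rc file.
# Third-level headings such as [[[job]]] are deliberately not matched.
list_sections() {
    grep -E '^[[:space:]]*\[\[[^][]+\]\][[:space:]]*$' "$1" |
        sed 's/^[[:space:]]*\[\[//; s/\]\][[:space:]]*$//'
}

# Succeed only if a heading with exactly this name exists.
has_section() {
    list_sections "$1" | grep -qxF -- "$2"
}

# Example usage (suite path as used in this ticket):
# has_section ~/roses/u-bk557/suite.rc fcm_make || echo "no [[fcm_make]] section"
```

Matching the whole name exactly is the point here, so that fcm-make and fcm_make are not treated as the same task.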

Does this work for you?
Patrick

comment:10 follow-up: ↓ 17 Changed 7 weeks ago by pmcguire

Patrick:
It's working now! Sorry I missed that "-" v "_" typo … but that seemed to have been the problem. I'll adjust the suite now to run/start from a dump file and will let you know if it works.
Denise Hertwig

comment:11 Changed 7 weeks ago by pmcguire

Hi Patrick,

The start-from-dump suite (/home/users/dhertwig/roses/u-bk557) is now running without a problem on jasmin-cylc! So it must have been switching fcm_make to run in the background that fixed the problem.

I have now compared the previous output (starting from scratch with spinup) to the new output from the start-from-previous-dump run. For the same scenario, the output files are not identical (according to diff), but the max and mean differences are small:

MAX DIFF, MEAN DIFF
Kup: 0.2928009, 0.00081858557
Lup: 0.69711304, 0.0010061894
QH: 5.2752686, 0.020731201
QE: 6.571204, -0.022658318
QN: 0.31533813, -0.001824752
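
For reference, statistics of this kind can be computed along the following lines. This is a sketch only: it assumes each variable has first been exported from the two NetCDF files to plain-text files with one value per line (the export step is not shown), and it reports the maximum absolute difference and the signed mean difference, which may not be exactly how the numbers above were calculated. diff_stats is a hypothetical helper.

```bash
# Hypothetical helper: compare two text files holding one number per line.
# Prints "<max absolute difference> <mean signed difference>".
# Fails (exit status 1) if the files are empty or differ in length.
diff_stats() {
    paste "$1" "$2" | awk '
        NF != 2 { bad = 1; exit }          # one file ran out of values
        {
            d = $1 - $2
            ad = (d < 0) ? -d : d
            if (ad > max) max = ad
            sum += d
            n++
        }
        END {
            if (bad || n == 0) exit 1
            printf "%.6g %.6g\n", max, sum / n
        }'
}

# Example usage (file names are placeholders):
# diff_stats QH_spinup.txt QH_from_dump.txt
```

A result of "0 0" is necessary but not sufficient for bit comparability, because the text export may itself round the values.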

Do we expect the output to be exactly the same / bit comparable when starting from a previous dump instead of repeating the spinup?

Many thanks for your help in solving this problem!!

Denise

comment:12 Changed 7 weeks ago by pmcguire

Hi Denise:
You're welcome! I am glad it now runs.
It could indeed be that running fcm_make as 'background' instead of on LOTUS solves the problem.
I would however, in this case, expect the outputs to be bit comparable after starting from a dump, if all the namelist settings are otherwise the same.
Patrick

comment:13 Changed 6 weeks ago by pmcguire

  • Cc dhertwig added; d.hertwig@… removed

Hi Patrick,

I’m still puzzled that the output from the restarted runs is not bit-comparable with the previous output, as I have changed nothing in the namelists that was not related to the start-from-dump aspect. Maybe this could have to do with truncation errors/uncertainties in the initial conditions? That is, the values in the dump files may be written with a different precision from the initial state that the model uses directly after the spin-up.
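
One way to start checking this would be to look at the types the dump file actually stores on disk. The sketch below filters the header printed by ncdump -h (from the NetCDF utilities) down to the float and double variables; var_types is a hypothetical helper. It only shows the on-disk type. Whether JULES holds the same fields at a different precision in memory is an assumption that would need checking in the source code.

```bash
# Hypothetical helper: read a CDL header (as printed by `ncdump -h`) on stdin
# and print "type name" for each float or double variable declaration.
var_types() {
    awk '$1 == "float" || $1 == "double" {
        sub(/\(.*/, "", $2)    # drop the dimension list, e.g. "(soil,"
        print $1, $2
    }'
}

# Example usage (needs ncdump; dump file as used in this ticket):
# ncdump -h /group_workspaces/jasmin2/jules/dhertwig/u-ba499/116255/jules.dump.20110101.3600.nc | var_types
```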

Cheers,
Denise

comment:14 Changed 6 weeks ago by pmcguire

Hi Denise
Can you quote for me the changes you made to your suite in order to start from the dump?
And can you tell me what the output files are for when you start from the dump and when you were running without restarting from the dump?
Patrick

comment:15 Changed 6 weeks ago by dhertwig

Hi Patrick,

I changed the namelist namelist:jules_spinup from

max_spinup_cycles=10
nvars=2
spinup_end='2012-01-01 00:00:00'
spinup_start='2011-01-01 01:00:00'
terminate_on_spinup_fail=.false.
tolerance=1.0,0.2
use_percent=.true.,.false.
var='smcl','t_soil'

to

max_spinup_cycles=0

and the namelist namelist:jules_initial from

const_val=0.0,276.78,12.1,0.0,50.0,0.759,3.0,0.0,0.0,0.0,0.0
dump_file=.false.
file='initial_conditions_phenology.dat'
nvars=11
total_snow=.true.
use_file=7*.false.,.true.,.true.,.true.,.true.
var='canopy','tstar_tile','cs','gs','rgrain','sthzw','zw',
   ='sthuf','t_soil','snow_tile','lai'
var_name=11*''

to

dump_file=.true.
file='/group_workspaces/jasmin2/jules/dhertwig/u-ba499/116255/jules.dump.20110101.3600.nc'
total_snow=.true.

These are the locations of the output files:

(1) Continuation from spin-up (fcm_make in background):

/group_workspaces/jasmin2/jules/dhertwig/u-bk557/temp/new_compile_spin/jules.hourly_snapshot.nc

--> This output is bit comparable with /group_workspaces/jasmin2/jules/dhertwig/u-ba499/116255/jules.hourly_snapshot.nc (fcm_make on lsf system; continuation from spinup)

(2) Restart from dump file (fcm_make in background):

/group_workspaces/jasmin2/jules/dhertwig/u-bk557/temp/new_compile_dump_jules.dump.20110101.3600/jules.hourly_snapshot.nc

Cheers,
Denise

comment:16 Changed 6 weeks ago by dhertwig

I asked Maggie about whether we can expect the output to be bit comparable. This is what she replied:


"So to clarify the two runs that you have compared are the output started as a continuation of the spin up (i.e. the original output) and the run started from the spun up dump for that run, so in a perfect scenario with infinite precision we would expect them to be identical.

I'm pretty sure that JULES from a start dump (CRUN in UM speak) is not bit comparable to JULES standalone as a continuation from the spin-up (NRUN in UM speak) in the same way as you can have in the UM with the bit reproducible code/build. I think there are always rounding differences when written to the dump as you've said. However, I don't know how much the perturbation grows. The JULES online documentation (https://jules-lsm.github.io/latest/output.html#dump-files) says that there are dumps:

  1. After initialisation is complete, immediately before the start of the run (initial state).
  2. Before starting the main run.

I know there are more, but these are the ones that might be useful here. I'm not sure how these two differ, but as a sanity check you could compare the NRUN-A & CRUN-A and NRUN-B & CRUN-B to see if there is a noticeable difference."


Based on the dump files I have from the run, I don't think I have the two Maggie lists. There are

jules.dump.20110101.3600.nc  ---> start of main run (that's what I used to re-start)
jules.dump.20120101.0.nc ---> start of year 1
jules.dump.20130101.0.nc ---> start of year 2
jules.dump.20140101.0.nc ---> start of year 3

plus the spinup dumps (e.g. in /group_workspaces/jasmin2/jules/dhertwig/u-bk557/temp/new_compile_spin).

comment:17 in reply to: ↑ 10 Changed 5 weeks ago by dhertwig

Replying to pmcguire:

Patrick:
It's working now! Sorry I missed that "-" v "_" typo … but that seemed to have been the problem. I'll adjust the suite now to run/start from a dump file and will let you know if it works.
Denise Hertwig


NB:

You need to remove the /home/users/<username>/cylc-run/<suite-id> and /work/scratch/<username>/cylc-run/<suite-id> directories if the suite has already been compiled before; otherwise running fcm_make in the background does not work (it fails with the same bad name-space error as before).
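
A minimal sketch of that clean-up, assuming the suite is no longer running and that nothing in those directories needs to be kept. clean_cylc_run is a hypothetical helper; the two default roots follow the paths given above, and the optional second and third arguments exist only so the function can be tried on test directories first.

```bash
# Hypothetical helper: remove the cylc-run directories of one suite.
# Usage: clean_cylc_run SUITE_ID [HOME_ROOT] [SCRATCH_ROOT]
clean_cylc_run() (
    suite_id=${1:-}
    user=${USER:-$(id -un)}
    home_root=${2:-/home/users/$user}
    scratch_root=${3:-/work/scratch/$user}
    case "$suite_id" in
        ''|.|..|*/*)
            # Refuse anything that is not a plain directory name.
            echo "usage: clean_cylc_run SUITE_ID [HOME_ROOT] [SCRATCH_ROOT]" >&2
            exit 1 ;;
    esac
    for dir in "$home_root/cylc-run/$suite_id" \
               "$scratch_root/cylc-run/$suite_id"; do
        if [ -e "$dir" ] || [ -L "$dir" ]; then
            echo "removing $dir"
            rm -rf -- "$dir"
        fi
    done
)

# Example usage (suite id as used in this ticket):
# clean_cylc_run u-bk557
```

The function runs in a subshell, so its variables do not leak into the calling shell, and it only ever touches the two directories named after the given suite id.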
