Opened 22 months ago

Closed 21 months ago

Last modified 18 months ago

#1719 closed help (fixed)

Error with CRUN, NEMO restart dump not consistent with the UM .xhist file

Reported by: dilshadshawki Owned by: annette
Priority: normal Component: UM Model
Keywords: NEMO, restart dump, CRUN, Cc:
Platform: MONSooN UM Version: 8.2

Description

Hello Helpdesk,

I have been trying to run job xlzcg. The NRUN completes and creates its output successfully, but when I attempt a CRUN I get the following message in the .leave file:

/home/dshawk/output/xlzcg000.xlzcg.d15308.t145403.leave
ERROR: The latest NEMO restart dump does not seem to be
       consistent with the UM .xhist file
       This suggests an untidy model failure but you
       may be able to retrieve the run by copying the
       backup dump and xhist files to original location
       and restarting with the appropriate NEMO dump

There is also a .comp.leave file, which I did not expect to be produced, since I am no longer compiling with the CRUN:

/home/dshawk/output/xlzcg000.xlzcg.d15308.t145403.comp.leave

Earlier, I tried to run this job using the restart dumps produced by the run in the projects folder: /projects/ukca-imp/dshawk/xlzcg

However, it produces ice restart files that begin from 02 (February), even though the ice restart dump I used begins in 09 (September). This may be because the dump needs resetting with modify_CICE_header, but that tool no longer exists since the move to the new Cray system, so instead I used the following fix, as instructed in ticket #1667:

UM: fcm:um-br/jwalton/vn8.2_NEMOCICE_restart_fixes_UKMO
CICE: fcm:cice-br/dev/jwalton/vn4.1m1_restart_date_fix_UKMO

I thought this might be a side issue. In any case, I tried to rerun with the restart dumps produced by the run:
/home/dshawk/startdumps/xlzcg/
xlzcga.da21011201_00
xlzcgo_21011201_restart_0000.nc
anqdhi.restart.1999-12-01-00000

I used a different CICE restart dump (not produced by the run) in order to keep the same month (12), but this gave me a different error. So I thought I had better not mess things up too much and instead rerun everything from scratch with new start dumps:
/home/dshawk/startdumps/xlzcg/
xkhqaa.da23000901_00
xkhqao_23000901_restart.nc
xkhqai.restart.2300-09-01-00000

Once again, though, I get the same error in the .leave file shown above.
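
One way to rule out mismatched start dumps before submitting is to compare the dates embedded in the three file names. The sketch below is only an illustration: the regular expressions are inferred from the file names quoted in this ticket rather than from any naming specification, and the helper names `dump_date` and `dates_consistent` are hypothetical, not part of the UM, NEMO or CICE scripts.

```python
import re

# Patterns inferred from the file names quoted in this ticket (assumptions,
# not an official naming convention).
PATTERNS = [
    # UM atmosphere dump, e.g. xkhqaa.da23000901_00
    re.compile(r"\.da(\d{4})(\d{2})(\d{2})_\d{2}$"),
    # NEMO restart, e.g. xkhqao_23000901_restart.nc or ..._restart_0000.nc
    re.compile(r"o_(\d{4})(\d{2})(\d{2})_restart(?:_\d+)?\.nc$"),
    # CICE restart, e.g. xkhqai.restart.2300-09-01-00000
    re.compile(r"\.restart\.(\d{4})-(\d{2})-(\d{2})-\d{5}$"),
]


def dump_date(filename):
    """Return (year, month, day) parsed from a start dump name, or None."""
    for pattern in PATTERNS:
        match = pattern.search(filename)
        if match:
            return tuple(int(part) for part in match.groups())
    return None


def dates_consistent(filenames):
    """Return True if every file name carries the same (year, month, day)."""
    dates = {dump_date(name) for name in filenames}
    return len(dates) == 1 and None not in dates


if __name__ == "__main__":
    new_dumps = [
        "xkhqaa.da23000901_00",
        "xkhqao_23000901_restart.nc",
        "xkhqai.restart.2300-09-01-00000",
    ]
    mixed_dumps = [
        "xlzcga.da21011201_00",
        "xlzcgo_21011201_restart_0000.nc",
        "anqdhi.restart.1999-12-01-00000",
    ]
    for label, dumps in (("new", new_dumps), ("mixed", mixed_dumps)):
        print(label, [dump_date(d) for d in dumps], dates_consistent(dumps))
```

The check only looks at file names. It says nothing about the date stored in the CICE restart header, which is the value that modify_CICE_header edits.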

Your help would be very much appreciated.

Cheers,
Dill

Change History (14)

comment:1 Changed 22 months ago by annette

  • Owner changed from um_support to annette
  • Status changed from new to assigned

Hi Dill,

The directory /projects/ukca-imp/dshawk/xlzcg does not exist on XCM anymore. Have you moved it elsewhere? As you've found, the information in the leave file is pretty minimal, so I can't tell what's going on without looking at the job output and namelists. It also looks like you have edited the UMUI job since submitting.

As for the compile-CRUN issue, if you look in the comp.leave file it is only trying to build the UM scripts, which it will allow you to do with a CRUN. To switch off the script build, go to Compile and Run Options → UM Scripts Build and deselect "Enable build of UM scripts".

As far as I know, the branches I mentioned in #1667 should work. I haven't tested them myself but they are being used in other coupled jobs. If you are having trouble with this, use a separate UMUI job, and point me to the leave files so that I can see exactly what is going on.

Best regards,
Annette

comment:2 Changed 22 months ago by dilshadshawki

Hi Annette,

Please allow me to clarify the issue:

First of all, it is very strange that the xlzcg folder didn't exist!

What I did was run the job. The NRUN completes fine, which is how I know the branches that replace modify_CICE_header don't cause any initial problems. However, the model outputs CICE start dumps from January (as if the original restart dump was from December), when it should be outputting start dumps from October (since the original restart dump was actually from September).

I went on to do a CRUN and I get the error:

ERROR: The latest NEMO restart dump does not seem to be
       consistent with the UM .xhist file
       This suggests an untidy model failure but you
       may be able to retrieve the run by copying the
       backup dump and xhist files to original location
       and restarting with the appropriate NEMO dump

And I think that this error is caused by the CICE startdump since it is in the wrong season.

Now, there is a folder xlzcg here:

/projects/ukca-imp/dshawk/xlzcg

You should now be able to see the output and namelists etc.

Hope this clarifies things!

Cheers,
Dill

comment:3 Changed 22 months ago by annette

Hi Dill,

Thanks for clarifying. I noticed that you hadn't included the UM branch in your job:

fcm:um-br/dev/jwalton/vn8.2_NEMOCICE_restart_fixes_UKMO

There was a typo in the name of this branch in #1667 which I have corrected.

However, I have tested this in a copy of your job and CICE still seems to have the wrong date. I will look into this, but in the meantime I have re-built modify_CICE_header for XCM:

~aospre/bin/modify_CICE_header

This version asks for the value of oceanmixed_ice. Try specifying T first, and if it tries to read past the end of the file, set it to F instead.

Also, I noticed that the archiving scripts for NEMO/CICE you have included in the job are failing. Do you want some help with that? I have versions for the vn7.3 coupled model which work but they might be different in some way.

Annette

comment:4 Changed 22 months ago by dilshadshawki

Hi Annette,

Thanks for looking into this, please let me know when you find out more.

These are the things I've tried:

1. I ran the job with the CICE restart reset using modify_CICE_header, but my branch remained as

fcm:/um/branches/dev/jwalton/vn8.2_NEMOCICE_restart_fixes_UKMO

and not this:

fcm:um-br/dev/jwalton/vn8.2_NEMOCICE_restart_fixes_UKMO

The CICE files output by the job still did not have the correct dates.

2. I then tried changing the branch to fcm:um-br/dev… (as in your one above) but I got an error during the compilation mentioning something like:
[FAIL] /projects/ukca-imp/dshawk/xlzcg/baserepos/JULES: cannot locate config file, abort at /work/home/fcm/fcm-2015.05.0/bin/../lib/FCM1/ConfigSystem.pm line 539

But after changing the CICE restart dump from the one I reset using modify_CICE_header to the one I did not reset, this error goes away. The same thing then happens (as you have found): the CICE dates are still incorrect, even though the branch was correct.

I will wait to hear back from you on this issue.

As for the NEMO and CICE archiving scripts, are you referring to the ones specified in
Atmosphere → Control → Postprocessing, Dumping and Meaning - User Script Release?

and the files are:

/home/dshawk/hadgem3_scripts

In any case, yes please, could you help me with that?

Many thanks,
Dill

comment:5 Changed 21 months ago by dilshadshawki

Hi Annette,

Any news regarding the CICE issue and the others above?

Cheers,
Dill

comment:6 Changed 21 months ago by annette

Dill,

I will work on fixing the post-processing scripts this afternoon.

Sorry again, I'm slightly confused about what does and doesn't work for you… If you run without the new branches but with the CICE restart edited with modify_CICE_header, does that work? This works for me, and it is the solution I'd recommend for now.

Annette

comment:7 Changed 21 months ago by dilshadshawki

Hi Annette,

You are correct, not including the new branches and using the CICE restart edited with modify_CICE_header does work for me! Now the outputted CICE files and dumps have the correct dates.

But then I still get the original error as before, which I thought might have been because of the CICE issue. There must be something else going on.

Do you know what the reason behind this error could be?

ERROR: The latest NEMO restart dump does not seem to be
       consistent with the UM .xhist file
       This suggests an untidy model failure but you
       may be able to retrieve the run by copying the
       backup dump and xhist files to original location
       and restarting with the appropriate NEMO dump
ERROR: Expected NEMO output files are not all available.
       This may be a UM / OASIS / NEMO start-up problem.
       The ocean.output file may provide more information.

/home/dshawk/output/xlzcg000.xlzcg.d15321.t161520.leave

Best wishes,
Dill

comment:8 Changed 21 months ago by annette

Hi Dill,

This is an issue with the scripts not reading the .xhist file properly. Please include the following branch in the "FCM Options for UM…" window, under "Use Central Script Modifications":

fcm:um_br/dev/annette/vn8.2_NEMOCICE_restart_fixes_XC40_ukmo/src

Annette

comment:9 Changed 21 months ago by annette

Btw, I think you should be able to just build the scripts with the CRUN on without re-doing the NRUN…

Annette

comment:10 Changed 21 months ago by annette

Hi Dill,

As for the post-processing scripts, there are Cray versions available here (note I haven't tested these):

/projects/ocean/hadgem3/scripts/GC2.0_XC40

So you can either use these, or just change the executables at the top of your cice_mean.sh and nemo_mean.sh files to use:

nrebuild=/projects/ocean/nemo/utils/bin/rebuild_nemo
ncavg=/projects/ocean/nemo/nco/nco-default/bin/ncra
mean_trnd3d=/projects/ocean/hadgem3/scripts/GC2.0_XC40/mean_nemo.exe
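
Before editing cice_mean.sh and nemo_mean.sh, it may be worth confirming that these three paths are present and executable on the machine where the scripts run. The following is a minimal sketch: the paths are copied from this comment, while the `missing_tools` helper is hypothetical and not part of the post-processing scripts.

```python
import os

# Paths copied from the comment above; whether they exist depends on the
# machine the post-processing scripts run on.
TOOLS = {
    "nrebuild": "/projects/ocean/nemo/utils/bin/rebuild_nemo",
    "ncavg": "/projects/ocean/nemo/nco/nco-default/bin/ncra",
    "mean_trnd3d": "/projects/ocean/hadgem3/scripts/GC2.0_XC40/mean_nemo.exe",
}


def missing_tools(tools):
    """Return the sorted names whose path is not an executable regular file."""
    return sorted(
        name
        for name, path in tools.items()
        if not (os.path.isfile(path) and os.access(path, os.X_OK))
    )


if __name__ == "__main__":
    problems = missing_tools(TOOLS)
    for name in problems:
        print(f"{name}: {TOOLS[name]} is missing or not executable")
    if not problems:
        print("All three executables were found.")
```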

Hope this helps,

Annette

comment:11 Changed 21 months ago by dilshadshawki

Hi Annette,

You have been a star and my hero. I managed to run the job for a year, finally!

So all is fixed, I hope! Let's close this ticket, shall we?

Many thanks again for all of your help!

Best,
Dill

comment:12 Changed 21 months ago by annette

  • Resolution set to fixed
  • Status changed from assigned to closed

Brilliant! Thanks for letting us know.

Annette

comment:13 Changed 18 months ago by dilshadshawki

Hi Annette,

The modify_CICE_header tool now gives me the error message 'Illegal instruction', which appears after I enter the new value for time. It has been working fine up until now :-( Do you know what could be going on? Below is the procedure I have always followed, now with the error message:

[dshawk@exppostproc01:~/startdumps/xlzcj]$ /home/aospre/bin/modify_CICE_header
 Enter input file name:
xlzcji.restart.2321-09-01-00000
 Enter output file name:
xlzcji.restart.2321-09-01-00000_reset

 Please enter the name of the grid used in this file
 choose from ORCA2, ORCA1, ORCA025, CUSTOM (note case)
ORCA1

 Enter the value of oceanmixed_ice 'T' or 'F'
T
 oceanmixed_ice = .true.

 Time header variables:
 istep0:  183360
 time:  518400000.
 time_forc:  0.

 Enter new value for istep0 [N for no change]:
0
 Enter new value for time [N for no change]:
23328000
Illegal instruction
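
The time values in this session are easier to sanity-check as elapsed calendar time. The sketch below assumes that `time` is in seconds and that the model uses a 360-day calendar (12 months of 30 days); neither is stated in the ticket, and the helper name `elapsed` is hypothetical. It does not explain the 'Illegal instruction' error itself.

```python
SECONDS_PER_DAY = 86400
DAYS_PER_MONTH = 30    # assumption: 360-day calendar
MONTHS_PER_YEAR = 12


def elapsed(seconds):
    """Split a time in seconds into whole (years, months, days) elapsed."""
    days, remainder = divmod(int(seconds), SECONDS_PER_DAY)
    if remainder:
        raise ValueError("time is not a whole number of days")
    months, day = divmod(days, DAYS_PER_MONTH)
    years, month = divmod(months, MONTHS_PER_YEAR)
    return years, month, day


if __name__ == "__main__":
    for label, value in (("header time", 518400000), ("new time", 23328000)):
        years, months, days = elapsed(value)
        print(f"{label}: {value} s = {years} y, {months} m, {days} d elapsed")
```

Under these assumptions the header value of 518400000 corresponds to 16 years and 8 months elapsed, and the entered value of 23328000 to 9 months elapsed.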


Many thanks,
Dill

comment:14 Changed 18 months ago by annette

comment:13 moved to new ticket #1823

Annette
