job failed (xhgzd)

Keywords: hand edit, tic code, checksum error, climate mean, OT18: DIATOM-CHLORO; CFC12; EXTRA-C
My job (xhgzd) failed to run, and I cannot figure out what is wrong. I cannot find any explicit error message in the .leave file, only information like:

/work/n02/n02/dh023729/xhgzd/bin/qsmaster: Failed in qsmaster in model xhgzd

Starting script : qsfinal
Starting time : Wed Sep 5 17:15:43 UTC 2012


/work/n02/n02/dh023729/xhgzd/bin/qsfinal: Model xhgzd - Error: No history files

Ending script : qsfinal
Completion code : 135
Completion time : Wed Sep 5 17:15:43 UTC 2012


/work/n02/n02/dh023729/xhgzd/bin/qsmaster: failed in final in model xhgzd


qsexecute: %MODEL% output follows:-

qsexecute : error loadmodule /work/n02/n02/dh023729/xhgzd/bin/xhgzd.exe not found or has wrong permissions
0+1 records in
0+1 records out
220664 bytes (221 kB) copied, 0.00134958 s, 164 MB/s

The .leave file is accessible and is located at:

I also read a previous ticket whose problem was solved by changing $UMSETUP to hg6.6.3. I checked $UMSETUP in my .profile, and I am still using vn6.1. Could this be the reason why my job failed?


Hi Liang,

You have selected "run from existing executable" in the compile options for the model. This executable does not exist. There are two ways forward: either point to an existing executable, or select "compile and build the executable name below, then run".
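
Before resubmitting, it can help to confirm whether the executable is really missing or only lacks execute permission, since the qsexecute message covers both cases. The following is a minimal Python sketch, assuming it is run on the machine where the file lives. The helper name `check_executable` is hypothetical, and the path is the one reported in the qsexecute error above:

```python
import os

def check_executable(path):
    """Return a short status string saying whether `path` can be run."""
    if not os.path.exists(path):
        return "missing"
    if not os.path.isfile(path):
        return "not a regular file"
    if not os.access(path, os.X_OK):
        return "not executable"
    return "ok"

if __name__ == "__main__":
    # Path taken from the qsexecute error message in the .leave file.
    exe = "/work/n02/n02/dh023729/xhgzd/bin/xhgzd.exe"
    print(exe, "->", check_executable(exe))
```

A result of "missing" matches the diagnosis above; "not executable" would instead point to a permissions problem.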



I changed the option in the UMUI to "compile and build the executable name below, then run".

But now the job does not get past compilation. The .leave file:



Hi Liang,

The job uses a hand_edit, hector_q, that modifies the tic code. You need to take a copy of this and modify it to match your own tic code. It is clear that the original job could not have worked as it stands, so you might want to check this too.



I copied the hand_edit file 'hector_q' to my own directory and changed the tic code to my own. But the compilation failed again. The .leave file is located at:

Any idea about this?


Hi Liang,

Because the previous compile failed, it has left the file


on HECToR. You need to delete this first and then try again.



Thank you, Willie. The model now compiles, but fails when it runs.

It looks like the model failed to read in some ancillary files. The .leave file is in:

Any idea how to fix this?


Hi Liang,

If you look in the .leave file, you will see,

lib-5016 : UNRECOVERABLE library error 
  An EOF or EOD has been encountered unexpectedly.

Encountered during a sequential unformatted READ from unit 58
Fortran unit 58 is connected to a sequential unformatted  file:

So there is a problem with this file.



I replaced those UKCA ancillary files. But the model then failed with a diagnostic error.

UM ERROR (Model aborting) :
Routine generating error: U_MODEL
Error code: 4
Error message:

ACUMPS: Diagnostic error. See output for item no.


Rank 0 [Mon Sep 24 15:22:08 2012] [c8-1c0s5n3] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0
_pmiu_daemon(SIGCHLD): [NID 00939] [c8-1c0s5n3] [Mon Sep 24 15:22:08 2012] PE RANK 0 exit signal Aborted
[NID 00939] 2012-09-24 15:22:08 Apid 2774064: initiated application termination
diff: /work/n02/n02/dh023729/tmp/tmp.hector-xe6-13.7185/xhgzd.xhist: No such file or directory
qsexecute: Copying /work/n02/n02/dh023729/xhgzd/xhgzd.thist to backup thist file /work/n02/n02/dh023729/xhgzd/xhgzd.thist_keep
xhgzd: Run failed

The .leave file is located at:

Any idea? Thanks.

Hi Liang,

In the .leave file there is the error

 ERROR: checksum failure in climate mean
 Section  0  item  120
 This can be due to invalid values in field, or corruption of partial sum file
 Remove or fix diagnostic, and rerun

This item, "OT18: DIATOM-CHLORO; CFC12; EXTRA-C", is present in your ocean STASH. This is known to cause problems on HECToR, so you need to switch it off. This is probably causing the other errors.
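
If the .leave file is long, the section and item numbers can be pulled out automatically rather than by eye. This is a rough Python sketch, assuming the message layout is exactly as quoted above, with the "Section ... item ..." line directly after the checksum error line. The function name `failing_stash_items` is hypothetical:

```python
import re

# Matches the "Section  0  item  120" line that follows the checksum error.
SECTION_ITEM = re.compile(r"Section\s+(\d+)\s+item\s+(\d+)")

def failing_stash_items(leave_text):
    """Return sorted unique (section, item) pairs reported after a
    'checksum failure in climate mean' error line."""
    found = set()
    lines = leave_text.splitlines()
    for i, line in enumerate(lines):
        if "checksum failure in climate mean" in line:
            # Assumption: the section/item line comes directly after the error.
            if i + 1 < len(lines):
                match = SECTION_ITEM.search(lines[i + 1])
                if match:
                    found.add((int(match.group(1)), int(match.group(2))))
    return sorted(found)

sample = """ ERROR: checksum failure in climate mean
 Section  0  item  120
 This can be due to invalid values in field, or corruption of partial sum file
 Remove or fix diagnostic, and rerun
"""
print(failing_stash_items(sample))  # prints [(0, 120)]
```

Each pair returned is a STASH diagnostic to look up and, if it is the problem item, switch off.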



The model is finally running, thanks. However, after the model stopped because the disk quota was exceeded, I had a problem resubmitting it as a continuation run.

I modified the SUBMIT file in /home/dh023729/umui_jobs/xhgzd on PUMA, changing STEP=2 to STEP=4.

But the submission failed. The error output is:
Calling FCM_MAIN_SCR - local…
(This may take several minutes.)

FCM_MAIN: Calling Extract …
Base extract: OK
Model extract: OK
Reconfiguration extract: OK
FCM_MAIN: Extract OK

FCM_MAIN: Submitting umuisubmit_clr …
qsub: script file:: No such file or directory
FCM_MAIN: Submit failed

Cray PrgEnv already loaded

Any idea?
Thanks very much,


Hi Liang,

I've not used continuations myself; you can find instructions in "How do I do a UM continuation run". I suspect that you might need to specify the last good dump to start from.

Other points are,

  • The last four leave files indicate "Global Net CO2 Flux into ocean - 2nd C NaN", so you may like to investigate this.
  • It may be useful to switch archiving on; see the "Rough Guide to UM archiving".
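
To see how often and where NaN values appear, the leave files can be scanned for lines containing a NaN token. This is only a sketch in Python: the function name `nan_lines` is hypothetical, and it assumes NaN is printed as a separate word, as in the message quoted above. Leave file paths are passed on the command line:

```python
import re
import sys

# Match NaN as a standalone token, so words such as "nanometre" are ignored.
NAN_TOKEN = re.compile(r"(?<![A-Za-z0-9_])nan(?![A-Za-z0-9_])", re.IGNORECASE)

def nan_lines(text):
    """Return (line_number, stripped_line) for every line containing NaN."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if NAN_TOKEN.search(line)
    ]

if __name__ == "__main__":
    # Usage: python find_nan.py <leave file> [<leave file> ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for number, line in nan_lines(handle.read()):
                print(f"{path}:{number}: {line}")
```

The earliest leave file in which NaN appears gives a starting point for finding which field went bad first.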



