Opened 8 years ago

Closed 7 years ago

#906 closed help (fixed)

job failed (xhgzd)

Reported by: dh023729
Owned by: willie
Component: UM Model
Keywords: hand edit, tic code, checksum error, climate mean, OT18: DIATOM-CHLORO; CFC12; EXTRA-C
Cc:
Platform: HECToR
UM Version: 6.6.3

Description

Hello,

My job (xhgzd) failed to run, and I cannot figure out what is wrong. I cannot find any explicit error message in the .leave file, only information like:

/work/n02/n02/dh023729/xhgzd/bin/qsmaster: Failed in qsmaster in model xhgzd
*

Starting script : qsfinal
Starting time : Wed Sep 5 17:15:43 UTC 2012

*

/work/n02/n02/dh023729/xhgzd/bin/qsfinal: Model xhgzd - Error: No history files
*

Ending script : qsfinal
Completion code : 135
Completion time : Wed Sep 5 17:15:43 UTC 2012

*

/work/n02/n02/dh023729/xhgzd/bin/qsmaster: failed in final in model xhgzd

And:

qsexecute: %MODEL% output follows:-

qsexecute : error loadmodule /work/n02/n02/dh023729/xhgzd/bin/xhgzd.exe not found or has wrong permissions
0+1 records in
0+1 records out
220664 bytes (221 kB) copied, 0.00134958 s, 164 MB/s

The .leave file is accessible and is located at:
/home/n02/n02/dh023729/um/umui_out/xhgzd000.xhgzd.d12249.t181042.leave

Also, I read a previous ticket whose problem was solved by changing $UMSETUP to hg6.6.3. I checked $UMSETUP in my .profile, and I am still using vn6.1. Could this be the reason why my job failed?

Thanks,
Liang

Change History (12)

comment:1 Changed 8 years ago by willie

Hi Liang,

You have selected "run from existing executable" in the compile options for the model. This executable does not exist. There are two ways forward: either point to an existing executable, or select "compile and build the executable name below, then run".
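The check behind this diagnosis can be sketched in a few lines of Python. This is only an illustration: the helper name `check_executable` is hypothetical, and it simply mirrors the two conditions named in the qsexecute message (file not found, or wrong permissions).

```python
import os


def check_executable(path):
    """Return a short status string for `path` (hypothetical helper).

    "missing"        : no regular file at that path
    "not executable" : file exists but the caller may not execute it
    "ok"             : file exists and is executable
    """
    if not os.path.isfile(path):
        return "missing"
    if not os.access(path, os.X_OK):
        return "not executable"
    return "ok"
```

For example, `check_executable("/work/n02/n02/dh023729/xhgzd/bin/xhgzd.exe")` would be expected to report `"missing"` if the job was set to run from an executable that was never built.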

Regards

Willie

comment:2 Changed 8 years ago by dh023729

Hi,

I changed the option in the UMUI to "compile and build the executable name below, then run".

Now the job fails during compilation. The .leave file is:

/home/n02/n02/dh023729/um/umui_out/xhgzd000.xhgzd.d12257.t165657.comp.leave

Thanks,
Liang

comment:3 Changed 8 years ago by willie

  • Keywords hand edit, tic code added
  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Liang,

The job uses a hand_edit, hector_q, that modifies the tic code. You need to take a copy of this and modify it to match your own tic code. It is clear that the original job could not have worked as it stands, so you might want to check this too.

Regards,

Willie

comment:4 Changed 8 years ago by dh023729

Hi,

I copied the hand_edit file 'hector_q' to my own directory and changed the tic code to my own. However, the compilation failed again. The .leave file is located at:
/home/n02/n02/dh023729/um/umui_out/xhgzd000.xhgzd.d12260.t235826.comp.leave

Any idea about this?

Thanks,
Liang

comment:5 Changed 8 years ago by willie

Hi Liang,

Because the previous compile failed, it has left the file

/home/n02/n02/dh023729/xhgzd/ummodel/fcm.bld.lock

on HECToR. You need to delete this first and then try again.
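A minimal sketch of that clean-up step, assuming no other build is still running in the same directory (removing a lock under a live build would be unsafe). The helper name `remove_stale_lock` is hypothetical. Handling the lock as either a file or a directory is a precaution added here, not something stated in this ticket.

```python
import shutil
from pathlib import Path


def remove_stale_lock(build_dir, lock_name="fcm.bld.lock"):
    """Remove a lock left behind by a failed build.

    Returns True if a lock was found and removed, False if none existed.
    Only call this when you are sure no build is currently running.
    """
    lock = Path(build_dir) / lock_name
    if lock.is_dir():
        shutil.rmtree(lock)   # lock stored as a directory
    elif lock.exists():
        lock.unlink()         # lock stored as a plain file
    else:
        return False
    return True
```

Here it would be called as `remove_stale_lock("/home/n02/n02/dh023729/xhgzd/ummodel")` before resubmitting the compile.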

regards,

Willie

comment:6 Changed 8 years ago by dh023729

Hi,

Thank you, Willie. The model now compiles, but it fails when it runs.

It looks like the model failed to read in some ancillary files. The .leave file is at:
/home/n02/n02/dh023729/um/umui_out/xhgzd000.xhgzd.d12261.t153107.leave

Any idea how to fix this?

Thanks,
Liang

comment:7 Changed 8 years ago by willie

Hi Liang,

If you look in the .leave file, you will see,

lib-5016 : UNRECOVERABLE library error 
  An EOF or EOD has been encountered unexpectedly.

Encountered during a sequential unformatted READ from unit 58
Fortran unit 58 is connected to a sequential unformatted  file:
  "/work/n02/n02/odarbysh/HG2_ancils/UKCA/tropdata/photol/jmhp.bin"

So there is a problem with this file.
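One way to test whether such a file is truncated or has an unexpected layout is to walk its record markers. The sketch below rests on assumptions that this ticket does not confirm: that the file uses 4-byte record-length markers before and after each record (a common compiler default), and that the byte order is little-endian (pass `endian=">"` to try big-endian). The function name `check_fortran_records` is hypothetical.

```python
import struct


def check_fortran_records(path, endian="<"):
    """Walk a Fortran sequential unformatted file record by record.

    Assumes 4-byte leading and trailing record-length markers.
    Returns (complete_records, ended_cleanly).
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            head = f.read(4)
            if not head:
                return count, True       # clean end on a record boundary
            if len(head) < 4:
                return count, False      # file ends inside a marker
            (length,) = struct.unpack(endian + "i", head)
            if length < 0:
                return count, False      # implausible record length
            f.seek(length, 1)            # skip the record payload
            tail = f.read(4)
            if len(tail) < 4 or struct.unpack(endian + "i", tail)[0] != length:
                return count, False      # truncated or mismatched record
            count += 1
```

A result such as `(3, False)` would suggest the file stops partway through its fourth record, or that the assumed byte order or marker size is wrong.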

Regards,

Willie

comment:8 Changed 8 years ago by dh023729

Hi,

I replaced those UKCA ancillary files, but the model now fails with a diagnostic error:

*
UM ERROR (Model aborting) :
Routine generating error: U_MODEL
Error code: 4
Error message:

ACUMPS: Diagnostic error. See output for item no.

*

Rank 0 [Mon Sep 24 15:22:08 2012] [c8-1c0s5n3] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0
_pmiu_daemon(SIGCHLD): [NID 00939] [c8-1c0s5n3] [Mon Sep 24 15:22:08 2012] PE RANK 0 exit signal Aborted
[NID 00939] 2012-09-24 15:22:08 Apid 2774064: initiated application termination
diff: /work/n02/n02/dh023729/tmp/tmp.hector-xe6-13.7185/xhgzd.xhist: No such file or directory
qsexecute: Copying /work/n02/n02/dh023729/xhgzd/xhgzd.thist to backup thist file /work/n02/n02/dh023729/xhgzd/xhgzd.thist_keep
xhgzd: Run failed

The .leave file is located at:
/home/n02/n02/dh023729/um/umui_out/xhgzd000.xhgzd.d12268.t144224.leave

Any idea? Thanks.
Liang

comment:9 Changed 8 years ago by willie

  • Keywords code, checksum error, climate mean, OT18: DIATOM-CHLORO; CFC12; EXTRA-C added; code removed

Hi Liang,

In the .leave file there is the error

 ERROR: checksum failure in climate mean
 Section  0  item  120
 This can be due to invalid values in field, or corruption of partial sum file
 Remove or fix diagnostic, and rerun

This item, "OT18: DIATOM-CHLORO; CFC12; EXTRA-C", is present in your ocean STASH. This is known to cause problems on HECToR, so you need to switch it off. This is probably causing the other errors.
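Since the ACUMPS message only says "See output for item no.", it can help to pull the section and item numbers out of the .leave file automatically. The following is a rough sketch that assumes the message layout quoted above. The names `CHECKSUM_RE` and `find_checksum_failures` are hypothetical.

```python
import re

# Matches the two-line message quoted in this ticket:
#   ERROR: checksum failure in climate mean
#   Section  0  item  120
CHECKSUM_RE = re.compile(
    r"checksum failure in climate mean\s+Section\s+(\d+)\s+item\s+(\d+)"
)


def find_checksum_failures(leave_text):
    """Return a list of (section, item) pairs reported as checksum failures."""
    return [(int(sec), int(item)) for sec, item in CHECKSUM_RE.findall(leave_text)]
```

Each pair returned can then be looked up in the STASH panel to find the diagnostic to switch off.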

Regards,

Willie

comment:10 Changed 8 years ago by dh023729

Hi,

The model is finally running, thanks. However, the model stopped because the disk quota was exceeded, and I had a problem resubmitting it as a continuation run.

I modified the SUBMIT file in /home/dh023729/umui_jobs/xhgzd on PUMA.
Change TYPE=NRUN to TYPE=CRUN
Change STEP=2 to STEP=4
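For reference, the two edits described above can be expressed as a small text transformation. This sketch assumes the settings appear as plain `TYPE=NRUN` and `STEP=2` assignments at the start of a line, and the function name `to_continuation` is hypothetical. It only reproduces the edits and says nothing about whether they are sufficient for a continuation run.

```python
import re


def to_continuation(submit_text):
    """Apply the two SUBMIT edits: TYPE=NRUN -> TYPE=CRUN and STEP=2 -> STEP=4."""
    text = re.sub(r"^([ \t]*TYPE=)NRUN\b", r"\g<1>CRUN", submit_text, flags=re.M)
    text = re.sub(r"^([ \t]*STEP=)2\b", r"\g<1>4", text, flags=re.M)
    return text
```

Other assignments, and values such as `STEP=20`, are left untouched.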

However, the submission failed with the following output:
Calling FCM_MAIN_SCR - local…
(This may take several minutes.)

FCM_MAIN: Calling Extract …
Base extract: OK
Model extract: OK
Reconfiguration extract: OK
FCM_MAIN: Extract OK

FCM_MAIN: Submitting umuisubmit_clr …
qsub: script file:: No such file or directory
FCM_MAIN: Submit failed

Cray PrgEnv already loaded

Any idea?
Thanks very much,

Liang

comment:11 Changed 7 years ago by willie

Hi Liang,

I've not used continuations myself - you can find instructions at "How do I do a UM continuation run". I suspect that you might need to specify the last good dump to start from.

Other points are,

  • The last four leave files indicate "Global Net CO2 Flux into ocean - 2nd C NaN", so you may like to investigate this.
  • It may be useful to switch archiving on - see "Rough Guide to UM archiving".

Regards,

Willie

comment:12 Changed 7 years ago by willie

  • Platform set to HECToR
  • Resolution set to fixed
  • Status changed from accepted to closed