Opened 9 years ago

Closed 8 years ago

#750 closed help (fixed)

xghkz repeat of old job (xghkc) now crashing

Reported by: kjp Owned by: um_support
Component: UM Model Keywords:
Cc: Platform:
UM Version: 6.1

Description

I am rerunning a job (xghkz, a repeat of xghkc) using UM 6.1 on HECToR phase 2b, but it now crashes during the reconfiguration stage with a library error about a namelist read that has me baffled. I will attempt to attach the .leave file, but the relevant lines seem to be:

/work/n02/n02/hum/vn6.1/pathscale/scripts/qssetup: Job terminated normally

/work/n02/n02/kjp/tmp/tmp.hector-xe6-14.31448/modscr_xghkl/qsexecute: Executing
dump reconfiguration program /work/n02/n02/kjp/xghkl/reconf.exe

lib-4324 : UNRECOVERABLE library error

The variable name 'INTERNAL_MODEL_LIST' is unrecognized in namelist input.

Encountered during a namelist READ from unit 10
Fortran unit 10 is connected to a sequential formatted text file
lib-4324 : UNRECOVERABLE library error

The variable name 'INTERNAL_MODEL_LIST' is unrecognized in namelist input.

:

"/work/n02/n02/kjp/tmp/tmp.hector-xe6-14.31448/xghkl.recona"

Encountered during a namelist READ from
lib-4324 : UNRECOVERABLE library error

The variable name 'INTERNAL_MODEL_LIST' is unrecognized in namelist input.
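
The same message repeats several times in the excerpt above. As a minimal sketch (not part of the original report), the distinct unrecognized namelist variables could be pulled out of a .leave file as follows. The function name `unrecognized_namelist_vars` is hypothetical, and the regular expression assumes the exact message wording shown above.

```python
import re

# Assumption: the message wording matches the excerpt in this ticket exactly.
PATTERN = re.compile(
    r"The variable name '([^']+)' is unrecognized in namelist input"
)


def unrecognized_namelist_vars(text):
    """Return distinct variable names reported as unrecognized,
    in order of first appearance."""
    seen = []
    for match in PATTERN.finditer(text):
        name = match.group(1)
        if name not in seen:
            seen.append(name)
    return seen


# Sample text copied from the excerpt above.
sample = """lib-4324 : UNRECOVERABLE library error
The variable name 'INTERNAL_MODEL_LIST' is unrecognized in namelist input.
Encountered during a namelist READ from unit 10
lib-4324 : UNRECOVERABLE library error
The variable name 'INTERNAL_MODEL_LIST' is unrecognized in namelist input.
"""

print(unrecognized_namelist_vars(sample))
```

For a real file, the text would be read with `open(path).read()` and passed to the same function.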

Attachments (1)

xghkl.fail (28.1 KB) - added by kjp 9 years ago.
xghkl.leave file: it differs only in the STASH specification, crashes in the same way


Change History (11)


comment:1 Changed 9 years ago by willie

Hi,

Please note that HECToR is being upgraded at the moment. Changes that users need to make can be found on the web page at

http://cms.ncas.ac.uk/index.php/component/content/article/22/1583-hectorphase3

Regards,

Willie

comment:2 Changed 9 years ago by kjp

I have changed my .profile to TARGET_MC=cce and the machine name to phase3 (correct as of this morning?) but I now fail in the reconfiguration stage with

gc_abort (Processor 0 ): Cannot set GC_FORCE_BITREP - value unrecognised
gc_abort (Processor 1 ): Cannot set GC_FORCE_BITREP - value unrecognised
gc_abort (Processor 17 ): Cannot set GC_FORCE_BITREP - value unrecognised…etc

I presume this flag is meant to force bit reproducibility with future runs under this compiler.
This failure also happens when running a global job from a Met Office start dump, not just a LAM.

Sorry if I should have opened a new ticket.

comment:3 Changed 9 years ago by jeff

Hi

An extra mod is now needed for 6.1 on HECToR; I've added it to the standard f77 PUM mod, pum_full_6.1.mf77. If you recompile both the reconfiguration and the UM, it should hopefully get past this problem.

Jeff.

comment:4 Changed 9 years ago by kjp

Unfortunately, the GC_FORCE_BITREP error is still there even with a recompile.

KJP

comment:5 Changed 9 years ago by jeff

I need to have a look at your output files, but I can't because the permissions do not allow it. Can you run these commands to give me read access to your files?

chmod -R g+rX /home/n02/n02/kjp
chmod -R g+rX /work/n02/n02/kjp

Jeff.
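
For anyone unsure what `g+rX` changes: it adds group read permission everywhere, and adds group execute (search) permission only on directories and on files that are already executable by someone. A rough Python sketch of that rule is below. The function `add_group_read_and_search` and the demo file names are hypothetical, and the sketch simply skips symbolic links rather than reproducing every detail of `chmod -R`.

```python
import os
import stat
import tempfile


def add_group_read_and_search(root):
    """Approximate `chmod -R g+rX root` (symbolic links are skipped)."""

    def fix(path):
        if os.path.islink(path):
            return
        mode = os.stat(path).st_mode
        new_mode = mode | stat.S_IRGRP  # g+r on everything
        any_exec = mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
        if stat.S_ISDIR(mode) or any_exec:
            new_mode |= stat.S_IXGRP  # g+X: directories and executables only
        os.chmod(path, stat.S_IMODE(new_mode))

    fix(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            fix(os.path.join(dirpath, name))


def demo():
    """Build a small private tree, apply the rule, return the resulting modes."""
    with tempfile.TemporaryDirectory() as root:
        paths = {
            "data.txt": os.path.join(root, "data.txt"),
            "run.sh": os.path.join(root, "run.sh"),
            "sub": os.path.join(root, "sub"),
        }
        open(paths["data.txt"], "w").close()
        open(paths["run.sh"], "w").close()
        os.mkdir(paths["sub"])
        os.chmod(paths["data.txt"], 0o600)  # plain file, owner only
        os.chmod(paths["run.sh"], 0o700)    # executable script, owner only
        os.chmod(paths["sub"], 0o700)       # directory, owner only
        add_group_read_and_search(root)
        return {k: stat.S_IMODE(os.stat(p).st_mode) for k, p in paths.items()}


modes = demo()
for name, mode in modes.items():
    print(name, oct(mode))
```

The plain file gains only group read, while the script and the directory also gain group execute, which is what lets a group member descend into the directories.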

comment:6 Changed 9 years ago by kjp

Done!

Thanks for looking at this.

KJP

comment:7 Changed 9 years ago by jeff

Hi

It turned out that another mod for the f90 code is also needed. I've now added this to the standard f90 PUM mod, pum_full_6.1.mf90. If you recompile your code again, it should hopefully work this time.

Jeff.

comment:8 Changed 9 years ago by kjp

Thanks, that seems to have successfully cleared up that problem. Unfortunately, while it gets through the reconfiguration stage, it doesn't seem to actually go on to run the UM itself but just exits!

/home/n02/n02/kjp/um/umui_out/xghkz000.xghkz.d11346.t140952.leave
/home/n02/n02/kjp/um/umui_out/xghkz000.xghkz.d11346.t140952.comp.leave

comment:9 Changed 9 years ago by jeff

In your leave file you have this error

aprun: -N cannot exceed -n

You have -N set to 32, which is the number of cores per node; this used to be 24 on phase2b but has changed on the new system. -n is set to 24, which is the number of cores you have set in the UMUI to run your model with. As the error message says, -n needs to be at least 32, so change the processor configuration in panel "User Information and Target Machine → Target Machine" from 6x4 to 8x4.

Jeff.
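
The arithmetic behind that advice can be sketched in a few lines of Python. This is only an illustration: `check_aprun` is a hypothetical helper, and it assumes that -n is the product of the two numbers in the processor configuration (6x4 giving 24), as described above.

```python
def check_aprun(ew, ns, per_node):
    """Check a processor decomposition against the aprun -N value.

    ew, ns   -- the two factors of the processor configuration (e.g. 6 and 4)
    per_node -- the value passed to aprun -N (32 on the new system)
    Returns the total passed to -n and either "ok" or the aprun error text.
    """
    total = ew * ns  # assumption: this product is what ends up as aprun -n
    if per_node > total:
        return total, "aprun: -N cannot exceed -n"
    return total, "ok"


print(check_aprun(6, 4, 32))  # old configuration on the new system
print(check_aprun(8, 4, 32))  # suggested configuration
```

With -N at 32, the 6x4 configuration trips the check because 24 is less than 32, while 8x4 gives exactly 32 and passes.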

comment:10 Changed 8 years ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed