Opened 11 years ago

Closed 11 years ago

#284 closed help (fixed)

error in nupdate after job transfer from hpcx

Reported by: alexrap Owned by: lois
Component: UM Model Keywords:
Cc: Platform:
UM Version:

Description

I have transferred one of my HPCX jobs to Hector and after changing the environment variables and submitting the job (xecol), I get this error in the .comp.leave file:


* alexrap Job: 302839.sdb starts: 02/06/09 14:55:13 node: nid00008 *
* alexrap Job: 302839.sdb starts: 02/06/09 14:55:13 node: nid00008 *
* alexrap Job: 302839.sdb starts: 02/06/09 14:55:13 node: nid00008 *
* alexrap Job: 302839.sdb starts: 02/06/09 14:55:13 node: nid00008 *

User may access requested budget


*

Version 6.1 template, Unified Model , Non-Operational
Created by UMUI version 6.1

*
PGI PrgEnv already loaded
Currently Loaded Modulefiles:

 1) modules/3.1.6             11) xt-pe/2.1.56HD
 2) MySQL/5.0.45              12) xt-asyncpe/2.4
 3) pbs/8.1.4                 13) PrgEnv-pgi/2.1.56HD
 4) packages                  14) xt-service/2.1.56HD
 5) pgi/7.1.4                 15) xt-libc/2.1.56HD
 6) totalview-support/1.0.6   16) xt-os/2.1.56HD
 7) xt-totalview/8.6.0        17) xt-boot/2.1.56HD
 8) fftw/3.1.1                18) xt-lustre-ss/2.1.56HD_1.6.5
 9) xt-libsci/10.3.2          19) xtpe-target-cnl
10) xt-mpt/3.1.0              20) Base-opts/2.1.56HD
Directory for modified UM scripts added to path

PATH USED = /opt/pgi/7.1.4/linux86-64/7.1/bin:/opt/xt-lustre-ss/2.1.56HD_1.6.5/usr/sbin:/opt/xt-lustre-ss/2.1.56HD_1.6.5/usr/bin:/opt/xt-boot/2.1.56HD/bin/snos64:/opt/xt-os/2.1.56HD/bin/snos64:/opt/xt-service/2.1.56HD/bin/snos64:/opt/xt-prgenv/2.1.56HD/bin:/opt/cray/xt-asyncpe/2.4/bin:/opt/xt-pe/2.1.56HD/bin/snos64:/opt/xt-pe/2.1.56HD/cnos/linux/64/bin:/opt/fftw/3.1.1/cnos/bin:/opt/toolworks/totalview.8.6.0/bin:/opt/totalview-support/1.0.6/bin:/opt/pbs/8.1.4/bin:/opt/MySQL/5.0.45/etc:/opt/MySQL/5.0.45/libexec:/opt/MySQL/5.0.45/bin:/opt/modules/3.1.6/bin:/home/n02/n02/alexrap/bin:/usr/local/bin:/usr/bin:/usr/X11R6/bin:/bin:/usr/games:/opt/gnome/bin:/opt/kde3/bin:/usr/lib/jvm/jre/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/opt/pathscale/bin:.:/usr/lib/qt3/bin:/work/n02/n02/hum/vn6.1/pgi/utils:/work/n02/n02/hum/bin::/work/n02/n02/hum/vn6.1/bin:/work/n02/n02/hum/umcet/normal/bin:/work/n02/n02/hum/vn6.1/pgi/utils:/work/n02/n02/hum/bin::/work/n02/n02/hum/vn6.1/bin:/work/n02/n02/alexrap/tmp/tmp.nid00008.28584/modscr_xecol:/work/n02/n02/hum/vn6.1/pgi/scripts:/work/n02/n02/hum/vn6.1/pgi/exec
The following script modsets will be used:
$PUM_MODS61/pum_full_6.1_ksh_comp.mu
$PUM_MODS61/script_archfix.mu
$PUM_MODS61/um_archive61_hector.mu
$MY_SCRIPT_MODS/script_improve.mu
End of List

Completed with 9 error(s) and 10 warning(s).
updscripts: Error in nupdate command
updscripts: Nupdate command was :-
pumscm -p /work/n02/n02/hum/vn6.1/pgi/source/umsl -i /work/n02/n02/alexrap/tmp/tmp.nid00008.28584/xecol.updates -d -F -M


Resources requested: cput=01:00:00,mpparch=XT,mpphost=none,mppnppn=1,mppwidth=0,ncpus=1,place=pack
Resources allocated: cpupercent=0,cput=00:00:01,mem=2424kb,ncpus=1,vmem=26152kb,walltime=00:00:04

* alexrap Job: 302839.sdb ends: 02/06/09 14:55:17 queue: serial_1h *
* alexrap Job: 302839.sdb ends: 02/06/09 14:55:17 queue: serial_1h *
* alexrap Job: 302839.sdb ends: 02/06/09 14:55:17 queue: serial_1h *
* alexrap Job: 302839.sdb ends: 02/06/09 14:55:17 queue: serial_1h *


Any ideas on how this could be solved?

Thanks,
Alex.

Attachments (1)

xecol000.xecol.d09155.t164309.comp.leave (178.2 KB) - added by alexrap 11 years ago.


Change History (22)

comment:1 Changed 11 years ago by lois

  • Owner changed from um_support to lois
  • Status changed from new to assigned

Hello Alex,

when transferring your job from HPCx to HECToR there are certain mods that you must include. These are

script mods : $PUM_MODS61/pum_full_6.1.mu

Reconfiguration and model mods : $PUM_MODS61/pum_full_6.1.mc
$PUM_MODS61/pum_full_6.1.mf77
$PUM_MODS61/pum_full_6.1.mf90

Have you also checked that you have the following lines in your .profile on HECToR

UMDIR=/work/n02/n02/hum
#TARGET_MC=pgi
TARGET_MC=pathscale
#UMSETUP=$UMDIR/setvars_4.5 ; export UMSETUP
UMSETUP=$UMDIR/vn6.1/$TARGET_MC/scripts/.umsetvars_6.1 ; export UMSETUP
#UMSETUP=$UMDIR/vn7.1/$TARGET_MC/scripts/.umsetvars_7.1 ; export UMSETUP
if [ -f $UMSETUP ]
then
  . $UMSETUP # set up UM environment variables
fi
loadcomp $TARGET_MC
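
One way to confirm that these lines point at something real is to check, from a fresh login shell, that the setup file exists for the chosen compiler. This is only a sketch: the function name check_umsetup is made up for illustration, and it assumes the directory layout implied by the UMSETUP line above.

```shell
# Hypothetical helper: report whether the UM setup file exists for a compiler.
# Arguments: UM root directory, compiler target (pgi or pathscale), UM version.
check_umsetup() {
    umdir=$1
    target=$2
    version=$3
    setup="$umdir/vn$version/$target/scripts/.umsetvars_$version"
    if [ -f "$setup" ]; then
        echo "found $setup"
    else
        echo "missing $setup"
        return 1
    fi
}

# Example call on HECToR (values taken from the .profile lines above):
# check_umsetup /work/n02/n02/hum pathscale 6.1
```

A "missing" result would suggest that TARGET_MC or the version in the UMSETUP line does not match what is installed under UMDIR.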

Let me know if there are still problems .

Lois

comment:2 Changed 11 years ago by alexrap

Hello Lois,

I did all that and still get this error:

The following script modsets will be used:
$PUM_MODS61/pum_full_6.1_ksh_comp.mu
$PUM_MODS61/script_archfix.mu
$PUM_MODS61/um_archive61_hector.mu
$MY_SCRIPT_MODS/script_improve.mu
$PUM_MODS61/pum_full_6.1.mu
End of List

Completed with 84 error(s) and 48 warning(s).
updscripts: Error in nupdate command
updscripts: Nupdate command was :-
pumscm -p /work/n02/n02/hum/vn6.1/pathscale/source/umsl -i /work/n02/n02/alexrap/tmp/tmp.nid00011.7434/xecol.updates -d -F -M


Resources requested: cput=01:00:00,mpparch=XT,mpphost=none,mppnppn=1,mppwidth=0,ncpus=1,place=pack
Resources allocated: cpupercent=0,cput=00:00:01,mem=2308kb,ncpus=1,vmem=24580kb,walltime=00:00:03

* alexrap Job: 303245.sdb ends: 03/06/09 09:09:22 queue: serial_1h *
* alexrap Job: 303245.sdb ends: 03/06/09 09:09:22 queue: serial_1h *
* alexrap Job: 303245.sdb ends: 03/06/09 09:09:22 queue: serial_1h *
* alexrap Job: 303245.sdb ends: 03/06/09 09:09:22 queue: serial_1h *


"xecol000.xecol.d09154.t090859.comp.leave" 51L, 3695C

Alex.

comment:3 Changed 11 years ago by lois

Hello Alex,

I don't have permission to see your files on HECToR so it is difficult to test my suggestions.

Looking at your job xecol you need to change the script mods in the UMUI

do not include $PUM_MODS61/pum_full_6.1_ksh_comp.mu

do not include $PUM_MODS61/script_archfix.mu

I can't tell whether you need $MY_SCRIPT_MODS/script_improve.mu

Leave in this mod I think $PUM_MODS61/um_archive61_hector.mu

This mod is essential $PUM_MODS61/pum_full_6.1.mu

This should get you further.
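
To confirm which script modsets a resubmitted job actually picked up, the list can be pulled out of the .comp.leave file. A minimal sketch, assuming the list is always bracketed by the "The following script modsets will be used:" and "End of List" lines seen in the output earlier in this ticket; the function name list_script_mods is invented here.

```shell
# Hypothetical helper: print the script modsets recorded in a .comp.leave file.
# Assumes the list sits between the two marker lines shown in the job output.
list_script_mods() {
    awk '
        /^The following script modsets will be used:/ { inlist = 1; next }
        /^End of List/                                { inlist = 0 }
        inlist && NF                                  { print $1 }
    ' "$1"
}

# Example call (file name is a placeholder):
# list_script_mods my_job.comp.leave
```

If pum_full_6.1_ksh_comp.mu or script_archfix.mu still appears in the output, the UMUI change has not reached the submitted job.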

Lois


comment:4 Changed 11 years ago by alexrap

Hi Lois,

I tried both with and without $MY_SCRIPT_MODS/script_improve.mu and the compilation failed in both cases. It goes a bit further though.

The .leave files are:
/home/n02/n02/alexrap/um/umui_out/xecol000.xecol.d09155.t163226.comp.leave with that script mod included

and

/home/n02/n02/alexrap/um/umui_out/xecol000.xecol.d09155.t164309.comp.leave without it.

Is there a way I could give you access to my files on Hector? I thought everyone should be able to read them.

Alex.

comment:5 Changed 11 years ago by lois

You need to allow group access to your files on HECToR. A simple way to do this is to add the line

umask 022

to your .profile file on HECToR.
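
Note that umask only affects files created after it is set; files that already exist keep their old permissions. The sketch below illustrates this in a scratch directory made with mktemp, so nothing in the real home directory is touched. The function name demo_umask is invented for the example.

```shell
# Show that umask changes the permissions of newly created files only.
# Runs in a subshell (note the parentheses) so the caller's umask is untouched.
demo_umask() (
    scratch=$(mktemp -d) || exit 1
    umask 077                      # restrictive mask: no group or other access
    touch "$scratch/old_file"
    umask 022                      # the mask suggested above
    touch "$scratch/new_file"
    # The first 10 characters of ls -l are the file type and permission bits.
    echo "old_file $(ls -l "$scratch/old_file" | cut -c1-10)"
    echo "new_file $(ls -l "$scratch/new_file" | cut -c1-10)"
    rm -rf "$scratch"
)

demo_umask
```

The file created under the old mask stays private, which is why existing output files may still be unreadable to the group after the .profile change.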

Hopefully then I can look at your output files and see if we can get a bit further.

Lois

comment:6 Changed 11 years ago by alexrap

I've added it to my .profile.

comment:7 Changed 11 years ago by lois

Hello Alex, I still can't see your current files. Could you use the command

chmod -R g+rx .

in your home directory on HECToR.
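
Whether the chmod reached everything can be checked with find. A sketch using octal permission tests; the function name not_group_accessible is invented for illustration.

```shell
# Hypothetical helper: list files and directories under a tree that the group
# still cannot read (bit 040), plus directories the group cannot enter (010).
not_group_accessible() {
    find "$1" \( ! -perm -040 -o \( -type d ! -perm -010 \) \) -print
}

# Example call from the home directory:
# not_group_accessible .
```

Empty output means every entry is group-readable. As an aside, chmod -R g+rx also marks ordinary files as group-executable; g+rX (capital X) limits the execute bit to directories and files that are already executable.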

Thanks
Lois

comment:8 Changed 11 years ago by alexrap

Hello Lois,

I've done it.

Alex.

comment:9 Changed 11 years ago by lois

Hello Alex,

your job now compiles and it is running. I don't know if it will actually work but this is as far as I have got today. I am on holiday next week.

The changes I made were to the mods

  • don't use $MODS61_GAM/gcg0n506.mf90 in the reconfiguration
  • don't use $HADGEM1A_MODS/aamdn602.mf90 N in the model
  • don't use $PUM_MODS61/pum_full_6.1.mf77 and $PUM_MODS61/fix_pol_fil.mf77

Could you change your job xecol similarly and let me know what happens?

Thanks

Lois

comment:10 Changed 11 years ago by alexrap

Hello Lois,

I changed my xecol job by removing those mods and it seems like it now compiled successfully.

However, the run has stopped and the .leave file says that there is something wrong with the namelist file. I don't really know what to do about it. Could you please have a look at it?

Thanks,
Alex.

Here are some lines from the .leave file:

*

Job started at : Mon Jun 8 15:54:13 BST 2009
Run started from UMUI
Running from control files in /home/n02/n02/alexrap/umui_runs/xecol-159153740

xcrwh with IWM diagnosticated
This job is running on machine nid00004,
using UM directory /work/n02/n02/hum,
and test directory /work/n02/n02/hum/umtest.
*

Starting script : qsexecute
Starting time : Mon Jun 8 15:54:13 BST 2009

*

/work/n02/n02/alexrap/tmp/tmp.nid00004.3524/modscr_xecol/qsexecute: Executing setup

/work/n02/n02/hum/vn6.1/pathscale/scripts/qssetup: Job terminated normally

/work/n02/n02/alexrap/tmp/tmp.nid00004.3524/modscr_xecol/qsexecute: Executing dump reconfiguration program /work/n02/n02/alexrap/xecol/agodc.recon

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 8
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 4
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 9
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 6
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 5
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 1

*
ERROR!!! in reconfiguration in routine Rcf_Read_Namelists
Error Code:- 40
Error Message:- Vertical Levels Namelist file does not exist!
Error generated from processor 0
*

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 2
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 3
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 7
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 10
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 11
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 12
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 13
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 14
aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 15
[NID 13129]Apid 190588: initiated application termination
/work/n02/n02/alexrap/tmp/tmp.nid00004.3524/modscr_xecol/qsexecute: Error in dump reconfiguration - see OUTPUT
*

Ending script : qsexecute
Completion code : 137
Completion time : Mon Jun 8 15:54:15 BST 2009

*

/work/n02/n02/alexrap/tmp/tmp.nid00004.3524/modscr_xecol/qsmaster: Failed in qsexecute in model xecol
*

Starting script : qsfinal
Starting time : Mon Jun 8 15:54:15 BST 2009

*

/work/n02/n02/hum/vn6.1/pathscale/scripts/qsfinal: Model xecol - Error: No history files
*

Ending script : qsfinal
Completion code : 135
Completion time : Mon Jun 8 15:54:15 BST 2009

*
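
In a long .leave file like this one, the informative part is the block that starts with "ERROR!!!"; the surrounding MPI_Abort lines appear to be the individual processes shutting down. A sketch that prints only those blocks, assuming each one is the ERROR line followed by the three detail lines seen above (error code, error message, processor). The name show_um_errors is invented.

```shell
# Hypothetical helper: print each "ERROR!!!" block from a UM .leave file.
show_um_errors() {
    awk '
        /^ERROR!!!/   { remaining = 4 }   # the ERROR line plus three detail lines
        remaining > 0 { print; remaining-- }
    ' "$1"
}

# Example call (file name is a placeholder):
# show_um_errors my_job.leave
```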

comment:11 Changed 11 years ago by lois

Sorry for delay due to holidays Alex.

In the UMUI if you go to the panel
atmosphere → model resolution → vertical

You will see that you have not changed the path for the file with the vertical levels. On HECToR the path should be

/work/n02/n02/hum/vn6.1/ctldata/vert

the file should be the same.

Lois

comment:12 Changed 11 years ago by alexrap

Hi Lois,

I did that and now the .leave file says:

*
ERROR!!! in reconfiguration in routine Rcf_Files_Init
Error Code:- 10
Error Message:- Failed to Open Start Dump
Error generated from processor 0
*
gc_abort (Processor 0 ): Job Aborted from Ereport

0+1 records in
0+1 records out
15535 bytes (16 kB) copied, 0.000619 seconds, 25.1 MB/s

Thanks,
Alex.

comment:13 Changed 11 years ago by alexrap

Hi Lois,

I looked a bit into the cause of that error and I changed the start dump directory, as that was definitely a problem. It is now /work/n02/n02/hum/vn6.1/dumps/n96_hadgem1. I hope this is right.

Anyway, I resubmitted the job and I now wait to see if it goes a bit further.

Alex.

comment:14 Changed 11 years ago by alexrap

Hi Lois,

The new error in the .leave file is:

Ancillary Files to be opened :

File No 2 Soil Moisture/Snow Depth
File No 3 Soil Temperatures
File No 4 Soil Types
File No 9 Land Sea Mask
File No 10 Orography
File No 15 User Ancillary - Single
File No 16 User Ancillary - Multi
File No 17 Natural SO2 Emissions
File No 20 Initial fractions of surface types
File No 21 Initial vegetation state
File No 26 Land Fraction
File No 27 Dust Soil Properties

Ancillary File does not exist.
File : /home/n02/n02/alexrap/um_files/ancils/qrclim.slt_new
Got ancillary file name from Env Var DSOILTMP
Ancillary File does not exist.
File : /home/n02/n02/alexrap/um_files/ancils/qrparm.soil_ceh_n96
Got ancillary file name from Env Var SOILTYPE
Ancillary File does not exist.
File : /home/n02/n02/alexrap/um_files/ancils/qrparm.orog_new
Got ancillary file name from Env Var OROG
Ancillary File does not exist.
File : /home/n02/n02/alexrap/um_files/ancils/POM_FF_BF_2000.N96.ancil
Got ancillary file name from Env Var USRANCIL
Ancillary File does not exist.
File : /home/n02/n02/alexrap/um_files/ancils/biogen_n_distfln.N96
Got ancillary file name from Env Var USRMULTI
Ancillary File does not exist.
File : /home/n02/n02/alexrap/um_files/ancils/dust_N96_360cal
Got ancillary file name from Env Var DUSTSOIL
*
ERROR!!! in reconfiguration in routine Calc_nlookups
Error Code:- 10
Error Message:- Ancillary files have not been found - Check output for details
Error generated from processor 0
*
gc_abort (Processor 0 ): Job Aborted from Ereport

0+1 records in
0+1 records out
18569 bytes (19 kB) copied, 0.000636 seconds, 29.2 MB/s


I had a look and all those ancillary files are actually in that directory, namely /home/n02/n02/alexrap/um_files/ancils/, so I don't really know what causes this.

Thanks,
Alex.

comment:15 Changed 11 years ago by lois

On HECToR the batch job processors can only see the /work directory. Your ancillary files are all on the /home directory. If you just move the files and change the path of these files in the UMUI then your job should get further.

This is one of the major differences between HPCx and HECToR and it trips up many UM users.
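
A quick scan of a failed run's output can make this pattern obvious. The sketch below reads the "File :" lines that the reconfiguration prints for ancillaries it could not find and lists any path that does not start with /work. The function name ancils_outside_work and the exact line format are assumptions drawn from the output quoted in comment 14.

```shell
# Hypothetical helper: list missing-ancillary paths that are not under /work.
# Assumes lines of the form "File : /some/path" as in the .leave output.
ancils_outside_work() {
    awk '$1 == "File" && $2 == ":" && $3 !~ /^\/work\// { print $3 }' "$1"
}

# Example call (file name is a placeholder):
# ancils_outside_work my_job.leave
```

Every path it prints is a candidate for moving to /work and updating in the UMUI.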

Lois

comment:16 Changed 11 years ago by alexrap

Hi Lois,

I moved my ancillary files to the /work directory and submitted again. By the way, do I also have to move my script, model mods etc?

The error I'm getting now is:

Interpolating Field 65 ( Stashcode 150 ) W COMPNT OF WIND AFTER TIMESTEP

Setting w to zero, level 0
Setting w to zero, level 38

User Prognostic 66 ( Stashcode 151 ) RIVER SEQUENCE

*
ERROR!!! in reconfiguration in routine Rcf_Aux_File
Error Code:- 20
Error Message:- Dimensions of AUX file and dump file do not match
Error generated from processor 0
*
gc_abort (Processor 0 ): Job Aborted from Ereport

0+1 records in
0+1 records out
30137 bytes (30 kB) copied, 0.000665 seconds, 45.3 MB/s

Thanks,
Alex.

comment:17 Changed 11 years ago by lois

It is difficult to see where this comes from, Alex, while HECToR is down. It should be back Friday but it may be next week before I can look.

Lois

comment:18 Changed 11 years ago by lois

Hello Alex,

it turns out that one of the UM files on HECToR was corrupted so this may be the cause of your problem. I have copied the file over from HPCx again and tested it and all looks ok.

So could you try running your job again, but please remember to move the spectral files (those you use for the radiation scheme) to /work, not /home. All files you use for parallel jobs (running and reconfiguration) need to be on /work, whereas files you use for serial jobs (compiling) can be on /home.

Lois

comment:19 Changed 11 years ago by alexrap

Hello Lois,

I moved the spectral files and reran the job, but I'm getting the same error:

User Prognostic 66 ( Stashcode 151 ) RIVER SEQUENCE

*
ERROR!!! in reconfiguration in routine Rcf_Aux_File
Error Code:- 20
Error Message:- Dimensions of AUX file and dump file do not match
Error generated from processor 0
*
gc_abort (Processor 0 ): Job Aborted from Ereport

0+1 records in
0+1 records out
30137 bytes (30 kB) copied, 0.00064 seconds, 47.1 MB/s
Files in directory UM_DATAW= /work/n02/n02/alexrap/xecol

Thanks,
Alex.

comment:20 Changed 11 years ago by lois

Hello Alex,

I think I may have finally tracked down the problems but I have tried so many things that it is hard to say how many problems there actually are.

Firstly there was the problem of all the clashing mods, as you have a collection of Hadgem1, Hadgem1a, and Met Office mods. I have been using my job xdyqr to test everything and this is the set of mods I got to work. I have had to edit a few mods to remove bits which were clashing and these are in my /home space on HECToR. Please check the list carefully and copy my mods into your space if you are happy with them.

Secondly I turned off river routing. I don't think you need it and it has caused problems in the past. I believe that the Met Office are bringing out another, more robust version soon.

Thirdly I had to remove some Met Office NEC optimisations of the advection schemes (ECMWF quasi-cubic interpolation in the horizontal, quintic in the vertical) which seem to be the cause of the crashes. I did not investigate the precise reason why these were a problem on HECToR when they did not seem to cause a problem on HPCx.

HECToR is down for maintenance (again) this afternoon so I will not have time for any more checks. I hope that it is ok to leave this to you. Please let me know how you get on and if all is ok I will close this ticket.

Lois

comment:21 Changed 11 years ago by lois

  • Resolution set to fixed
  • Status changed from assigned to closed