Opened 7 years ago

Closed 7 years ago

#1076 closed help (fixed)

Running UKV in HECToR - mpirun: not found

Reported by: oma Owned by: um_support
Component: UM Model Keywords: mpirun
Cc: Platform: HECToR
UM Version: 7.3

Description

Hello,

I am trying to run a UKV job on HECToR (username: oma, job: xiofa), but the following error appears:

*********************************************************
UM Executable : /work/n02/n02/oma/xiofb/bin/UM7.3_UKV_PS22.exe
*********************************************************


/work/n02/n02/oma/xiofa/bin/qsexecute[982]: mpirun: not found [No such file or directory]
xiofa: Run failed
*****************************************************************
   Ending script   :   qsexecute
   Completion code :   127
   Completion time :   Mon Jun  3 16:00:09 BST 2013
*****************************************************************

The job is a copy of the UMUI job xerif, and I compiled the model with a copy of the UMUI job xerig.

I read in another ticket that the problem might be the call to mpirun instead of aprun to launch the parallel job, but I have no idea how to correct this. Any help would be much appreciated.

Thanks,

Oscar
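Completion code 127 in the log above is the shell reporting that it could not find the command at all. As a minimal diagnostic sketch (the function name `find_launcher` is hypothetical and not part of the UM scripts), the following checks which parallel launcher is visible on PATH. On Cray systems such as HECToR the launcher is normally aprun rather than mpirun.

```shell
#!/bin/sh
# Hypothetical helper: print the first MPI launcher found on PATH.
# qsexecute's "mpirun: not found" with completion code 127 means the
# shell could not locate the command at all.
find_launcher() {
  for cmd in aprun mpirun mpiexec; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "$cmd"
      return 0
    fi
  done
  echo "no MPI launcher found on PATH" >&2
  return 1
}

# Example: find_launcher || echo "check the job environment"
```

Run in the same environment as qsexecute, this would be expected to print aprun on HECToR; if it finds nothing, the job environment (for example the .profile setup discussed below) is the first thing to check.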

Change History (11)

comment:1 Changed 7 years ago by willie

Hi Oscar,

I don't get this problem. Check that your .profile contains the following:

UMDIR=/work/n02/n02/hum
TARGET_MC=cce

export VN=7.3


# Setup UM variables

if [ -f $HOME/.umsetvars_$VN ]
then
  . $HOME/.umsetvars_$VN
else
  . $UMDIR/vn$VN/$TARGET_MC/scripts/.umsetvars_$VN
fi
:
:
. $UMDIR/bin/loadcomp
loadcomp $TARGET_MC
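The if/else above sources a personal `$HOME/.umsetvars_$VN` in preference to the central copy under $UMDIR. A small sketch that prints which file would be chosen, without sourcing it, can help confirm which settings a job picks up. It assumes the same variables as the .profile (HOME, UMDIR, TARGET_MC, VN), and `which_umsetvars` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper mirroring the .profile logic above: print the
# .umsetvars file that would be sourced, without sourcing it.
# Assumes HOME, UMDIR, TARGET_MC and VN are set as in the .profile.
which_umsetvars() {
  if [ -f "$HOME/.umsetvars_$VN" ]; then
    echo "$HOME/.umsetvars_$VN"
  else
    echo "$UMDIR/vn$VN/$TARGET_MC/scripts/.umsetvars_$VN"
  fi
}

# Example: which_umsetvars
```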

My copy of your job failed with "Wrong number of atmospheric prognostic fields", so you should reconfigure the start dump.

Regards,

Willie

comment:2 Changed 7 years ago by oma

Hi Willie,

I've modified my .profile to look like yours, but I still get the same error. When I request reconfiguration I get a similar error, but at the reconfiguration step (see below). Apparently mpirun cannot be found…

Do you think I'm missing some settings in the UMUI? Which model did you try running: my compilation or one of your own? Perhaps if I do exactly what you did I can get it to run.

Oscar

/work/n02/n02/oma/xiofa/bin/qsexecute: Executing dump reconfiguration program

*********************************************************
RCF Executable : /work/n02/n02/oma/xiofb/bin/qxreconf
*********************************************************


/work/n02/n02/oma/xiofa/bin/qsexecute[414]: mpirun: not found [No such file or directory]
/work/n02/n02/oma/xiofa/bin/qsexecute: Error in dump reconfiguration - see OUTPUT
*****************************************************************
   Ending script   :   qsexecute
   Completion code :   127
   Completion time :   Tue Jun  4 17:54:13 BST 2013
*****************************************************************

comment:3 Changed 7 years ago by willie

Hi Oscar,
I just copied your experiment xiof and changed the user details - see user 'willie', experiment xiox. You could try deleting the xiofa work directory and repeating the run.

Regards

Willie

comment:4 Changed 7 years ago by oma

Hi Willie,

I tried deleting the directories and starting again, and I made some progress. The job now starts but stops with the following error:

 *****************************************************************************
 ERROR!!! in reconfiguration in routine Rcf_Exppx
 Error Code:-  2
 Error Message:-  Cant find required STASH item  7  section  0  model  1  in STASHmaster
 Error generated from processor  0
 *****************************************************************************

I suspect I need to supply modifications to the STASHmaster. Do you have files or examples of such modifications available?

Thanks,

Oscar

comment:5 Changed 7 years ago by willie

Hi Oscar,

This occurs because you are using recent start dumps with an older model, so the model doesn't know about all the data. This can be overcome by creating user STASH files to remove the data. See the examples on PUMA in /home/umui/interim_6.1_6.6_USTASH/. You can copy the STASH item from a recent STASHmaster file (e.g. vn8.2). You may have to modify the space code; this is discussed in UMDP C4.

Regards

Willie
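To copy an entry from a newer STASHmaster into a user STASHmaster file, something like the sketch below can pull out a single entry. It assumes (this is not confirmed in the ticket) that each STASHmaster entry spans five lines beginning `1|` to `5|`, with model, section and item in the second to fourth `|`-separated fields of the `1|` line; check this against UMDP C4 and the actual file before relying on it. `extract_stash_entry` is hypothetical, and the STASHmaster path in the example is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper: print one STASHmaster entry so it can be pasted
# into a user STASHmaster file.
# ASSUMPTION: each entry is five lines starting "1|" .. "5|", and the
# "1|" line holds model, section and item in '|' fields 2-4.
# Usage: extract_stash_entry FILE MODEL SECTION ITEM
extract_stash_entry() {
  awk -F'|' -v m="$2" -v s="$3" -v i="$4" '
    # A "1|" line starts a new entry: keep it only if it matches.
    $1 == "1" { keep = ($2 + 0 == m + 0 && $3 + 0 == s + 0 && $4 + 0 == i + 0) }
    # Print all five lines of a matching entry.
    keep && $1 ~ /^[1-5]$/ { print }
    # A "5|" line ends the entry.
    $1 == "5" { keep = 0 }
  ' "$1"
}

# Example, for the item named in the Rcf_Exppx error (model 1, section 0, item 7):
# extract_stash_entry /path/to/newer/STASHmaster 1 0 7
```

The space code in the copied entry may still need editing, as noted above.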

comment:6 Changed 7 years ago by oma

Hi Willie,

I seem to have overcome the previous problem by supplying a user STASHmaster file as suggested. I'm now encountering the following new error message:

 !!!! ANCIL_MSTR /work/n02/n02/hum/vn7.3/ctldata/ANCILmaster/ANCILfields_A                                                           
     
 HdAncilM : No of ANCILmaster records in ANCILfields_A 175
 !!!! ANCIL_MSTR /work/n02/n02/hum/vn7.3/ctldata/ANCILmaster/ANCILfiles_A                                                            
     
 HdAncilM : No of ANCILmaster records in ANCILfiles_A  44
 Total No of ANCILmaster records (Fields)  175
 Total No of ANCILmaster records (Files )  44
  
 Ancillary Files to be opened : 
 File No  1  Ozone                                    
 File No  9  Land Sea Mask                            
 File No  10  Orography                                
 File No  12  Murkiness                                
 File No  17  Initial fractions of surface types       
  
 **FATAL ERROR WHEN READING/WRITING MODEL DUMP**
 buffer in of fixed length header
 Error code =   1.00
 Length requested            =       256
 Length actually transferred =         0
  Fatal error codes are as follows:
 -1.0 Mismatch between actual and requested data length
  0.0 End-of-file was read
  1.0 Error occurred during read
  2.0 Other disk malfunction
 3.0 File does not exist
 ***********************************************
 Problem in reading fixed header from Anc File.
 File : /work/n02/n02/hum/ancil/atmos/ukv/aerosols/gems/v1/                                                                     
 *****************************************************************************
 ERROR!!! in reconfiguration in routine Calc_nlookups
 Error Code:-  1
 Error Message:- Problem with reading fixed header from Ancillary File.
 Error generated from processor  0
 *****************************************************************************

What do you think is happening now? It looks like a problem with the ancillary files. Any suggestions would be very welcome.

Thanks,

Oscar

comment:7 Changed 7 years ago by willie

Hi Oscar,

The intention of the xeri* series was to start from a global start dump and carry out the following steps:

  1. run xeria to produce the NAE and LBC files
  2. run xerig to create the UKV executables
  3. run xerie with data input from xeria to create the variable resolution start dump and LBC files
  4. run xerid with the start dump from xerie
  5. run xerif with the start dump from xerid and the LBC outputs of xerie as input

The executables for xerid and xerif should point to those built in the build job xerig. The reconfiguration step (4) is necessary because the start dump output by xerie has too many prognostic variables.

Regards,

Willie
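Since steps 4 and 5 depend on the executables built by xerig, a quick sanity check before submitting is to confirm those files exist and are executable. A minimal sketch follows; `check_executables` is hypothetical, and the example paths are the ones that appear in the logs earlier in this ticket.

```shell
#!/bin/sh
# Hypothetical helper: confirm that each executable a run job points at
# exists as a regular file and is executable. Returns non-zero if any
# is missing.
check_executables() {
  status=0
  for exe in "$@"; do
    if [ -f "$exe" ] && [ -x "$exe" ]; then
      echo "found: $exe"
    else
      echo "missing or not executable: $exe" >&2
      status=1
    fi
  done
  return "$status"
}

# Example, using the build directory that appears in this ticket's logs:
# check_executables /work/n02/n02/oma/xiofb/bin/UM7.3_UKV_PS22.exe \
#                   /work/n02/n02/oma/xiofb/bin/qxreconf
```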

comment:8 Changed 7 years ago by oma

Hi Willie,

I think the model executable is fine, and I'm trying to reconfigure the data to get the right number of prognostic variables. However, there seems to be a mismatch in the files used by the reconfiguration. As you can see in the error message above, the problem is in

File: /work/n02/n02/hum/ancil/atmos/ukv/aerosols/gems/v1/

which doesn't really give a file name!!

I looked in $UMDIR to work out what the correct file should be, but I couldn't find the source of the error, although I suspect it is in the total aerosols section of the ancillary files.

Do you have an example of a UKV run starting from a UKV start dump? Or could you try running my job again to see if you get the same error?

Thanks in advance,

Oscar
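The reported path ends in a slash, which suggests the reconfiguration was given a directory rather than an ancillary file; that would be consistent with the log showing 256 words requested and 0 transferred. A minimal sketch for checking what a configured ancillary path actually points to is below. It only tests existence, type, readability and size, not whether the file is a valid ancillary, and `check_ancil_path` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: report what an ancillary path points to.
# Checks only existence, type, readability and size, not whether the
# file is a valid UM ancillary.
check_ancil_path() {
  p=$1
  if [ -d "$p" ]; then
    echo "directory, not a file: $p"
    return 1
  elif [ ! -e "$p" ]; then
    echo "does not exist: $p"
    return 1
  elif [ ! -r "$p" ]; then
    echo "not readable: $p"
    return 1
  elif [ ! -s "$p" ]; then
    echo "empty file: $p"
    return 1
  fi
  echo "ok: $p"
}

# Example, with the path from the Calc_nlookups error:
# check_ancil_path /work/n02/n02/hum/ancil/atmos/ukv/aerosols/gems/v1/
```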

comment:9 Changed 7 years ago by willie

Hi Oscar,

Your start dump $DEVTDIR/ukv/20120430_qwqv06.T+1 is the output of the UKV model (e.g. xerif).

Regards

Willie

comment:10 Changed 7 years ago by oma

Hi Willie,

This message is rather long, but I thought it would be useful to describe how I solved the reconfiguration problem in the end.

I copied the UMUI job xerid to reconfigure the start dump (a UKV dump from a newer UM version). I then got the following error message:

       Copying Field 56 ( Section   0 ) ( Stashcode 241 ) CANOPY SNOW CAPACITY           KG/M2
 *****************************************************************************
 ERROR!!! in reconfiguration in routine Rcf_vertical
 Error Code:-  10
 Error Message:- No interpolation, but data field sizes/levels are different!
 Error generated from processor  0
 *****************************************************************************

Looking through old tickets (#884), I found a suggestion by Grenville to include the hand-edit ~grenville/hand_edits/handedit_1to9tiles, which contains the following:

# Allow reconfiguration of a 1-tile dump to a 9-tile dump

pwd

cat RECONA | perl -pe 's{\n}{<%newline%>}' |                   \
 perl -pe 's{(&ITEMS.*)}{                                      \
 &ITEMS ITEM=229, DOMAIN=1, SOURCE=8, &END                     \
 &ITEMS ITEM=230, DOMAIN=1, SOURCE=6, User_Prog_RConst=-1 &END \
 &ITEMS ITEM=233, DOMAIN=1, SOURCE=8, &END                     \
 &ITEMS ITEM=234, DOMAIN=1, SOURCE=6, User_Prog_RConst=-1 &END \
 &ITEMS ITEM=236, DOMAIN=1, SOURCE=3, &END                     \
 &ITEMS ITEM=237, DOMAIN=1, SOURCE=3, &END                     \
 &ITEMS ITEM=240, DOMAIN=1, SOURCE=8, &END                     \
 &ITEMS ITEM=490, DOMAIN=1, SOURCE=3, &END                     \
 \1} ; s{<%newline%>}{\n}g' > RECONA_new

mv -f RECONA RECONA_old
mv -f RECONA_new RECONA
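Before adding entries to a hand-edit like this, it can help to see which STASH items RECONA already lists in &ITEMS entries. The sketch below assumes entries of the form `&ITEMS ITEM=nnn, ...`, as in the hand-edit above; `list_recona_items` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: list the ITEM numbers of &ITEMS entries in a
# reconfiguration namelist file (default RECONA), one per line.
# Assumes entries look like "&ITEMS ITEM=nnn, ..." as in the hand-edit above.
list_recona_items() {
  awk '{
    s = $0
    # A line may hold several entries, so scan repeatedly.
    while (match(s, /&ITEMS[ \t]+ITEM=[0-9]+/)) {
      m = substr(s, RSTART, RLENGTH)
      sub(/.*ITEM=/, "", m)
      print m
      s = substr(s, RSTART + RLENGTH)
    }
  }' "${1:-RECONA}"
}

# Example: list_recona_items RECONA | grep -qx 241 || echo "no entry for item 241"
```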

Since my problem was with STASH item 241, I copied the line for ITEM=240 and changed ITEM=240 to ITEM=241. I did the same for ITEM=242 after receiving a similar message for that field. I then received a new error message:

 Initialising snow amount on tiles from gb mean snow
 *****************************************************************************
 ERROR!!! in reconfiguration in routine Rcf_Field_Calcs
 Error Code:-  30
 Error Message:-  No Field Calculations specified for section  0 item 241
 Error generated from processor  0
 *****************************************************************************

Apparently this one occurred because I had set SOURCE=8 in the lines I added, and this setting expects the reconfiguration to calculate the fields involved. Since I didn't know what calculation to specify, I changed it to SOURCE=3, which sets the field to zero. I'm very unsure about this step, but it seemed to work: the reconfiguration completed, and hopefully the model will run.

Thanks,

Oscar
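The final change described in this comment can be written in the same form as Grenville's hand-edit. The sketch below uses awk instead of the perl one-liners (a swapped-in technique) and, like the original, places the new &ITEMS entries before the first existing &ITEMS entry in RECONA, keeping the original as RECONA_old. The function name `add_recona_items` is hypothetical, and as noted in the comment, whether SOURCE=3 (set to zero) is appropriate for items 241 and 242 is unconfirmed.

```shell
#!/bin/sh
# Sketch of a hand-edit adding section 0 items 241 and 242 to RECONA
# with SOURCE=3 (set to zero), the setting used in comment:10.
# Uses awk instead of perl. Like the perl version, the new entries go
# before the first existing &ITEMS entry; the original file is kept
# as <file>_old. Assumes at least one &ITEMS entry exists; otherwise
# nothing is added.
add_recona_items() {
  f=${1:-RECONA}
  awk '
    !done && /&ITEMS/ {
      print " &ITEMS ITEM=241, DOMAIN=1, SOURCE=3, &END"
      print " &ITEMS ITEM=242, DOMAIN=1, SOURCE=3, &END"
      done = 1
    }
    { print }
  ' "$f" > "${f}_new" &&
    mv -f "$f" "${f}_old" &&
    mv -f "${f}_new" "$f"
}

# In a hand-edit script (run in the directory containing RECONA):
# add_recona_items RECONA
```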

comment:11 Changed 7 years ago by willie

  • Resolution set to fixed
  • Status changed from new to closed