Opened 6 months ago

Closed 5 months ago

#2151 closed help (answered)

Problem with a Global Reconfiguration job

Reported by: sam89
Owned by: um_support
Priority: high
Component: UM Model
Keywords: Global
Cc:
Platform: Monsoon2
UM Version: 8.2

Description

When I run the reconfiguration job it falls over straight away and I keep getting this error:

ERROR: the Ancil filenames version /projects/um1/ancil/ancil_versions/filenames/v2 not found

I then looked on Monsoon to find out where the ancillaries are now, and I found what I think is them in
/projects/um1/ancil/data/ancil_versions/N512/v2, so I changed the orography and land-sea mask file directory to this location and changed the filename to ancils. I reran it and got the same error again, so I was wondering if you could look into it to figure out the issue.
The build job is xnjja and the reconfiguration job I tried is xnjjj.
I assume I have missed changing something in the job, since it still appears to be looking for the /projects/um1/ancil/ancil_versions/filenames/v2 directory, but I can't see where, and I am not sure I changed it to the right thing anyway…

Thanks

Sam Clarke

Change History (17)

comment:1 Changed 6 months ago by willie

Hi Sam,

It looks like the ancil version files have been moved on the new Monsoon. You should revert to the initial setup and then change the location on the UMUI page Ancillary and Input data > infile related > Ancillary version files. Change

$UMDIR/ancil/ancil_versions/... -> $UMDIR/ancil/data/ancil_versions/...

in both boxes. That should fix it.

Regards
Willie

comment:2 Changed 6 months ago by sam89

I changed both boxes to $UMDIR/ancil/data/ancil_versions/n512/ps30/v2 but it now gives this error:

ERROR: the Ancil filenames version /projects/um1/ancil/data/ancil_versions/n512/ps30/v2 not found


comment:3 Changed 6 months ago by willie

Hi Sam,

I think you need to add /ancils to the end - it's looking for a file rather than a directory.
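
A quick way to confirm this before resubmitting is to check whether a given path resolves to a file or a directory. The sketch below is only illustrative: the helper name `describe_ancil_path` is made up for this example, and the path is the one from this ticket with /ancils appended.

```python
from pathlib import Path


def describe_ancil_path(path_str: str) -> str:
    """Classify a path as 'file', 'directory' or 'missing'."""
    path = Path(path_str)
    if path.is_file():
        return "file"
    if path.is_dir():
        return "directory"
    return "missing"


if __name__ == "__main__":
    # Path from this ticket, with /ancils appended as suggested above.
    candidate = "/projects/um1/ancil/data/ancil_versions/n512/ps30/v2/ancils"
    print(candidate, "->", describe_ancil_path(candidate))
```

If the version path reports "directory", the job has been pointed at the folder rather than the ancils file inside it.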

Willie

comment:4 Changed 6 months ago by sam89

Hi Willie

Thanks for that, it is working now!

Sam

comment:5 Changed 6 months ago by sam89

I seem to be having a separate issue now:

lib-4205 : UNRECOVERABLE library error

The program was unable to request more memory space.

tcmalloc: large alloc 16744049999630761984 bytes == (nil)
tcmalloc: large alloc 16744049999630761984 bytes == (nil)
tcmalloc: large alloc 16744049999630761984 bytes == (nil)
tcmalloc: large alloc 16744049999630761984 bytes == (nil)

I have not seen this error before…
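
For scale, the size in the tcmalloc message is larger than the maximum signed 64-bit integer, so no machine could satisfy it. One possible reading (an assumption, not confirmed anywhere in this ticket) is that the size being allocated was corrupted or uninitialised rather than genuinely too large for the node. The arithmetic can be checked with a short Python sketch:

```python
# Size in bytes, copied from the tcmalloc message above.
requested = 16744049999630761984

EXBIBYTE = 2 ** 60
SIGNED_64_MAX = 2 ** 63 - 1

# Reinterpret the unsigned value as a signed 64-bit integer (two's complement).
as_signed = requested - 2 ** 64 if requested > SIGNED_64_MAX else requested

print(f"requested: {requested / EXBIBYTE:.2f} EiB")
print(f"exceeds signed 64-bit max: {requested > SIGNED_64_MAX}")
print(f"as signed 64-bit: {as_signed}")
```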

comment:6 Changed 6 months ago by willie

Hi Sam,

The new Monsoon has 36 cores per node, so you could use 12EW x 9NS to get an exact multiple and more processors, so more memory available per processor. See http://collab.metoffice.gov.uk/twiki/bin/view/Support/WhatIsMONSooN for details.
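
The node arithmetic behind this suggestion can be sketched as follows. The function name `decomposition_summary` is hypothetical; the only figure taken from this ticket is the 36 cores per node.

```python
def decomposition_summary(ew: int, ns: int, cores_per_node: int = 36):
    """Return (total processors, full nodes, fills nodes exactly?)."""
    total = ew * ns
    full_nodes, leftover = divmod(total, cores_per_node)
    return total, full_nodes, leftover == 0


if __name__ == "__main__":
    total, nodes, exact = decomposition_summary(12, 9)
    print(f"12 x 9 = {total} processors, {nodes} full nodes, exact fit: {exact}")
```

A decomposition that leaves a remainder would occupy part of an extra node, which is why an exact multiple of 36 is preferred here.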

Willie

comment:7 Changed 6 months ago by sam89

Hi Willie

I tried changing to 12 x 9 and other variations on that, but I am still receiving the same error in the output file.

Sam

comment:8 Changed 6 months ago by willie

Hi Sam,

Some unnecessary switches had been set, so I changed these and it now works (see my job xnjsa). This reconfiguration job uses the executables built in xnjja and so does not need to build anything, so I

  • switched off the FCM option to include modifications from a branch,
  • deselected enable UM scripts,
  • changed the ozone ancillaries back to their environment variables,
  • changed the reconfiguration processors to 12x9, a multiple of 36 (not the model processors, which are not used in this job).

Regards
Willie

comment:9 Changed 6 months ago by sam89

Hi Willie

I copied your job across and changed it to use my username etc. and set it to run, but it seems to be stuck. I set it running at 9:41 and it has not output anything; it has only created a .rcf.leave file and not a .comp.leave file. I know the reconfiguration jobs are usually quick, so I am not sure why it seems to be stuck. It didn't seem to fail, but it also isn't progressing.

Also, how do I check where my job is in the queue on Monsoon? I tried llq -u saclar and it said llq does not exist. Has it changed?

Sam

comment:10 Changed 6 months ago by willie

Hi Sam,

I think you might need to switch UM Build scripts on - I'm not sure why.

The new Monsoon uses qstat instead of llq.
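
As a rough sketch, the old llq -u saclar check could be replaced with something like the following. The `-u` flag is the usual PBS option for filtering qstat by user; whether Monsoon's qstat accepts it is an assumption here, and the function names are made up for illustration.

```python
import shutil
import subprocess


def queue_command(user: str) -> list:
    """Build the qstat command line (assumes the PBS-style -u flag)."""
    return ["qstat", "-u", user]


def show_queue(user: str):
    """Return qstat's output for the user, or None if qstat is not on PATH."""
    if shutil.which("qstat") is None:
        return None
    result = subprocess.run(
        queue_command(user), capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    output = show_queue("saclar")
    print(output if output is not None else "qstat not found on this machine")
```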

Regards
Willie

comment:11 Changed 6 months ago by sam89

I switched it on, but now it won't even submit, as I keep getting this error:

ERROR: puma.nerc.ac.uk: Permission denied while attempting to access account saclar on host xcslc0. Note that repeated failures may result in expiry of password due to security procedures on some machines. Check user id, hostname and password for your account on the host machine.

I checked that I can log in to both puma and Monsoon separately, so that is not the issue. I also didn't get this error when I last tried, before switching the UM scripts on. Perhaps there is somewhere in the job where I have missed changing your directory back to mine, but I cannot find it if so.

Sam

comment:12 Changed 6 months ago by willie

Hi Sam,

In xnjjy, you need to change the FCM extract directory to your own.

Willie

comment:13 Changed 6 months ago by sam89

Hi Willie,

Thanks. I tried running it again and now get this error:
RCF Executable : /projects/diamet/saclar/xnjja/xnjja.arecon
*

[Tue May 2 11:06:55 2017] [c4-2c1s8n0] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(506):
MPID_Init(192).......: channel initialization failed
MPID_Init(569).......: PMI2 init failed: 1
/projects/diamet/saclar/xnjjy/bin/qsrecon: Error in dump reconfiguration - see OUTPUT
*

Ending script : qsrecon
Completion code : 1
Completion time : Tue May 2 11:06:55 GMT 2017

*

/projects/diamet/saclar/xnjjy/bin/qsmaster: Failed in qsrecon in job xnjjy

<<<< Information about How Many Lines of Output follow >>>>
20 lines in main OUTPUT file.
0 lines of O/P from pe0.
<<<< Lines of Output Information ends >>>>

I went to look at the pe_output files for the job but none seem to exist. I imagine it is failing to find one of the ancillary files, but I cannot determine which one because there are no output files. Perhaps it cannot find any of the ancillary files, since the main OUTPUT file has only 20 lines…

Sam

comment:14 Changed 5 months ago by willie

Hi Sam,

I think this is a Monsoon problem. We are investigating it.

Regards
Willie

comment:15 Changed 5 months ago by willie

Hi Sam,

It's not a Monsoon problem at all. Just switch on the FCM option to include modifications from a branch.

Regards
Willie

comment:16 Changed 5 months ago by grenville

Closed for lack of activity

comment:17 Changed 5 months ago by grenville

  • Resolution set to answered
  • Status changed from new to closed