Opened 6 years ago

Closed 5 years ago

#1245 closed help (fixed)

compile problems on Archer

Reported by: agt Owned by: um_support
Component: UM Model Keywords:
Cc: Platform: ARCHER
UM Version: 7.8

Description

Hi,

I'm trying to compile my first um job on archer but the compilation is failing.

The job is xjnfa and is derived from one of Nick Klingaman's runs. It will be N512, vn7.8, GA3.

The first mention of error in the leave file xjnfa000.xjnfa.d14065.t171643.comp.leave for my job (as user agt) is:

"ModuleCmd_Switch.c(172):ERROR:152: Module 'cce' is currently not loaded"

My .profile seems to point to cce (and was copied from hector),

cheers,

Andy

Change History (12)

comment:1 Changed 6 years ago by ros

Hi Andy,

Can you please change permissions on your ARCHER /home and /work directories so that we can read them?

chmod -R g+rX /home/n02/n02/agt
chmod -R g+rX /work/n02/n02/agt
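For anyone unsure what the capital X does: it adds execute (search) permission only to directories, and to files that already have an execute bit set, so plain data files do not become executable. A minimal sketch on a scratch directory follows. The directory and file names are made up for illustration, and `stat -c` assumes GNU coreutils, as found on Linux login nodes.

```shell
# Build a throwaway tree with all group access removed.
demo=$(mktemp -d)
mkdir "$demo/sub"
touch "$demo/sub/data.txt"
chmod 700 "$demo" "$demo/sub"
chmod 600 "$demo/sub/data.txt"

# Same form as the commands above, applied to the scratch tree.
chmod -R g+rX "$demo"

# Directories gain r-x for the group; the plain file gains only r.
stat -c '%A %n' "$demo" "$demo/sub" "$demo/sub/data.txt"
```

Running the same `stat` line on the real /home and /work directories shows whether the group bits were actually applied.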

Thanks.
Ros.

comment:2 Changed 6 years ago by grenville

The emails below are copies of the conversation going on outside Trac:

Hi Grenville:

r--r--r enabled on both now.

I hadn't seen those updated instructions so will follow now.

Also, is an ln -s to work/ necessary from home from a model point of view?

There was a line:

test -z "$PROFILEREAD" && . /etc/profile

but not as you listed, so I've added those,

cheers,

Andy

On 10/03/14 09:46, Grenville Lister wrote:

Andy

Please let us have read access to your home and work directories.

Do you have the following in your profile

. /etc/profile
. /etc/bash.bashrc

I added these to our ARCHER instructions recently.
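A quick way to check for those two lines is sketched below. It assumes the login profile is ~/.profile (pass a different path if your account reads another file), and the function name is purely illustrative. It only counts lines that source the file unconditionally at the start of a line, in the form listed above.

```shell
# Report whether a profile file sources the two system files.
# Usage: check_profile [path-to-profile]   (defaults to ~/.profile)
check_profile() {
    profile=${1:-"$HOME/.profile"}
    rc=0
    for f in /etc/profile /etc/bash.bashrc; do
        # Match ". FILE" or "source FILE" at the start of a line.
        if grep -Eq "^[[:space:]]*(\.|source)[[:space:]]+$f([[:space:]]|\$)" "$profile"; then
            echo "ok: $profile sources $f"
        else
            echo "missing: $profile does not source $f"
            rc=1
        fi
    done
    return $rc
}
```

Calling `check_profile` with no argument inspects ~/.profile and returns non-zero if either line is absent.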

Andy

Also, is an ln -s to work/ necessary from home from a model point of view?

This won't work on ARCHER (on HECToR it originally didn't work and then it did, but probably shouldn't have).
Here's some of the reasoning for this architecture:

Cray systems come without /home mounted by default, and I had a chat with the Cray team on Friday to better understand the issues around this. Besides the performance issue already mentioned, they regard it as a risk to the whole system, since problems on /home (potentially from a single user's job) could bring down the PBS MOM nodes and with them the whole job submission system. There is also a more pragmatic problem: mounting /home would require new kit to be purchased, which would mean negotiating who would purchase and install it.

comment:3 Changed 6 years ago by agt

I've again performed the chmod commands as Ros suggested, with the "X" included also.

cheers,

Andy

comment:4 Changed 6 years ago by agt

Hi,

I've tried to compile and submit again, using a full extract, but this time there's another extract failure. The ext.out file in /home/agt/FCM_extracts says:

WARNING: svn://puma/UM_svn/UM/branches/dev/um/VN7.8_machine_cfg/src/configs/machines/cray-xc30-cce-archer/ext_libs/oasis3.cfg@13712: LINE 9:

%netcdf_lib_path: variable not expanded

any help is appreciated,

cheers,

Andy

comment:5 Changed 6 years ago by ros

Hi Andy,

Can you please check that you can ssh from PUMA to ARCHER without it prompting for a password or passphrase? There is a "permission denied" error message in /home/agt/FCM_extracts/xjnfa/umbase/ext.out. I suspect that after PUMA's relocation yesterday you just need to run ssh-add to get the ssh-agent set up properly again.
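One way to test this non-interactively is ssh's BatchMode option, which makes ssh fail rather than prompt for a password or passphrase. The following is a sketch to be run on PUMA. The function name is illustrative and the target is whatever ARCHER username and login host you normally use.

```shell
# Return 0 if a key-based login to the target works without any prompt.
# Usage: check_passwordless_ssh user@host
check_passwordless_ssh() {
    target=$1
    # BatchMode=yes disables password/passphrase prompts, so a missing
    # or unloaded key shows up as a failure instead of a hang.
    if ssh -o BatchMode=yes -o ConnectTimeout=10 "$target" true; then
        echo "passwordless ssh to $target works"
    else
        echo "ssh to $target would prompt or fail; try ssh-add and retest"
        return 1
    fi
}
```

If the check fails, `ssh-add -l` lists the keys the agent currently holds, which helps confirm whether the agent lost its keys.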

The warning you pasted above I will look into, but it shouldn't stop your job working.

Cheers,
Ros.

comment:6 Changed 6 years ago by agt

Hi Ros,

indeed that was a problem although I thought I'd fixed it since it didn't raise the usual messages,

But now we're back to the compiling problem, see
xjnfa000.xjnfa.d14072.t113626.comp.leave

cheers,
Andy

comment:7 Changed 6 years ago by ros

Hi Andy,

This job is set up to use a non-standard version of GCOM which Oliver Darbyshire built on HECToR. Oliver has since left the Met Office so I can't ask him why this was necessary.

As a first step, can you try switching off the compile override ~odarbysh/overrides/hector_cce_8.0_thread0_gcom_ios and we'll take it from there.

Cheers,
Ros.

comment:8 Changed 6 years ago by agt

Hi Ros,

thanks. I'm sorry for the delayed reply as I misplaced the latest ticket update.

I made a further change to remove another dependency on another user's directory. The result of that run is: xjnfa000.xjnfa.d14083.t161532.comp.leave .

Further, having implemented your change, the compilation leave file is xjnfa000.xjnfa.d14083.t164204.comp.leave, and the job appears to have got through compilation now and is queuing for reconfiguration.

So we are over the first hurdle at least!

I may also need to run a GA6 job at N512. I believe there is already one ready for archer, is that the case?

cheers,

Andy

comment:9 Changed 6 years ago by grenville

Andy

Please see xjant for a GA6 N512 job. Karthee has been running this job for PLV and is the best contact for details.

Best

Grenville

comment:10 Changed 6 years ago by agt

Hi Ros, (Grenville)

The GA3 model started to run and then fell over shortly afterwards; please see /home/n02/n02/agt/umui_out/xjnfa000.xjnfa.d14083.t164204.leave . Some stash files had started to be written, but they contain no data. So my question is: is this just the general instability of this version of N512/GA3, or is there some other reason?

I note Grenville's comment above, from only 6 minutes ago, so I will investigate the GA6 version in parallel.

thanks,

Andy

comment:11 Changed 5 years ago by annette

This ticket is being closed due to lack of activity.

Annette

comment:12 Changed 5 years ago by annette

  • Resolution set to fixed
  • Status changed from new to closed