Opened 8 years ago

Closed 7 years ago

Last modified 7 years ago

#625 closed help (fixed)

Running UMCET on phase2b

Reported by: agt Owned by: lois
Component: UM Model Keywords: umcet, vn4.5, phase2b
Cc: Platform: HECToR
UM Version: 4.5

Description

Hi all,

I'm not sure if you're still supporting UMCET or not, but if you are I'd appreciate it if you could adapt its scripts so that it works on phase 2b. I've not used it for some time, but it may come in handy for multi-member vn4.5 jobs, e.g. to ensure that the model fills the resource.

From what I can determine, startup.mod and gen_env.mod (origin: /work/n02/n02/hum/umcet/latest/um/um4.5) will need editing (possibly simply replacing "xt4" with "xe6" in their if statements).

More difficult is the call to ect. When processing an UMCET job in the UMUI (before submission; e.g., see job xgboe) one gets the error "Unsupported platform: phase2b.hector.ac.uk". It then assumes the platform is ibm.p690, which is obviously wrong. From what I can tell, this error message comes from /work/n02/n02/hum/umcet/latest/ect/ect.pl . There will also be a missing configuration file in the ./configs directory, which may just need the PEs-per-node value changing, although I'm not sure how the scripts in the ect/ directory determine which of those config files to read.
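The fallback described above is what a simple table lookup would do when it has no entry for the new host. The following Python sketch only illustrates that pattern; it is not the logic of ect.pl (which is Perl and is not reproduced here). The table contents, the function name resolve_platform, and the key "cray.xe6" are all hypothetical.

```python
# Hypothetical sketch of a host-to-platform lookup; NOT the real ect.pl logic.
KNOWN_PLATFORMS = {
    "hector.ac.uk": "cray.xt4",  # assumed entry, for illustration only
}
DEFAULT_PLATFORM = "ibm.p690"  # the fallback reported in the ticket


def resolve_platform(hostname, table=KNOWN_PLATFORMS, default=DEFAULT_PLATFORM):
    """Return (platform, warning); warning is None when the host is known."""
    if hostname in table:
        return table[hostname], None
    # Unknown host: warn and fall back to the default platform.
    return default, f"Unsupported platform: {hostname}"


# Without an entry for phase2b the lookup falls back to the default.
print(resolve_platform("phase2b.hector.ac.uk"))

# Adding one (hypothetical) entry is enough to remove the fallback.
extended = {**KNOWN_PLATFORMS, "phase2b.hector.ac.uk": "cray.xe6"}
print(resolve_platform("phase2b.hector.ac.uk", extended))
```

If ect.pl works along these lines, the fix would be a new table entry plus the matching config file, rather than any change to the fallback itself.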

Any help appreciated,

thanks,

Andy

Change History (6)

comment:1 Changed 8 years ago by lois

  • Owner changed from um_support to lois
  • Status changed from new to assigned

Hello Andy,

we certainly do continue to offer support for UMCET, even if there are not many users, as it offers an excellent way of improving the throughput of ensemble jobs and can give us access to the HECToR capability incentives.

However, as you say, we have not adapted the scripts for the XE6. This should not be a difficult task, but it is not one we can do this week: the two people who could do it are either on leave or on a course. I hope that it can be done quickly next week.

Lois

comment:2 Changed 8 years ago by agt

Thanks Lois,

that will be great,

Andy

comment:3 Changed 8 years ago by simon

Hi,

I've just had a look at this and the UMCET part of the umui wasn't updated for phase2b. I've now made the required edit. Could you try again? You'll have to restart the umui session for your run to ensure the changes are picked up.

Simon.

comment:4 Changed 8 years ago by agt

Simon,

thanks for your reply and the changes you have made to the UMUI. Unfortunately I think there is still more to this. The UMUI changes have at least allowed the run part of the job to process properly, although on submission the UMUI issues warnings:
Use of uninitialized value $Submit::STACK in concatenation (.) or string at (eval 3) line 1.
Use of uninitialized value $SUBMIT::NPROC in concatenation (.) or string at (eval 3) line 1.

The jobs at least go into the parallel queue, and start (and a fair bit of the set up is done) before falling over.

I have two sets of jobs tested:
xgbob/xgboc are the compile and run parts set up as standard
xgbod/xgboe are as above, but with startup.mod (source) and gen_env.mod (script) modified, simply replacing "XT4" with "XE6" (see /work/n02/n02/agt/umcet/um4.5 for those modified versions).

Both of those run jobs (xgboc, xgboe) give the umui messages as above.
In xgboc000.xgboc.d11153.t170129.leave and xgboe000.xgboe.d11153.t170522.leave there are still messages like:
"UMCET_MACHINE=xt4"
which may have a bearing on the failure.
In xgboc, the failure seems to be due to aprun: "aprun: -N must be a positive nonzero integer", presumably because the number of cores per node is not being set at all for this machine. In xgboe, the failure is still due to aprun, but the message is different. I presume that my changes to the source/script at least define the -N integer as something, but probably with the value for xt4 rather than xe6. In fact both of those .mod files need $UMCET_MACHINE as input, which must be supplied from somewhere; startup.mod contains the line:

CALL FORT_GET_ENV('UMCET_MACHINE',13,MACHINE_ENV,100,err)
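To illustrate how an unrecognised UMCET_MACHINE value could end up as a bad "-N" argument to aprun, here is a Python sketch of a cores-per-node lookup keyed on that variable. It is not taken from startup.mod or gen_env.mod; the table, the function name aprun_command, and the per-node values (4 for xt4, 24 for xe6) are assumptions for illustration and should be checked against the actual machines.

```python
import os

# Assumed cores per node; verify these against the real hardware.
PES_PER_NODE = {"xt4": 4, "xe6": 24}


def aprun_command(nproc, executable, env=None):
    """Build an aprun argument list using the UMCET_MACHINE setting."""
    env = os.environ if env is None else env
    machine = env.get("UMCET_MACHINE", "")
    per_node = PES_PER_NODE.get(machine.lower())
    if per_node is None:
        # Fail early rather than hand aprun an empty or zero -N value.
        raise ValueError(f"no PEs-per-node value for UMCET_MACHINE={machine!r}")
    # aprun: -n is the total number of PEs, -N the number of PEs per node.
    return ["aprun", "-n", str(nproc), "-N", str(min(per_node, nproc)), executable]


print(aprun_command(48, "./model.exe", {"UMCET_MACHINE": "xe6"}))
```

With a guard like this, a missing or stale machine name would be reported before submission instead of surfacing as an aprun failure inside the job.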

I think the failure must lie in the lack of an ect config file for phase2b. In the directory /work/n02/n02/hum/umcet/latest/ect/configs there is a series of config files, one for each machine. I don't know how the file to use is selected, but there isn't one for the xe6 (there is one called "cray.xt4", for example). However, those files do contain the line:
$CONFIG_PES_PER_NODE = 2
which will need changing.
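If a new config file can be derived from an existing one by changing only this value, the edit could be scripted. A minimal Python sketch, assuming the file holds exactly one assignment of the form shown above; the function name set_pes_per_node and the value 24 are illustrative assumptions, and the real file may well need other changes too.

```python
import re

# Matches the assignment and captures everything up to the number.
_PES_LINE = re.compile(r"^(\s*\$CONFIG_PES_PER_NODE\s*=\s*)\d+", re.MULTILINE)


def set_pes_per_node(config_text, value):
    """Return config_text with the $CONFIG_PES_PER_NODE value replaced."""
    new_text, count = _PES_LINE.subn(lambda m: f"{m.group(1)}{value}", config_text)
    if count != 1:
        # Refuse to guess if the line is missing or appears more than once.
        raise ValueError(f"expected one $CONFIG_PES_PER_NODE line, found {count}")
    return new_text


xt4_text = "$CONFIG_PES_PER_NODE = 2\n"
print(set_pes_per_node(xt4_text, 24), end="")
```

The function works on text rather than on files, so the result can be inspected before anything is written into the configs directory.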

I hope this info helps!

thanks,

Andy

comment:5 Changed 7 years ago by ros

  • Platform set to <select platform>
  • Resolution set to fixed
  • Status changed from assigned to closed

comment:6 Changed 7 years ago by ros

  • Platform changed from <select platform> to HECToR