Opened 9 years ago

Closed 8 years ago

#486 closed help (fixed)

Setting up an ensemble run using UMCET

Reported by: aschurer Owned by: jeff
Component: UM Model Keywords: UMCET
Cc: Platform:
UM Version: 4.5

Description

Hi, I am trying to set up an ensemble run on the UM vn 4.5 using UMCET and am having problems trying to get the model to run. I have followed the steps outlined on the NCAS website (http://ncas-cms.nerc.ac.uk/index.php/um-documentation/ncas-user-guides/1435-umcet). I have run the "compile" job (xfam#e) and am having problems with the "run" job (xfam#f).

Without the ensemble scripts the "run" job works fine.

To make things as simple as possible I have then set up the ensemble so that there is only one ensemble member, with an atmosphere dump identical to the initial run.

The ensemble script is:

CONTROL false

SPLIT factor1:1

MULTI_ADUMP factor1

When I submit this job the run falls over. The relevant section of the leave file, where the error is described, seems to be:

[1] >+ . allmods MODS_RECF /home/n02/n02/aschurer/umui_runs/xfamf-244145823/ensemble/1 /work/n02/n02/aschurer/tmp/tmp.nid00004.28603/member1/xfamf.uprecon

[1] >+ MOD_LIST=/work/n02/n02/aschurer/tmp/tmp.nid00004.28603/member1/modlist

[1] >+ MODSF=/work/n02/n02/aschurer/tmp/tmp.nid00004.28603/member1/xfamf.uprecon

[1] >+ export MODSF TEMP

[1] >+ test 3 -ne 3

[1] >+ cp /home/n02/n02/aschurer/umui_runs/xfamf-244145823/ensemble/1/MODS_RECF /work/n02/n02/aschurer/tmp/tmp.nid00004.28603/member1/modlist

[1] >cp: cannot stat `/home/n02/n02/aschurer/umui_runs/xfamf-244145823/ensemble/1/MODS_RECF': No such file or directory

[1] >+ CC=1

[1] >+ test 1 -ne 0

[1] >+ echo '*ERROR: Move of file failed' /home/n02/n02/aschurer/umui_runs/xfamf-244145823/ensemble/1/MODS_RECF '. Return code' 1

[1] >*ERROR: Move of file failed /home/n02/n02/aschurer/umui_runs/xfamf-244145823/ensemble/1/MODS_RECF . Return code 1

Could you give me an indication of what the problem might be?
Thanks,
Andrew

Change History (12)

comment:1 Changed 9 years ago by jeff

  • Owner changed from um_support to jeff
  • Status changed from new to accepted

Hi

You have compilation of the reconfiguration turned on in xfamf. This needs to be a run-only job, so turn it off.

Jeff.

comment:2 Changed 9 years ago by aschurer

Hi Jeff, thanks for your reply.

I have turned off the reconfiguration as you have suggested and the run does get further than before.

Unfortunately it runs into a problem at a slightly later stage. From the leave file:

[1] > NUPDATE 6.0 09/02/10 12:11:59
[1] >* MODIFICATION SUMMARY - DECK qsmain PLDATE 06/28/00 LASTID qsrunloadmodule
[1] >
[1] >
[1] >
* PROCESSING DECK qsmain
[1] >
[1] >
* CAUTION: OVERLAPPING MOD DELETES NEW TEXT
[1] >
[1] > * PROCESSING DECK qsmain
[1] >
[1] >* CAUTION: OVERLAPPING MOD DELETES NEW TEXT
[1] >
[1] >
* PROCESSING DECK qsmain
[1] >
[1] >
[1] > * CAUTION: OVERLAPPING MOD DELETES NEW TEXT
[1] >
* CAUTION: OVERLAPPING MOD DELETES NEW TEXT
[1] >* CAUTION: OVERLAPPING MOD DELETES NEW TEXT
[1] >
* ERROR: UNCLOSED IF AT END OF DECK qsmain
[1] > UD011 - 1 FATAL UPDATE ERRORS
[1] > UD021 - 5 OVERLAPPING MODIFICATIONS
[1] >+ CC=1
[1] >+ test 1 -ne 0
[1] >+ echo updscripts: Error in nupdate command
[1] >updscripts: Error in nupdate command
[1] >+ echo updscripts: Nupdate command was :-
[1] >updscripts: Nupdate command was :-
[1] >+ echo nupdate -p /work/n02/n02/hum/vn4.5/source/umsl -d MPP,LINUX -i /work/n02/n02/aschurer/tmp/tmp.nid00004.18329/member1/xfamf.updates -o dc,ed,um,in -D -m 2
[1] >+ 1>> /work/n02/n02/aschurer/umxfamf/W/ensemble/1/xfamf.out
[1] >+ exit 1
+ date
+ echo 'processes finished Thu Sep 2 12:11:59 BST 2010'
processes finished Thu Sep 2 12:11:59 BST 2010
+ [ -gt 0 -a -f ]
+ exit 0

Thanks,

Andrew

comment:3 Changed 9 years ago by jeff

Hi Andrew

The problem is that the script mod script_fix.mod clashes with gen_env.mod. To fix this, replace

$MODS_SCRIPTS/general/script_fix.mod

with

$UMCET_SCRIPTS/script_fix_umcet.mod

in the script mods umui panel.

Jeff.

comment:4 Changed 9 years ago by aschurer

Hi Jeff, Thanks again for your help.

I made the change you suggested in the UMUI and resubmitted the job. The job got further than last time but ran into further problems. It ran for all of its allotted time and then stopped without producing any climate outputs (or at least none that I can locate). I've looked in the leave file:

xfamf000.xfamf.d10249.t124014.leave

but cannot understand the nature of the problem. Have you any ideas?

Thanks,

Andrew

comment:5 Changed 9 years ago by jeff

Hi Andrew

Sorry not to get back to you sooner, I've been away. I don't have permission to look at files in your home directory, so I can't look at your .leave file. Running this command will give group read permission on your files:

chmod -R g+rX /home/n02/n02/aschurer

It would be a good idea to do this on /work too:

chmod -R g+rX /work/n02/n02/aschurer

Jeff.
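
The effect of g+rX can be checked on a scratch directory before applying it to a whole home directory. The sketch below is illustrative only: the directory and file names are made up, and perms is a hypothetical helper. It shows that the capital X adds group execute (search) permission to directories and to files that are already executable, while plain files only gain group read.

```bash
#!/usr/bin/env bash
# Illustrative sketch: demonstrate what "chmod -R g+rX" changes.
# All names below (runs, job.leave, script.sh) are made up for the demo.
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT

mkdir "$demo/runs"
touch "$demo/runs/job.leave" "$demo/runs/script.sh"

# Start from owner-only permissions.
chmod 700 "$demo" "$demo/runs" "$demo/runs/script.sh"
chmod 600 "$demo/runs/job.leave"

# The command from this ticket, applied to the scratch directory.
chmod -R g+rX "$demo"

# Hypothetical helper: print the ten-character mode string of a path.
perms() { ls -ld "$1" | cut -c1-10; }

for f in "$demo/runs" "$demo/runs/job.leave" "$demo/runs/script.sh"; do
  printf '%s  %s\n' "$(perms "$f")" "${f#"$demo"/}"
done
```

The directory and the executable script end up group readable and executable, and the leave file ends up group readable only, so group members can list directories and read files without any file being made newly executable.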

comment:6 Changed 9 years ago by aschurer

Hi Jeff,

You should now have read permission for these files.

Thanks,
Andrew

comment:7 Changed 9 years ago by jeff

Hi Andrew

It looks like the problem is that a file is in $DATAM but the UM tries to use it in $DATAW. I think this should be fixed now, so try your run again. Most people set DATAW and DATAM to the same directory, which is why this bug hasn't been found before.

Looking at your umui job, there might be a problem with how you specify the directory for the executable. You have

/home/n02/n02/aschurer/work/um/xfame/W

But UMCET will access this on the compute nodes, which cannot access /home, so this won't work. You should change this to

/work/n02/n02/aschurer/um/xfame/W

Jeff.
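
A path check of this kind can be done before submitting. The following is a minimal sketch, assuming only what is stated in this ticket (compute nodes can read /work but not /home). The function compute_node_visible is a hypothetical helper and is not part of UMCET or the UMUI.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of UMCET or the UMUI.
# Assumption taken from this ticket: compute nodes can see /work but not /home.
compute_node_visible() {
  case "$1" in
    /work/*) return 0 ;;   # path is written as a /work path
    *)       return 1 ;;   # anything else, including /home, is rejected
  esac
}

# The two executable directories quoted in this ticket.
for dir in /home/n02/n02/aschurer/work/um/xfame/W \
           /work/n02/n02/aschurer/um/xfame/W; do
  if compute_node_visible "$dir"; then
    echo "ok:  $dir"
  else
    echo "bad: $dir (not under /work)"
  fi
done
```

The check is deliberately textual. A path such as /home/.../work may well resolve to the same directory as the /work path on a login node, but a compute node would still have to go through /home to follow it, so only paths written as /work/... pass.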

comment:8 Changed 9 years ago by aschurer

Hi Jeff,

Thanks for the suggestions. The ensemble runs have now successfully completed an NRUN, with the correct outputs. I changed NRUN to CRUN in the SUBMIT file (on PUMA: /home/aschurer/umui_jobs/xfamf/ensemble/Ens_Control/SUBMIT) and pressed SUBMIT in the UMUI. The run started in the correct place and appears to have produced all the correct outputs. However, it stopped before completion and did not resubmit the job, so the run has stopped.

Looking at the leave file: xfamf000.xfamf.d10270.t222132.leave, the problem seems to be this:

[3] >+ /usr/bin/grep 'stoprun: Operator' /work/n02/n02/aschurer/tmp/tmp.nid00007.6025/member3/xfamf.errflag

[3] >+ 1> /work/n02/n02/aschurer/tmp/tmp.nid00007.6025/member3/xfamf.stopped

[3] >+ -s /work/n02/n02/aschurer/tmp/tmp.nid00007.6025/member3/xfamf.stopped ?

[3] >+ read PERCENT CURRENT_RQST_NAME CURRENT_RQST_ACTION CURRENT_⇒> PBS: job killed: walltime 20029 exceeded limit 20000

Thanks,
Andrew

comment:9 Changed 9 years ago by jeff

Hi Andrew

I don't have permission to read your xfamf .leave files. Can you fix this? Thanks.

Jeff.

comment:10 Changed 9 years ago by aschurer

Hi Jeff, I'm not sure why you didn't have permission to read these files. Anyway, I've run the command chmod -R g+rX /home/n02/n02/aschurer again, so hopefully this problem is now fixed.

comment:11 Changed 9 years ago by jeff

Hi Andrew

Sorry not to get back to you sooner, I've been busy with moving the UM to the XT6.

I've had a good look through your output file and I think I know what went wrong. There was nothing wrong with your job.

The way the ensemble script works is that all 4 (in your case) ensembles act as separate UM runs and go through the UM scripts as normal until they reach the aprun part. Here ensemble 1 runs the aprun script, which launches all 4 UM jobs, and it won't continue running the scripts until aprun returns after the UM finishes running. Ensembles 2, 3 and 4 go into a barrier and wait until ensemble 1 has finished the aprun command. Then all 4 ensembles continue through the UM scripts and finish.

What went wrong in your run is that ensemble 3 died (or was killed) while waiting in the barrier. This meant the UM scripts never completed for this ensemble, so the archiving script got into an infinite loop and ran until you ran out of wallclock time. I've no idea why this happened; hopefully it was a one-off. I see you have run this job again since you produced this output file, but I don't have permission to read these files, so I can't see whether these jobs worked or had the same problem. Can you let me know?

Jeff.
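
The leader-and-barrier pattern described above can be sketched with a marker file. This is an illustration only and not the actual UMCET implementation: the marker file, the function names and the short sleep standing in for the aprun launch are all assumptions. The bounded number of polls is also an addition for the sketch, included to show one way a waiting process can give up instead of looping forever.

```bash
#!/usr/bin/env bash
# Illustrative only: a file-based barrier, NOT the actual UMCET scripts.
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
flag="$workdir/launch.done"     # hypothetical marker file

# Members 2..N: poll until the leader signals that the launch has returned.
wait_at_barrier() {
  local member=$1 tries=0 max_tries=50
  while [ ! -f "$flag" ]; do
    tries=$((tries + 1))
    if [ "$tries" -gt "$max_tries" ]; then
      echo "member $member: gave up waiting" >&2
      return 1                  # bounded wait instead of an endless loop
    fi
    sleep 0.1
  done
  echo "member $member: released" >> "$workdir/log"
}

# Member 1: the sleep is a placeholder for the aprun launch of all members.
run_leader() {
  sleep 0.3
  echo "member 1: launch finished" >> "$workdir/log"
  touch "$flag"                 # release the waiting members
}

for m in 2 3 4; do wait_at_barrier "$m" & done
run_leader
wait                            # let every member finish its post-barrier step
cat "$workdir/log"
```

Because the leader writes its log line before creating the marker file, its line always appears first and the three waiting members are released only afterwards, in no particular order.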

comment:12 Changed 8 years ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed