Opened 10 years ago

Closed 10 years ago

#535 closed help (fixed)

Problems running HadCM3 on phase2a

Reported by: aschurer Owned by: lois
Component: UM Model Keywords:
Cc: Platform:
UM Version: 4.5

Description

Hi,
I have been running several HadCM3 experiments on both phase2a and phase2b. These have been running for several hundred model years so the initial setup seems to be OK. Over the last few days several of the runs have failed.

Last night, four almost identical runs (xfic #a, #b, #c, #d) failed on phase2a with the following error in the leave files (xfica059.xfica.d10314.t185131.leave, xficb060.xficb.d10314.t205616.leave, xficc058.xficc.d10314.t180917.leave, xficd059.xficd.d10314.t185057.leave):

xficc: Starting run
address space limit (kbytes) (-M) unlimited
core file size (blocks) (-c) unlimited
cpu time (seconds) (-t) unlimited
data size (kbytes) (-d) unlimited
file size (blocks) (-f) unlimited
locks (-L) unlimited
locked address space (kbytes) (-l) 512
nofile (-n) 1024
nproc (-u) 65536
pipe buffer size (bytes) (-p) 4096
resident set size (kbytes) (-m) unlimited
socket buffer size (bytes) (-b) 4096
stack size (kbytes) (-s) 8192
threads (-T) not supported
process size (kbytes) (-v) unlimited

aprun: -S must be a positive nonzero integer
aprun: Exiting due to errors. Application aborted
xficc: Run failed

Two nights ago (09/11), two runs, xfhx#l and xfhv#e, failed on phase2b at what appears to be exactly the same time, 20:38. Both show the same error in the leave files, although it differs from the error above (xfhve086.xfhve.d10313.t182934.leave, xfhxl002.xfhxl.d10313.t165240.leave):

xfhve: Starting run
address space limit (kbytes) (-M) 6495440
core file size (blocks) (-c) unlimited
cpu time (seconds) (-t) unlimited
data size (kbytes) (-d) unlimited
file size (blocks) (-f) unlimited
locks (-L) unlimited
locked address space (kbytes) (-l) 64
nofile (-n) 1024
nproc (-u) 63372
pipe buffer size (bytes) (-p) 4096
resident set size (kbytes) (-m) unlimited
socket buffer size (bytes) (-b) 4096
stack size (kbytes) (-s) 8192
threads (-T) not supported
process size (kbytes) (-v) unlimited

SETPOS: Seek Failed: Input/output error
_pmii_daemon(SIGCHLD): PE 0 exit signal Aborted
[NID 02306] 2010-11-09 20:33:45 Apid 246504: initiated application termination
qsmain: Copying /work/n02/n02/aschurer/umxfhve/W/xfhve.thist to backup thist file /work/n02/n02/aschurer/umxfhve/W/xfhve.thist_keep
xfhve: Run failed

Is it safe to assume that these errors are a result of the filesystem migration, and that resubmitting the runs won't lead to further failures? Or do I need to make any changes to my experiment setup?

Thanks,
Andrew

Change History (2)

comment:1 Changed 10 years ago by lois

  • Owner changed from um_support to lois
  • Status changed from new to accepted

Hello Andrew,

We are working on instructions for running from the new joint esf /work system but have not released them yet, as HECToR are still sorting out a few issues from yesterday's last move in this lengthy process. We are still in the

If you compiled your jobs on phase 2a and are now running on phase 2a, then you need to process them in the UMUI to pick up the necessary changes (you don't need to re-compile, and CRUNs will have to be set as usual). These jobs should then run; however, the archive seems to have problems and HECToR are investigating this now.

We would not recommend running phase 2b compiled jobs on phase 2a and HECToR don't recommend running phase 2a compiled jobs on phase 2b!

But the basic message is: if you are experiencing these sorts of problems, re-process your job in the UMUI to pick up the changes needed to cope with the mix of phase 2a and phase 2b.
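
To check which runs hit which of the two failures quoted in this ticket, something like the following Python sketch could be used. It is only an illustration, not part of the UM or HECToR tooling: the function names and signature labels are made up here, and the two search strings are copied from the leave-file excerpts above. It assumes the leave files sit in one directory and end in ".leave".

```python
import re
from pathlib import Path

# Failure messages copied from the leave-file excerpts in this ticket.
# The dictionary keys are hypothetical labels chosen for this sketch.
SIGNATURES = {
    "aprun_bad_S": re.compile(r"aprun: -S must be a positive nonzero integer"),
    "setpos_io_error": re.compile(r"SETPOS: Seek Failed: Input/output error"),
}


def classify_leave_text(text):
    """Return the sorted labels of the known failure messages found in text."""
    return sorted(name for name, pattern in SIGNATURES.items() if pattern.search(text))


def scan_leave_files(directory):
    """Map each *.leave file name in directory to the labels found in it."""
    results = {}
    for path in sorted(Path(directory).glob("*.leave")):
        # errors="replace" keeps the scan going if a file has odd bytes in it.
        text = path.read_text(errors="replace")
        results[path.name] = classify_leave_text(text)
    return results
```

Calling scan_leave_files on the directory holding the leave files returns a dictionary keyed by file name. A file with an empty list failed for some other reason, or did not fail at all.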

Let us know if there are still problems.

Lois

comment:2 Changed 10 years ago by lois

  • Resolution set to fixed
  • Status changed from accepted to closed