Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#937 closed help (fixed)

Job submitted on Hector in the wrong queue

Reported by: dh023729
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform:
UM Version: 6.6.3

Description

Hi,

I am trying to run HadGEM2-ES on 16x8 (128) CPUs on Hector. I submitted the job with a walltime of 12 hours, so I expected it to go into queue 4n_12h. My job ID is 936986.sdb and my username is dh023729. I submitted it last night, but when I checked this morning the model was running very slowly, and 'qstat -u dh023729' showed the following:

dh023729@hector-xe6-7:~/um/umui_out> qstat -u dh023729

sdb:
                                                            Req'd   Req'd     Elap
Job ID      Username  Queue     Jobname   SessID  NDS  TSK  Memory  Time   S  Time
936986.sdb  dh023729  par:4n_1  xhgze000  15985   1    1    --      12:00  R  07:16

Apparently the job is in queue 4n_1h but, confusingly, it has already run for over seven hours.

I also tried to submit a job on 16x16 (256) CPUs, but it failed.
The .leave file is at:

/home/n02/n02/dh023729/um/umui_out/xhgzf000.xhgzf.d12291.t225501.leave

I tried 8x8 (64) CPUs as well, and that worked fine in queue 4n_12h.

Any ideas? Thanks,
Liang

Change History (6)

comment:1 Changed 7 years ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed

Liang

It's just a format issue: the queue name is cut short in the 'qstat -u' listing. The plain listing shows it in full:

qstat |grep dh023729
936986.sdb xhgze000 dh023729 00:00:02 R par:4n_12h
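
A minimal sketch of the truncation, assuming the Queue column in the 'qstat -u' listing is 8 characters wide (an assumption inferred from the "par:4n_1" shown above, not taken from the scheduler documentation). The helper name displayed_queue is hypothetical:

```python
# Assumption: the Queue column in `qstat -u` output is 8 characters wide,
# which is consistent with the "par:4n_1" shown in this ticket.
QUEUE_COLUMN_WIDTH = 8


def displayed_queue(full_name: str, width: int = QUEUE_COLUMN_WIDTH) -> str:
    """Return the queue name as a fixed-width column would show it."""
    return full_name[:width]


full_queue = "par:4n_12h"
shown = displayed_queue(full_queue)
print(f"{full_queue!r} is displayed as {shown!r}")
```

Under that assumption the 12-hour queue and any other queue sharing the same first 8 characters look identical in the listing, so the truncated name on its own does not say which queue the job is in.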

The ocean model limits the number of processors HadGEM2-ES can run on; 256 is too many for N96.

Grenville

comment:2 Changed 7 years ago by dh023729

Thanks, Grenville.

Could this also be why HadGEM2-ES runs so slowly on 128 CPUs? (The same job on 64 CPUs finishes in 2 hours, which is significantly shorter.)

Cheers,
Liang

comment:3 Changed 7 years ago by grenville

Liang

I've looked at how we ran the model for testing, and we only ever ran with 96 processors. It would be interesting to see where the time is being used in your run (if you have the timers on), but it's probably not a good idea to run with that decomposition.

Grenville

comment:4 Changed 7 years ago by dh023729

Hi Grenville,

I don't think I have the timers on. Judging by the creation times of the dumps, a model month takes about 1.5 hours on 12x8 (96) and 2 hours 10 minutes on 8x8 (64). However, after 9 hours 40 minutes, the job on 16x8 (128) still had not finished a month.
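
To put those timings on a common footing, here is a rough sketch that converts them to core-hours per model month and compares the 96-processor run with the 64-processor one. The timings are the ones quoted above; the names runs and core_hours are only illustrative, and the 128-processor figure is a lower bound because that run never completed a month:

```python
# Wall-clock hours per model month, taken from the figures in this ticket.
runs = {
    "8x8": (64, 2 + 10 / 60),   # 2 hours 10 minutes
    "12x8": (96, 1.5),          # about 1.5 hours
}


def core_hours(processors: int, hours: float) -> float:
    """Processor count multiplied by wall-clock hours."""
    return processors * hours


base_procs, base_hours = runs["8x8"]
for name, (procs, hours) in runs.items():
    speedup = base_hours / hours          # measured speedup vs 8x8
    ideal = procs / base_procs            # speedup if scaling were perfect
    print(f"{name}: {core_hours(procs, hours):.1f} core-hours/month, "
          f"speedup {speedup:.2f} (ideal {ideal:.2f})")

# 16x8 had not finished a month after 9 hours 40 minutes: lower bound only.
lower_bound_128 = core_hours(128, 9 + 40 / 60)
print(f"16x8: more than {lower_bound_128:.0f} core-hours/month")
```

On these numbers the 64- and 96-processor runs cost roughly the same (about 139 and 144 core-hours per month), while the 128-processor run had already used more than 1,200 core-hours without completing a month.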

Cheers,
Liang

comment:5 Changed 7 years ago by grenville

Liang

Probably not worth letting the 128-processor job finish if you'll not get any useful information out.

Grenville

comment:6 Changed 7 years ago by dh023729

Thanks, Grenville.

I've just killed it. Based on what I've seen so far, I think I will stick with 96 processors for my experiment.

Best,
Liang
