Opened 9 years ago

Closed 9 years ago

#388 closed help (fixed)

UM model hogs entire node on MONSooN

Reported by: lherma
Owned by: um_support
Component: MONSooN
Keywords: node hogging, single processor
Cc:
Platform:
UM Version: 4.7

Description

My current version of HadCM3 is a single-processor job, but on MONSooN each job takes a whole node even though each node has 32 processors.

I am aware of the UMCET framework, but each of my ensemble members has a different startdump that needs reconfiguration, so I don't think I can use it.

I have also tried using
#@ node_usage = shared
but it didn't help.

Let me know if you have any bright ideas!

Change History (5)

comment:1 Changed 9 years ago by jeff

Why don't you run your job on 32 processors?

Jeff.

comment:2 Changed 9 years ago by lherma

I had considered that. However, the speed-up is poor when you run HadCM3 on 32 processors. So it seems to me that to run many large ensembles it is more efficient to run many jobs at the same time on fewer processors.

I am testing a 32-processor job now. If there is no other ready solution, then I will probably use that option. Do you have any other ideas?
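
The throughput argument above can be sketched with a small calculation. This is only an illustration: the 10-hour serial run time and the speed-up of 8 on 32 processors are hypothetical placeholders, not measured HadCM3 figures, and `members_per_hour` is a name made up for this example.

```python
def members_per_hour(cores_per_node, cores_per_job, hours_serial, speedup):
    """Ensemble members completed per node per hour.

    Assumes jobs pack perfectly onto the node and do not slow each other down.
    """
    jobs_at_once = cores_per_node // cores_per_job
    hours_per_job = hours_serial / speedup
    return jobs_at_once / hours_per_job


# Hypothetical numbers, for illustration only.
serial = members_per_hour(32, 1, hours_serial=10.0, speedup=1.0)
parallel = members_per_hour(32, 32, hours_serial=10.0, speedup=8.0)

print(f"32 single-processor jobs per node: {serial:.2f} members/hour")
print(f"one 32-processor job per node:     {parallel:.2f} members/hour")
```

Under these assumptions, packing single-processor jobs onto a shared node gives higher ensemble throughput whenever the 32-processor speed-up is below 32.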

comment:3 Changed 9 years ago by lherma

Update: Using a 32-processor job is not an option. It runs well for a while (I think it completed the whole NRUN), but in the CRUN half the jobs fail with a segmentation fault without any indication of why (see xeqgd000.xeqgd.d10047.t140246.leave in the output directory).

I have never changed the number of processors on the fly before. Do I need to recompile? I have tried 24 processors, but the same thing happens. I have also lengthened the target run length without any improvement.

comment:4 Changed 9 years ago by lherma

Update 2:
#@ node_usage = shared
This directive does work. I just hadn't noticed that there was another set of LoadLeveler directives sneakily hidden further down in the SUBMIT script, where it was set to not_shared.

Anyway. You can close this now. Thanks!
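
For anyone hitting the same problem, the check described above (looking for a second, conflicting directive further down the script) can be sketched roughly as follows. This is a minimal sketch, not part of the UM scripts: `find_conflicts` and the sample text are made up for illustration, and the pattern only assumes directives of the form `#@ keyword = value`.

```python
import re
from collections import defaultdict

# Matches LoadLeveler-style directive lines such as "#@ node_usage = shared".
DIRECTIVE = re.compile(r"^\s*#\s*@\s*(\w+)\s*=\s*(.+?)\s*$")


def find_conflicts(text):
    """Return {keyword: [(line_no, value), ...]} for keywords set to more than one value."""
    seen = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        match = DIRECTIVE.match(line)
        if match:
            seen[match.group(1).lower()].append((line_no, match.group(2)))
    return {
        key: hits
        for key, hits in seen.items()
        if len({value for _, value in hits}) > 1
    }


# Hypothetical script contents, for illustration only.
sample = """\
#@ job_type = serial
#@ node_usage = shared
echo "some script logic"
#@ node_usage = not_shared
#@ queue
"""

conflicts = find_conflicts(sample)
for key, hits in conflicts.items():
    print(key, hits)
```

To check a real script, read the file and pass its contents to `find_conflicts`. Any keyword it reports is set to different values in more than one place.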

comment:5 Changed 9 years ago by jeff

  • Resolution set to fixed
  • Status changed from new to closed