Opened 3 years ago

Closed 3 years ago

#2062 closed help (fixed)

UM 6.6.3 jobs hanging

Reported by: jscreen
Owned by: willie
Component: UM Model
Keywords:
Cc:
Platform: ARCHER
UM Version: 6.6.3

Description

Hi

I'm trying to run HadGEM2-ES (6.6.3). I have a number of so-called test runs that derive from different existing jobs:

xnbgc is modified from a Met Office CMIP5 RCP8.5 simulation
xnbgd is modified from a Met Office CMIP5 historical simulation
xnbge and xnbgf are modified from Grenville's HadGEM2-ES run (xgaja)

These runs have different ancillaries and STASH (amongst other things), but they all suffer from a common problem. The jobs submit OK and the reconfiguration proceeds fine. The jobs appear to run but output no data (beyond the initial creation of the first set of output files). I've played around with the job length and dumping frequency, and from what I can tell the jobs aren't even completing 1 day (despite running for up to 5 hours). Eventually the jobs crash when they hit the walltime limit. The .leave files don't contain anything obvious that points to the problem, but the fact that the same thing is happening for all four jobs must mean something (I'm just not sure what!).

James

Change History (8)

comment:1 Changed 3 years ago by willie

Hi James,

You're getting

lib-4324 : UNRECOVERABLE library error 
  The variable name '2050,2051,2052,2053,' is unrecognized in namelist input.

So it looks like you've extended a namelist array somewhere and possibly added the values in the wrong place?
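As a rough illustration of how one might spot this kind of message in job output, here is a minimal Python sketch. The function names and the sample lines are hypothetical; the pattern is built only from the error text quoted above, and the wording may differ for other compilers or library versions.

```python
import re

# Pattern built from the message quoted in this ticket. Other compilers or
# runtime library versions may word it differently (assumption).
UNRECOGNISED = re.compile(
    r"The variable name '(?P<name>[^']*)' is unrecognized in namelist input"
)


def find_namelist_errors(lines):
    """Return (line_number, offending_text) pairs for matching lines."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = UNRECOGNISED.search(line)
        if match:
            hits.append((number, match.group("name")))
    return hits


def looks_like_values(text):
    """True if the reported 'variable name' is a run of comma-separated integers."""
    return re.fullmatch(r"\s*(\d+\s*,\s*)*\d+\s*,?\s*", text) is not None


# Sample input copied from the error in this ticket; in practice the lines
# would be read from the job's .leave file.
sample = [
    "lib-4324 : UNRECOVERABLE library error",
    "  The variable name '2050,2051,2052,2053,' is unrecognized in namelist input.",
]

for number, name in find_namelist_errors(sample):
    kind = "values parsed as a name" if looks_like_values(name) else "unknown name"
    print(number, name, kind)
```

When the reported "name" is just a list of numbers, as here, that is consistent with array values having ended up where the reader expected a new variable assignment.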

Regards
Willie

comment:2 Changed 3 years ago by grenville

James

xgaja is a HECToR job (that shouldn't make much difference); xgada is the standard HadGEM2-ES job. It also ran on HECToR, but setting the machine to login.archer.ac.uk definitely works.

Willie appears to have found the difference since I started writing.

Grenville

comment:3 Changed 3 years ago by jscreen

Willie

Argh yes, that error arose when I fiddled with something or other (can't remember quite what) while trying to solve the hanging issue. I don't think that is the cause of the common hanging problem. I haven't seen that message for either xnbge or xnbgf, which are also hanging.

Please could you look at xnbge and xnbgf to diagnose the problem.

Thanks, James

comment:4 Changed 3 years ago by jscreen

If it helps, the job xnbgd is "running" now and appears to be hanging as we speak.

comment:5 Changed 3 years ago by willie

Hi James,

This is a problem with the processor configuration. Your job xnbge has 16x12 for both the model and the reconfiguration. If you revert to 12 EW x 8 NS for the model and 8x8 for the reconfiguration, it should work. You also need to "override year in dump with year in model" for both the atmosphere and ocean start dumps.
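For a sense of scale, the arithmetic behind these decompositions can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, and the figure of 24 cores per node is an assumption about ARCHER that should be checked for your platform. It does not explain why the 16x12 configuration hung, which the ticket does not state.

```python
import math

# Assumption: 24 cores per compute node on ARCHER; adjust for your platform.
CORES_PER_NODE = 24


def mpi_tasks(ew, ns):
    """Number of processes for an EW x NS domain decomposition."""
    return ew * ns


def nodes_needed(tasks, cores_per_node=CORES_PER_NODE):
    """Whole nodes required when each process occupies one core."""
    return math.ceil(tasks / cores_per_node)


# Configurations mentioned in this ticket. The ticket gives EW/NS explicitly
# only for 12 x 8; the product is the same whichever way round the others are.
configs = {
    "model, original (16 x 12)": (16, 12),
    "model, suggested (12 x 8)": (12, 8),
    "reconfiguration, suggested (8 x 8)": (8, 8),
}

for label, (ew, ns) in configs.items():
    tasks = mpi_tasks(ew, ns)
    print(f"{label}: {tasks} processes, {nodes_needed(tasks)} nodes")
```

Under that assumption the suggested model decomposition uses half as many processes as the original one (96 rather than 192).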

Regards
Willie


comment:6 Changed 3 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

comment:7 Changed 3 years ago by jscreen

Thanks Willie, they are running fine now.

I'm sure I've run with that processor configuration before (but maybe only for an atmosphere-only job). At least it was a simple problem to fix!

James

comment:8 Changed 3 years ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed