Opened 12 years ago

Closed 10 years ago

#313 closed help (fixed)

UM 4.5.1 +MOSES2.1 experiment runs for a year then stalls

Reported by: sjh395 Owned by: lois
Component: UM Model Keywords:
Cc: Platform:
UM Version: 4.5


My experiment "xeapd" is derived from an experiment I ran on the Bristol QUEST cluster. I've used the mods from experiment "xctlc" and Annette's MOSES 2 mods. There are a couple of additional mods that remain from Bristol, specific to my palaeo work. I am using start dump files derived from an experiment run at Bristol. To maintain compatibility with my analysis software I haven't changed any output diagnostic settings.

Problem: The experiment initially runs OK (32-processor, 10-year job), creating monthly output as expected. After 1 model year the experiment hangs (it doesn't crash), partway through writing output.

It always seems to stall at the same point: if I restart the experiment, say, 6 months before the stall point, it still hangs in the same place.

Any suggestions as to what could be causing this? My files are readable on HECToR.



Change History (9)

comment:1 Changed 12 years ago by lois

  • Owner changed from um_support to lois
  • Status changed from new to assigned

We will have a look at this problem, Stephen, when HECToR is back from its maintenance session.

More news soon I hope.


comment:2 follow-up: Changed 12 years ago by lois

Hello Stephen

I have looked at your leave file on HECToR, and at first glance I think your job has simply run out of time. In the UMUI your job xeapd asked for 7200 seconds. I am not sure quite what performance is expected of this job; it may be that you expected better performance, which we can look at as well. In the meantime, could you repeat your run asking for more time, say 12 hours (43200 seconds), the maximum queue length on HECToR? If this is not the cause of your problems, please let us know.


comment:3 in reply to: ↑ 2 Changed 12 years ago by sjh395

Hi Lois,

I originally had the walltime set to 12 hours. I changed the time to 7200 seconds so that when the job did fail it wouldn't hang the processors for a full 12 hours; I should have mentioned this. The experiment stalls after a walltime of 1.25 hours.


comment:4 Changed 12 years ago by lois

Right, the next step is to copy your job and have a closer look.


comment:5 Changed 12 years ago by lois

Hello Steve,

I have been running your job (mine is xdyqh) and it looks as though it is just blowing up soon after year 1 (day 43).

To make your job run, and run a bit faster, I removed packing (known to be a problem on HECToR) and I removed a vast amount of the STASH, especially all the daily means! Your job then ran at about 6 minutes per model month, which is reasonable. When I ran it for 3 years I actually got some output and I can see it going off the rails. I extracted the summary from the .leave file and got this:

TS= 9588 YEAR= 1.11 DAY= 39.5 ENERGY= 4.066829E+01 DTEMP= 4.431733E-08 \
DSALT= 1.841314E-10 SCANS= 25
TS= 9600 YEAR= 1.11 DAY= 40.0 ENERGY= 3.891213E+01 DTEMP= 4.065999E-08 \
DSALT= 2.664982E-10 SCANS= 26
TS= 9612 YEAR= 1.11 DAY= 40.5 ENERGY= 6.149334E+01 DTEMP= 1.020985E-07 \
DSALT= 1.132111E-09 SCANS= 28
TS= 9624 YEAR= 1.11 DAY= 41.0 ENERGY= 4.488589E+01 DTEMP= 5.102149E-08 \
DSALT= 2.866441E-10 SCANS= 29
TS= 9636 YEAR= 1.12 DAY= 41.5 ENERGY= 4.190122E+01 DTEMP= 7.264986E-08 \
DSALT= 8.487271E-10 SCANS= 31
TS= 9648 YEAR= 1.12 DAY= 42.0 ENERGY= 1.216391E+02 DTEMP= 1.738290E-07 \
DSALT= 3.324580E-09 SCANS= 43
TS= 9660 YEAR= 1.12 DAY= 42.5 ENERGY= 1.054298E+03 DTEMP= 3.350111E-07 \
DSALT= 1.260757E-08 SCANS= 35
TS= 9672 YEAR= 1.12 DAY= 43.0 ENERGY= NaN DTEMP= NaN \
TS= 9684 YEAR= 1.12 DAY= 43.5 ENERGY= NaN DTEMP= NaN \
TS= 9696 YEAR= 1.12 DAY= 44.0 ENERGY= NaN DTEMP= NaN \

Having started from this:

TS= 12 YEAR= 0.00 DAY= 0.5 ENERGY= 1.876328E+00 DTEMP= 3.189666E-08 \
DSALT= 5.081278E-11 SCANS= 29
TS= 24 YEAR= 0.00 DAY= 1.0 ENERGY= 2.349751E+00 DTEMP= 2.602680E-08 \
DSALT= 3.824827E-11 SCANS= 33
TS= 36 YEAR= 0.00 DAY= 1.5 ENERGY= 2.560419E+00 DTEMP= 2.430526E-08 \
DSALT= 3.335675E-11 SCANS= 27

I don't know why it is failing; you will have to look at the model output and explore the different diagnostics to see if you can spot an issue. Do you have an equivalent run on the QUEST cluster you could compare with?
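For anyone chasing a similar blow-up, the point of failure can be pulled out of the .leave summary mechanically rather than by eye. The sketch below is not part of the UM itself; it simply parses summary lines of the format shown above (a couple of which are embedded here as sample data — in practice you would read the lines from your job's .leave file) and reports the first timestep whose ENERGY value is non-finite.

```python
import math

# Sample lines in the .leave summary format quoted above; in practice,
# replace this with the lines read from your job's own .leave file.
log = """\
TS= 9660 YEAR= 1.12 DAY= 42.5 ENERGY= 1.054298E+03 DTEMP= 3.350111E-07
TS= 9672 YEAR= 1.12 DAY= 43.0 ENERGY= NaN DTEMP= NaN
TS= 9684 YEAR= 1.12 DAY= 43.5 ENERGY= NaN DTEMP= NaN
"""

def first_bad_timestep(lines):
    """Return the TS value of the first line whose ENERGY is NaN or Inf."""
    for line in lines:
        fields = line.split()
        if "TS=" in fields and "ENERGY=" in fields:
            ts = int(fields[fields.index("TS=") + 1])
            energy = float(fields[fields.index("ENERGY=") + 1])
            if not math.isfinite(energy):
                return ts
    return None  # no non-finite ENERGY found

print(first_bad_timestep(log.splitlines()))  # -> 9672
```

With 12 timesteps per half-day (as in the excerpt above), the reported timestep converts directly to the model day at which the diagnostics first went bad.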

Let us know if we can help.


comment:6 Changed 12 years ago by sjh395


I do have a similar experiment on QUEST, so I will have a good look at the output. The only difference between the HECToR experiment and the one on QUEST is the mods that are applied, so I shall also have another good look at the QUEST mods to see if any are required.



comment:7 Changed 12 years ago by sjh395

I've been having a good look at the output, but it is proving difficult to work out what is going wrong. I think I may have found something, though. While the model fails on 8x4 processors, it seems (so far) to be OK on 8x1 (the same as on QUEST). Have you come across this behaviour before?



comment:8 Changed 12 years ago by lois

Hello Steve,

The basic answer to your question is no, we haven't seen this behaviour before, because when we test with our standard jobs we make sure there is bit compatibility between restarts and between different numbers of processors. But of course other (non-standard) jobs may have had similar problems to yours. You are running a rather non-standard job, not one we have ever tested, and therefore any one of the 'extras' in your job could be the cause of the problem. Getting to the bottom of these problems is really challenging and labour-intensive. You have some choices as to what to do next: you could accept the problem and run your control and changed experiments on the same 8x1 configuration, you could tackle the bit-compatibility issue yourself, or we could look at the problem, but that could take us weeks or months!
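A short illustration of why changing the processor decomposition can change results at all, for readers unfamiliar with the bit-compatibility issue: floating-point addition is not associative, so a global sum accumulated in a different order on a different decomposition can differ in the last bits, and in a chaotic model those last bits grow over thousands of timesteps. The values below are contrived to make the effect visible; they are not taken from the UM.

```python
# Floating-point addition is not associative: summing the same values
# in a different order (as a different processor decomposition would)
# can give a different result.
vals = [1.0e16, 1.0, -1.0e16, 1.0]

# "1 processor": strict left-to-right accumulation.
serial = 0.0
for v in vals:
    serial += v          # the +1.0 is absorbed by the 1.0e16

# "2 processors": each rank sums its own elements, then the partial
# sums are combined (as in a parallel reduction).
partial_a = vals[0] + vals[2]   # 1.0e16 + -1.0e16 = 0.0
partial_b = vals[1] + vals[3]   # 1.0 + 1.0 = 2.0
parallel = partial_a + partial_b

print(serial, parallel)   # -> 1.0 2.0: same data, different answers
```

A bit-compatible model avoids this by fixing the reduction order regardless of the decomposition; a job that lacks that guarantee can legitimately diverge between 8x1 and 8x4, and any latent bug may only be triggered on one of them.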



comment:9 Changed 10 years ago by lois

  • Resolution set to fixed
  • Status changed from assigned to closed