Opened 7 years ago

Closed 7 years ago

#904 closed error (fixed)

HadGEM2-ES crash

Reported by: aschurer
Owned by: um_support
Component: UM Model
Keywords: MONSooN, HadGEM2, crash
Cc:
Platform:
UM Version: 6.6.3

Description

Hi, my model (xhinc) has been running on the new phase of MONSooN with automatically archived output and has completed about 8 model years. It has now crashed.

The only error I can see in the leave file
/home/aschur/output/xhinc010.xhinc.d12252.t000658.archive
is:

Use of uninitialized value $JOBDIR in concatenation (.) or string at
/projects/lastmil/aschur/um/xhinc/bin/qsmoose line 135.

This would suggest that there is an error with the archiving.

However, I have noticed this error in previous leave files for runs which seem to have completed successfully with archived output, so perhaps this isn't what is causing the crash. If so, I have no idea what the problem is.

Any help would be appreciated,

thanks Andrew

Change History (6)

comment:1 Changed 7 years ago by willie

Hi Andrew,

There are no files /home/aschur/output/xhinc* on MONSooN. The leave file will help us to analyse the situation.

Regards

Willie

comment:2 Changed 7 years ago by aschurer

Hi Willie,
Which phase are you looking at? In phase 2 (ibm02) there are several xhinc leave files, including the final one before the crash:
/home/aschur/output/xhinc010.xhinc.d12252.t000658.archive
Regards,
Andrew

comment:3 Changed 7 years ago by willie

Hi Andrew,

Thanks. I found them. I'm not sure what you mean by "crash". All 64 processors exited normally at 00:52 and then the job finished without error at 01:30.

Regards

Willie

comment:4 Changed 7 years ago by aschurer

Hi Willie,

Sorry, perhaps "crash" is the wrong word. I could not find any obvious error, but as far as I can tell the model has not produced any output from this job and has not been resubmitted. The run has stopped prematurely (it should be running for another 38 model years).

Any idea why this might have happened?

Thanks,
Andrew

comment:5 Changed 7 years ago by willie

Hi Andrew,

It looks like the last resubmit ran out of time: it took 5,000 seconds and you've only allocated 5,000 seconds. I think you should increase this to, say, 7,200 seconds, and then it should work.
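
As a rough illustration of the arithmetic behind this suggestion, here is a minimal Python sketch. The function names `hit_time_limit` and `headroom_fraction` are hypothetical helpers invented for this example and are not part of the UM or the MONSooN job scripts. The only figures used are the ones quoted above (5,000 seconds used, 5,000 seconds allocated, 7,200 seconds proposed).

```python
def hit_time_limit(elapsed_s, limit_s):
    """Return True when a job used all (or more than) its allocated time."""
    return elapsed_s >= limit_s


def headroom_fraction(new_limit_s, elapsed_s):
    """Spare time under a new limit, as a fraction of the observed run time."""
    return (new_limit_s - elapsed_s) / elapsed_s


# Figures quoted in this ticket; the helpers above are illustrative only.
elapsed = 5000      # seconds the last resubmit took
old_limit = 5000    # seconds originally allocated
new_limit = 7200    # seconds suggested as the new allocation

print(hit_time_limit(elapsed, old_limit))   # limit reached with the old allocation
print(hit_time_limit(elapsed, new_limit))   # limit not reached with the new one
print(f"{headroom_fraction(new_limit, elapsed):.0%}")  # spare time relative to the run
```

With these numbers the new allocation leaves 2,200 seconds spare, which is 44% of the observed run time.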

Regards,

Willie

comment:6 Changed 7 years ago by willie

  • Resolution set to fixed
  • Status changed from new to closed