Opened 6 years ago

Closed 6 years ago

#1443 closed help (fixed)

8.4 Run Crash Problem

Reported by: pliojop
Owned by: grenville
Component: UM Model
Keywords: memory, cce, hadgem3
Cc:
Platform: ARCHER
UM Version: 8.4



I have a version 8.4 job (xkmwb) running on ARCHER. It is a copy of Grenville's ARCHER test job, xgwtq (HadGEM3 atmosphere only with GA4). At present it runs for about 5 hours (about 6 months of model time) before crashing with the error:

UM Executable : /work/n02/n02/japope/um/xkmwb/bin/xkmwb.exe

mkdir:: File exists
[NID 04011] 2015-01-19 16:37:25 Apid 12632862: initiated application termination
[NID 04011] 2015-01-19 16:37:28 Apid 12632862: OOM killer terminated this process.
xkmwb: Run failed

I had already encountered this error and, based on a review of other tickets with OOM errors, had changed the job from running on 6 NS processors to 12. However, the error persisted.

There is another error that recurs in the leave file and may be related:


Conservation enforcement failed
Run continuing using best estimate


non-conservation for field 4

I am not sure whether the two are linked in any way.
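
For anyone checking a leave file for the same symptoms, a minimal sketch follows. It assumes the leave file is plain text and that the two messages appear exactly as quoted above; `scan_leave` is a hypothetical helper name, not part of the UM scripts.

```shell
# Hypothetical helper (not part of the UM): count OOM kills and
# conservation failures reported in a leave file.
# Usage: scan_leave /path/to/leave_file
scan_leave() {
    leave_file="$1"
    # grep -c prints the number of matching lines; it exits non-zero
    # when there are none, so "|| true" keeps this safe under "set -e".
    oom=$(grep -c 'OOM killer terminated' "$leave_file" || true)
    cons=$(grep -c 'Conservation enforcement failed' "$leave_file" || true)
    echo "OOM kills: $oom"
    echo "Conservation failures: $cons"
}
```

Comparing where the conservation messages fall relative to the OOM line may help show whether the two are connected, though the counts alone do not establish that.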

The leave file for this run is




Change History (10)

comment:1 Changed 6 years ago by grenville


Please give us read permission in your space by typing the following at the ARCHER command line:

chmod -R g+rX /home/n02/n02/japope
chmod -R g+rX /work/n02/n02/japope
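
To confirm the change has taken effect, a check along these lines could be used. This is only a sketch: `check_group_read` is a hypothetical helper name, and it inspects the permission bits only (group read on everything, group search on directories), which is what `g+rX` is meant to grant here.

```shell
# Hypothetical helper: print anything under the given directory that
# still lacks group read (040), or any directory lacking group
# search/execute (010). No output means the chmod covered everything.
check_group_read() {
    find "$1" \( ! -perm -040 -o -type d ! -perm -010 \) -print
}
```

For example, `check_group_read /work/n02/n02/japope` should print nothing once the `chmod` commands above have completed.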


comment:2 Changed 6 years ago by pliojop

Hi Grenville,



comment:3 Changed 6 years ago by grenville


We had some problems with OOM errors with an 8.2 model built with cce8.2.1. A rebuild of the model with cce8.3.3 resolved that problem. I am testing your model under cce8.3.3. I'll let you know asap.


comment:4 Changed 6 years ago by grenville


Please try running your model with my executable. I have tested that it runs OK for a short time; perhaps you could do the long run.

Go to Model Selection → Compilation and run options → Compile and run options for Atmosphere… and set:

Directory for the Model executable: /work/n02/n02/grenvill/um/xjfvd/bin
Filename for the Model executable: xjfvd.exe

This is just your model built with cce8.3.3 and linked against a GCOM built the same way. If this solves the OOM problem, we'll address the compiler issue more fully.


comment:5 Changed 6 years ago by pliojop

Hi Grenville,

I have resubmitted the job and will update you on how it runs.



comment:6 Changed 6 years ago by pliojop

Hi Grenville,

The job has run overnight and continues to run, well past the previous crash point.

Many thanks for your help. For future reference, where do I make the change in my copies of the jobs to ensure that cce8.3.3 is used every time?


comment:7 Changed 6 years ago by annette

  • Owner changed from um_support to grenville
  • Status changed from new to assigned

comment:8 Changed 6 years ago by annette

  • Keywords memory, cce, hadgem3 added

comment:9 Changed 6 years ago by grenville


You can build your own model using Cray cce8.3.3 with the following:

Please look at xjfve for the changes you need.

You only need to make 2 changes: navigate to model selection → compilation and run options → UM user override files

enter the gcom_path variable as /work/n02/n02/wmcginty/gcom4.5

navigate to model selection → input/output control.. → user hand edits

and add


Then rebuild the model. This is not the ideal solution - we are working on that.


comment:10 Changed 6 years ago by annette

  • Resolution set to fixed
  • Status changed from assigned to closed