Opened 4 years ago

Closed 4 years ago

#1443 closed help (fixed)

8.4 Run Crash Problem

Reported by: pliojop
Owned by: grenville
Component: UM Model
Keywords: memory, cce, hadgem3
Cc:
Platform: ARCHER
UM Version: 8.4

Description

Hi,

I have a version 8.4 job (xkmwb) running on ARCHER. It is a copy of Grenville's ARCHER test job, xgwtq (HadGEM3 atmosphere only with GA4). At present it runs for about 5 hours (about 6 months of model time) before crashing with the error:


UM Executable : /work/n02/n02/japope/um/xkmwb/bin/xkmwb.exe
*

mkdir:: File exists
[NID 04011] 2015-01-19 16:37:25 Apid 12632862: initiated application termination
[NID 04011] 2015-01-19 16:37:28 Apid 12632862: OOM killer terminated this process.
xkmwb: Run failed


I had already encountered this error and, after reviewing other tickets with OOM errors, changed the job from 6 NS processors to 12. However, the error persisted.

There is also a recurring warning, which may be linked to the crash:


WARNING *

Conservation enforcement failed
Run continuing using best estimate

WARNING *

non-conservation for field 4


in the leave file. I'm not sure whether they are linked in any way.

The leave file for this run is

/home/n02/n02/japope/output/xkmwb000.xkmwb.d15019.t113751.leave
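
For anyone triaging a similar leave file, a minimal Python sketch along these lines could count how often the messages quoted above occur. The patterns are copied from the text in this ticket; the function name `summarise_leave_text` is hypothetical, and other UM versions may word these messages differently.

```python
import re

# Patterns copied from the messages quoted in this ticket (an assumption:
# other UM or ARCHER versions may phrase them differently).
PATTERNS = {
    "oom_killer": re.compile(r"OOM killer terminated this process"),
    "conservation_failed": re.compile(r"Conservation enforcement failed"),
    "non_conservation": re.compile(r"non-conservation for field\s+\d+"),
}


def summarise_leave_text(text):
    """Return a dict mapping each pattern name to the number of matching lines."""
    counts = {name: 0 for name in PATTERNS}
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

To use it, read the leave file into a string (for example with `open(path, errors="replace").read()`) and pass it to `summarise_leave_text`. Counting alone does not show whether the conservation warnings and the OOM failure are related; it only shows how often each appears.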

Thanks

James

Change History (10)

comment:1 Changed 4 years ago by grenville

James

Please let us have read permission in your space - please type at the ARCHER command line:

chmod -R g+rX /home/n02/n02/japope
chmod -R g+rX /work/n02/n02/japope
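
To confirm that the commands above have taken effect, a rough Python sketch such as the following could list anything under a directory that the group still cannot read. The function name `paths_missing_group_access` is hypothetical. It checks only the group permission bits and does not account for ACLs or for permissions on parent directories above the root.

```python
import os
import stat


def paths_missing_group_access(root):
    """List paths under root lacking group read (and, for directories, execute)."""
    missing = []
    for dirpath, _dirnames, filenames in os.walk(root):
        mode = os.stat(dirpath).st_mode
        # A directory needs group read (to list) and execute (to traverse).
        if not (mode & stat.S_IRGRP and mode & stat.S_IXGRP):
            missing.append(dirpath)
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Files only need group read; lstat avoids following symlinks.
            if not os.lstat(path).st_mode & stat.S_IRGRP:
                missing.append(path)
    return missing
```

An empty list from `paths_missing_group_access("/home/n02/n02/japope")` would suggest the group bits are set as intended.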

Grenville

comment:2 Changed 4 years ago by pliojop

Hi Grenville,

Done,

James

comment:3 Changed 4 years ago by grenville

James

We had some problems with OOM errors with an 8.2 model built with cce8.2.1. A rebuild of the model with cce8.3.3 resolved that problem. I am testing your model under cce8.3.3. I'll let you know asap.

Grenville

comment:4 Changed 4 years ago by grenville

James

Please try running your model with my executable. I have tested that it runs OK for a short time; perhaps you could do the long run.

Go to Model Selection → Compilation and run options → Compile and run options for Atmosphere… and set:

Directory for the Model executable: /work/n02/n02/grenvill/um/xjfvd/bin
Filename for the Model executable: xjfvd.exe

This is just your model built with cce8.3.3 and linked against a GCOM built the same way. If this solves the OOM problem, we'll address the compiler issue more fully.

Grenville

comment:5 Changed 4 years ago by pliojop

Hi Grenville,

I have resubmitted the job. I'll update you with how it runs.

Thanks

James

comment:6 Changed 4 years ago by pliojop

Hi Grenville,

The job has run overnight and continues to run, well past the previous crash point.

Many thanks for your help. For future reference, where do I make the change in my copies of the jobs to ensure that cce8.3.3 is used every time?

James

comment:7 Changed 4 years ago by annette

  • Owner changed from um_support to grenville
  • Status changed from new to assigned

comment:8 Changed 4 years ago by annette

  • Keywords memory, cce, hadgem3 added

comment:9 Changed 4 years ago by grenville

James

You can build your own model using Cray cce8.3.3 with the following:

Please look at xjfve for the changes you need.

You only need to make 2 changes: navigate to model selection → compilation and run options → UM user override files

enter the gcom_path variable as /work/n02/n02/wmcginty/gcom4.5

navigate to model selection → input/output control.. → user hand edits

and add

/home/willie/hand_edits/remove_loadcomp.ed

Then rebuild the model. This is not the ideal solution - we are working on that.

Grenville

comment:10 Changed 4 years ago by annette

  • Resolution set to fixed
  • Status changed from assigned to closed