Opened 6 years ago
Closed 6 years ago
#1443 closed help (fixed)
8.4 Run Crash Problem
| Reported by: | pliojop | Owned by: | grenville |
|---|---|---|---|
| Component: | UM Model | Keywords: | memory, cce, hadgem3 |
| Cc: | | Platform: | ARCHER |
| UM Version: | 8.4 | | |
Description
Hi,
I have a version 8.4 job (xkmwb) running on ARCHER. It is a copy of Grenville's ARCHER test job, xgwtq (HadGEM3 atmosphere only with GA4). At present it runs for about 5 hours, which is about 6 months of model time, before crashing with the error:
UM Executable : /work/n02/n02/japope/um/xkmwb/bin/xkmwb.exe
*
mkdir:: File exists
[NID 04011] 2015-01-19 16:37:25 Apid 12632862: initiated application termination
[NID 04011] 2015-01-19 16:37:28 Apid 12632862: OOM killer terminated this process.
xkmwb: Run failed
I had already encountered this error and, based on a review of other tickets with OOM errors, had changed the job from running on 6 NS processors to 12 processors. However, the error persisted.
There is another issue that may be linked to it, namely the recurrence of:
WARNING *
Conservation enforcement failed
Run continuing using best estimate
WARNING *
non-conservation for field 4
in the leave file. I am not sure whether they are linked in any way.
The leave file for this run is
/home/n02/n02/japope/output/xkmwb000.xkmwb.d15019.t113751.leave
Thanks
James
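The two kinds of message quoted above can be counted in a leave file with a short script. The following is a minimal sketch, not part of the UM tooling: the function names `scan_leave_lines` and `scan_leave_file` are hypothetical, and the patterns are simply the message strings quoted in this ticket, so other UM versions may word them differently.

```python
import re
from collections import Counter

# Patterns copied from the messages quoted in this ticket (assumed wording).
PATTERNS = {
    "oom": re.compile(r"OOM killer terminated this process"),
    "conservation": re.compile(r"Conservation enforcement failed"),
    "non_conservation": re.compile(r"non-conservation for field\s+(\d+)"),
}


def scan_leave_lines(lines):
    """Count OOM and conservation messages in an iterable of text lines."""
    counts = Counter()
    fields = Counter()  # how often each field number is reported
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[name] += 1
                if name == "non_conservation":
                    fields[int(match.group(1))] += 1
    return counts, fields


def scan_leave_file(path):
    """Hypothetical helper: scan a leave file on disk."""
    with open(path, errors="replace") as handle:
        return scan_leave_lines(handle)


# Demo on the lines quoted in the ticket description.
sample = [
    "[NID 04011] 2015-01-19 16:37:25 Apid 12632862: initiated application termination",
    "[NID 04011] 2015-01-19 16:37:28 Apid 12632862: OOM killer terminated this process.",
    "Conservation enforcement failed",
    "Run continuing using best estimate",
    "non-conservation for field 4",
]
counts, fields = scan_leave_lines(sample)
print(dict(counts), dict(fields))
```

Comparing how often the conservation warnings appear in a run that crashes and in one that completes would give a first indication of whether they are related to the OOM failure.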
Change History (10)
comment:1 Changed 6 years ago by grenville
comment:2 Changed 6 years ago by pliojop
Hi Grenville,
Done,
James
comment:3 Changed 6 years ago by grenville
James
We had some problems with OOM errors with an 8.2 model built with cce8.2.1. A rebuild of the model with cce8.3.3 resolved that problem. I am testing your model under cce8.3.3. I'll let you know asap.
Grenville
comment:4 Changed 6 years ago by grenville
James
Please try running your model with my executable. I have tested that it runs OK for a short time; perhaps you could do the long run.
Go to Model Selection → Compilation and run options → Compile and run options for Atmosphere… and set:
Directory for the Model executable: /work/n02/n02/grenvill/um/xjfvd/bin
Filename for the Model executable: xjfvd.exe
This is just your model built with cce8.3.3 and linked against a GCOM built the same way. If this solves the OOM problem, we'll address the compiler issue more fully.
Grenville
comment:5 Changed 6 years ago by pliojop
Hi Grenville,
I have resubmitted the job and will update you on how it runs.
Thanks
James
comment:6 Changed 6 years ago by pliojop
Hi Grenville,
The job has run overnight and continues to run, well past the previous crash point.
Many thanks for your help. For future reference, where do I make the change in my copies of the jobs to ensure that cce8.3.3 is used every time?
James
comment:7 Changed 6 years ago by annette
- Owner changed from um_support to grenville
- Status changed from new to assigned
comment:8 Changed 6 years ago by annette
- Keywords memory, cce, hadgem3 added
comment:9 Changed 6 years ago by grenville
James
You can build your own model using Cray cce8.3.3. Please look at xjfve for the changes you need.
You only need to make 2 changes:
Navigate to Model Selection → Compilation and run options → UM user override files
and enter the gcom_path variable as /work/n02/n02/wmcginty/gcom4.5
Navigate to Model Selection → input/output control.. → user hand edits
and add
/home/willie/hand_edits/remove_loadcomp.ed
Then rebuild the model. This is not the ideal solution - we are working on that.
Grenville
comment:10 Changed 6 years ago by annette
- Resolution set to fixed
- Status changed from assigned to closed
James
Please let us have read permission in your space - please type at the ARCHER command line:
chmod -R g+rX /home/n02/n02/japope
chmod -R g+rX /work/n02/n02/japope
Grenville
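After running the chmod commands, it can be useful to confirm that nothing under the directory is still closed to the group. The following is a minimal sketch under stated assumptions: the function names `lacks_group_access` and `find_unreadable` are hypothetical, a file counts as accessible when it has group read, and a directory when it has group read and execute. It only inspects permission bits and does not check group ownership.

```python
import os
import stat
import tempfile


def lacks_group_access(path):
    """True if path is missing group read (and, for directories, execute)."""
    mode = os.lstat(path).st_mode
    if stat.S_ISLNK(mode):
        return False  # symlink permission bits are not meaningful here
    needed = stat.S_IRGRP
    if stat.S_ISDIR(mode):
        needed |= stat.S_IXGRP  # directories also need search permission
    return (mode & needed) != needed


def find_unreadable(root):
    """List root and everything below it that the group cannot read."""
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    return [path for path in paths if lacks_group_access(path)]


# Demo on a throwaway directory instead of a real home or work space.
with tempfile.TemporaryDirectory() as root:
    os.chmod(root, 0o750)
    private = os.path.join(root, "private.txt")
    shared = os.path.join(root, "shared.txt")
    for path, mode in ((private, 0o600), (shared, 0o640)):
        with open(path, "w") as handle:
            handle.write("x")
        os.chmod(path, mode)
    demo_result = find_unreadable(root)
    print([os.path.basename(path) for path in demo_result])
```

To check a real space, call `find_unreadable` with the directory passed to chmod; an empty list means every entry the script could reach has the group permissions assumed above.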