Opened 4 years ago

Closed 4 years ago

#1549 closed help (fixed)

killed jobs

Reported by: ggxmy
Owned by: um_support
Component: UM Model
Keywords:
Cc: gmann
Platform: ARCHER
UM Version: 8.4

Description

Dear CMS,

My recent UM jobs, tdwpm and tdwpn, were killed before they finished their 4th simulation month (March 2008). I can't find any information about the cause of the failures in the .leave files (tdwpm000.tdwpm.d15117.t144512.leave and tdwpn000.tdwpn.d15117.t144512.leave in /work/n02/n02/masara/um/output/). Could you help me find and fix the problem? Thank you.

Regards,
Masaru Yoshioka

Change History (2)

comment:1 Changed 4 years ago by grenville

Masaru

The leave file indicates an OOM (out of memory) error:

[NID 00063] 2015-04-28 15:33:22 Apid 13838853: OOM killer terminated this process.

We have seen this intermittently (in quite long runs), and the solution that appears to work is to rebuild the model with the Cray cce8.3.3 compiler.

This will entail linking with a GCOM library built with the same compiler.

ARCHER changes its default software quite quickly and we struggle to keep up; the default compiler is now cce8.3.7.

I shall get back to you with some instructions for how best to build with cce8.3.3.
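In the meantime, for anyone trying to spot this kind of failure in their own .leave files, here is a minimal Python sketch that scans the text for lines of the form quoted above. The pattern is based only on that single message, so other OOM message formats may not match; the names find_oom_events, scan_leave_file and OomEvent are made up for this illustration and are not part of the UM or ARCHER tooling.

```python
import re
from typing import List, NamedTuple

# Pattern modelled on the one message quoted in this ticket, e.g.
# [NID 00063] 2015-04-28 15:33:22 Apid 13838853: OOM killer terminated this process.
# Assumption: other systems or versions may word the message differently.
OOM_PATTERN = re.compile(
    r"\[NID (\d+)\] (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"Apid (\d+): OOM killer terminated this process"
)


class OomEvent(NamedTuple):
    """One OOM-killer report (hypothetical helper type)."""
    node: str       # node ID as printed, leading zeros kept
    timestamp: str  # date and time as printed
    apid: str       # application ID as printed


def find_oom_events(text: str) -> List[OomEvent]:
    """Return every OOM-killer report found in the given text."""
    return [OomEvent(*m.groups()) for m in OOM_PATTERN.finditer(text)]


def scan_leave_file(path: str) -> List[OomEvent]:
    """Read a .leave file and return the OOM-killer reports in it."""
    # errors="replace" keeps the scan going if the file has odd bytes.
    with open(path, errors="replace") as fh:
        return find_oom_events(fh.read())
```

Calling scan_leave_file on the path to a .leave file returns one entry per matching line; an empty list means no message in this exact format was found, not necessarily that memory was fine.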

Grenville

comment:2 Changed 4 years ago by annette

  • Resolution set to fixed
  • Status changed from new to closed

This was answered offline. For reference see: #1485
