Opened 11 years ago

Closed 10 years ago

#433 closed help (fixed)

HiGEM job crash on HECToR

Reported by: sobaux Owned by: lois
Component: HECToR Keywords:
Cc: Platform:
UM Version: 6.1

Description

Hi,
I've been running a HiGEM job on HECToR. It had been running OK, resubmitting in 3-month chunks that each take about 11 hours. A job failed today 2.75 hours into the run with the error '[NID 12577] 2010-05-25 14:25:59 Apid 1934277 killed. Received node failed or halted event for nid 895'

I've not seen this before and nothing appears wrong in the most recent dumps or means. Also, no core file was produced.

The job is xepgn, run in directory /work/n02/n02/sobaux/xepgn, and the output is xepgn006.xepgn.d10145.t010326.leave in ~sobaux/um/umui_out.

Is it possibly a hardware fault?

Thanks

Ian

Change History (2)

comment:1 Changed 11 years ago by lois

  • Owner changed from um_support to lois
  • Status changed from new to assigned

Hello Ian,

I suspect this is a hardware fault. I wish HECToR would put more detailed information about these single-node failures on their web pages. The failures are listed in their monthly reports, but these usually arrive too late to help users. So I would just try again, if you can struggle through the very long queues at the moment.

You may be able to get your AUs back if you complain to the HECToR helpdesk.

Lois
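
For anyone checking a .leave file for the same symptom, a minimal sketch in Python follows. It assumes the message has exactly the shape quoted in the description above; the pattern is modelled on that single example, so other wordings would not be matched. The names NODE_FAILURE_RE and find_node_failures are hypothetical and not part of the UM or any HECToR tooling.

```python
import re

# Pattern modelled on the one message quoted in this ticket (assumption:
# other node-failure messages share this exact layout).
NODE_FAILURE_RE = re.compile(
    r"\[NID (?P<reporting_nid>\d+)\]\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"Apid (?P<apid>\d+) killed\.\s+"
    r"Received node failed or halted event for nid (?P<failed_nid>\d+)"
)


def find_node_failures(text):
    """Return one dict per node-failure message found in the given text."""
    return [match.groupdict() for match in NODE_FAILURE_RE.finditer(text)]


if __name__ == "__main__":
    # The message reported in this ticket, used as sample input.
    sample = (
        "[NID 12577] 2010-05-25 14:25:59 Apid 1934277 killed. "
        "Received node failed or halted event for nid 895"
    )
    for hit in find_node_failures(sample):
        print(hit)
```

A match only shows that the job was killed after a node-failure event was reported; it does not show what went wrong on the node, so the HECToR helpdesk remains the place to confirm the cause.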

comment:2 Changed 10 years ago by lois

  • Resolution set to fixed
  • Status changed from assigned to closed