Opened 5 years ago

Closed 5 years ago

#1338 closed error (fixed)

ARCHER vn8.2 job hangs on normal completion

Reported by: swr05npk
Owned by: um_support
Component: UM Model
Keywords: hang, ARCHER
Cc:
Platform: ARCHER
UM Version: 8.2

Description

I have been testing an aquaplanet UM configuration on ARCHER at vn8.2. At the end of the job, the MPI environment hangs until I terminate it, or until the wallclock time expires. This happens regardless of whether the job stops because of an error or because the job completes normally.

My job is xiurl. On Friday, it ran three months in 2.5 hours, then continued running for a further 2.5 hours (producing no further output) until I killed it. You can see the .leave file at

~pappas/output/xiurl000.xiurl.d14220.t122226.leave

but it is not particularly helpful because I terminated the job. The individual PE output files are

/work/n02/n02/pappas/um/xiurl/pe_output

PE 0 says

MPPIO: Shutdown IO

*******************************************************************************
***************** End of UM RUN Job : 15:25:49 on 08/08/2014 ******************
*******************************************************************************

Process     0 has exited.

yet the job was still running at 17:36 (2+ hours later) when I terminated it.
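To put a number on the gap between the end-of-run banner and the point at which the job was killed, the PE output can be parsed directly. The following is a minimal sketch, not part of the UM tooling: the function names (`end_of_run`, `exited_rank`, `idle_time`) are hypothetical, the patterns are copied from the PE 0 output quoted above, and the date is assumed to be in day/month/year order.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Patterns taken from the PE 0 output quoted in this ticket.
END_BANNER = re.compile(
    r"End of UM RUN Job : (\d{2}:\d{2}:\d{2}) on (\d{2}/\d{2}/\d{4})"
)
EXIT_LINE = re.compile(r"Process\s+(\d+) has exited\.")


def end_of_run(text: str) -> Optional[datetime]:
    """Return the timestamp in the end-of-run banner, or None if absent."""
    match = END_BANNER.search(text)
    if match is None:
        return None
    # Assumption: the banner date is day/month/year.
    return datetime.strptime(
        f"{match.group(2)} {match.group(1)}", "%d/%m/%Y %H:%M:%S"
    )


def exited_rank(text: str) -> Optional[int]:
    """Return the rank named in a 'Process N has exited.' line, if any."""
    match = EXIT_LINE.search(text)
    return int(match.group(1)) if match else None


def idle_time(text: str, now: datetime) -> Optional[timedelta]:
    """Time between the end-of-run banner and `now`, or None if no banner."""
    finished = end_of_run(text)
    return None if finished is None else now - finished


# Sample text reproduces the lines reported for PE 0.
sample = """MPPIO: Shutdown IO
***************** End of UM RUN Job : 15:25:49 on 08/08/2014 ******************
Process     0 has exited.
"""
killed_at = datetime(2014, 8, 8, 17, 36)  # time the job was terminated

print(idle_time(sample, killed_at))  # 2:10:11
print(exited_rank(sample))           # 0
```

Running the same functions over every file in the pe_output directory would show which ranks printed the exit line and which did not. How those files are named is not stated in this ticket, so that loop is left out.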

This is the first time I have used vn8.2, so I don't know whether this issue is specific to the aquaplanet configuration. However, MPI hangs appear to be a known issue with vn8.2:

http://collab.metoffice.gov.uk/twiki/pub/Support/CodeDevelopment/UM8.2ReleaseNotes-594715l9t39032o573.html

"Due to changes to the MPI communicators in the model, if a job crashes the scheduler now takes some responsibility for terminating the remaining processes. This can cause a failed job to hang for a time after the initial crash: e.g. up to ten minutes under Loadleveler on the IBM or indefinitely with 'at' on Linux. If a job crashes, check for any remaining processes that may require terminating."

That doesn't say anything about jobs that complete normally, however.

Change History (5)

comment:1 Changed 5 years ago by willie

Hi Nick,

I think you may have run out of disk quota on /home and on /work. Try

 du -mshc /home/n02/n02/pappas/*
 du -mshc /work/n02/n02/pappas/*

Regards,

Willie

comment:2 Changed 5 years ago by swr05npk

Hi Willie,

I don't think that can be it. I am using 13GB on /home (quota = 20GB) and 1.5TB on /work (quota = 5TB).

Cheers,
Nick

comment:3 Changed 5 years ago by willie

Hi Nick,

I switched post processing off (Post Proc > Main Switch) and that solved the problem. If you need post processing, you will need to add the branches listed at http://cms.ncas.ac.uk/wiki/Archer/NercArchiving.

Regards,

Willie

comment:4 Changed 5 years ago by annette

Hi Nick,

I assume this solved your problem so I am going to close the ticket. Do please get in touch if you have any further queries.

Regards,
Annette

comment:5 Changed 5 years ago by annette

  • Resolution set to fixed
  • Status changed from new to closed