Opened 11 years ago

Closed 11 years ago

#219 closed error (fixed)

MPI problem on hector

Reported by: sws04jc
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform:
UM Version: 6.1

Description

Dear CMS,

I am running a UM vn6.1 LAM job with tracers, 76 levels, and a 1km horizontal grid length (448 by 352). The job id is xdove, user sws04jc.

The job will not run if I attempt to use more than 32 processors, even though I have attached the modset hector_io. The error message begins "MPICH PtlEQPoll error (PTL_EQ_DROPPED)".

With 32 processors or fewer, the job runs fine.

I can run the 12km and 4km configurations on hector successfully with more processors, with no problem.

Solving this problem is not an urgent priority, but I thought you should know about it.

Thanks in advance for any advice you can offer.

Sincerely,
Jeffrey Chagnon

Change History (3)

comment:1 in reply to: ↑ description ; follow-up: Changed 11 years ago by sws04jc

Willie has pointed out that the hector project webpage recommends increasing the value of the environment variable MPICH_PTL_OTHER_EVENTS.

I'm going to try increasing this (from 2048 to 4096).
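
For anyone hitting the same error, here is a minimal sketch of that change. It assumes the job is launched from a bash-style script and that the variable is exported before the model executable is started. Where exactly this line belongs in a UM job script is an assumption, not something stated on this ticket.

```shell
# Sketch only: raise MPICH_PTL_OTHER_EVENTS before the model is launched.
# 2048 is the previous value mentioned above; 4096 is the value being tried.
export MPICH_PTL_OTHER_EVENTS=4096

# Confirm the setting is visible to anything started from this script.
echo "MPICH_PTL_OTHER_EVENTS=${MPICH_PTL_OTHER_EVENTS}"
```

The echo line is only there so the value shows up in the job output, which makes it easy to check that the setting was picked up.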

Fingers crossed.

Jeffrey

comment:2 in reply to: ↑ 1 Changed 11 years ago by sws04jc

I'd also add that I now see this information posted on the CMS webpage under HPC FAQs.

Sorry to have used this space to talk to myself!

If this doesn't fix the problem, then I will get back in touch.

comment:3 Changed 11 years ago by willie

  • Resolution set to fixed
  • Status changed from new to closed