Opened 6 years ago

Closed 6 years ago

#1186 closed help (fixed)

MPI errors in HadGEM3 job

Reported by: apm
Owned by:    willie
Component:   UM Model
Keywords:    NEMO CICE
Cc:
Platform:    MONSooN
UM Version:  8.0

Description

I am running a HadGEM3 job (UM v8.0, NEMO v3.2) on Monsoon and am trying to increase the number of CPUs to speed the model up. I have found that with certain processor configurations I get a failure at the start of the run with the error message "Invalid communicator (-1) in MPI_Allgather". This message is unfamiliar to me. Do you know what it might mean?
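
For reference, that error normally means a collective operation was invoked on a communicator handle that was never properly initialised, for example because coupler start-up did not complete on every rank. A minimal standalone sketch in plain MPI C (an illustration only, nothing to do with the model code; the variable name coupler_comm is hypothetical) that triggers a similar message:

    /* Sketch: a collective on an uninitialised/null communicator
     * makes the MPI library abort with an "Invalid communicator"
     * error, because the default handler is MPI_ERRORS_ARE_FATAL. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, sendval, recvbuf[64];
        MPI_Comm coupler_comm;   /* hypothetical handle that a coupler
                                    init step was supposed to fill in */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Suppose the init step failed or was skipped on this rank,
         * leaving the handle null (or garbage such as -1 when the
         * handle is an uninitialised Fortran integer)... */
        coupler_comm = MPI_COMM_NULL;

        /* ...then the first collective on it aborts the job. */
        sendval = rank;
        MPI_Allgather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT,
                      coupler_comm);

        MPI_Finalize();
        return 0;
    }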

The configurations I have tried are as follows:

Atmos    Ocean    Comment
10 x 9   1 x 5    The original setup: this runs OK
20 x 9   2 x 5    Runs OK
10 x 9   2 x 5    Runs OK
10 x 9   4 x 5    "Invalid communicator" error
20 x 5   4 x 5    "Invalid communicator" error
20 x 9   2 x 10   "Invalid communicator" error

I have attached the .leave file for the last of these runs. The fact that it is the runs with the larger ocean processor counts that fail suggests that the ocean is generating the problem. Looking at toyoce.prt0, I note that the last operation the ocean completes is prism_init_comp_proto.
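
My working assumption (sketched below in plain C rather than the models' own Fortran, so the names and task counts are illustrative only) is that the coupler carves MPI_COMM_WORLD into one communicator per component during initialisation, so if initialisation stalls, or the total task count does not match what the decompositions require, a component's communicator is never set up correctly:

    /* Sketch of the usual coupled-model pattern: split the world
     * communicator into per-component communicators. If the job is
     * launched with the wrong total task count, one component comes
     * up short and start-up cannot complete cleanly. */
    #include <mpi.h>
    #include <stdio.h>

    #define N_ATMOS (20 * 9)   /* e.g. the 20 x 9 atmosphere decomposition */
    #define N_OCEAN (2 * 10)   /* e.g. the 2 x 10 ocean decomposition */

    int main(int argc, char **argv)
    {
        int rank, size, colour;
        MPI_Comm model_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size != N_ATMOS + N_OCEAN && rank == 0)
            fprintf(stderr, "expected %d tasks, got %d\n",
                    N_ATMOS + N_OCEAN, size);

        /* Each component does its internal collectives on its own
         * communicator, not on MPI_COMM_WORLD. */
        colour = (rank < N_ATMOS) ? 0 : 1;
        MPI_Comm_split(MPI_COMM_WORLD, colour, rank, &model_comm);

        MPI_Comm_free(&model_comm);
        MPI_Finalize();
        return 0;
    }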

Job is xjbta

Many thanks,

Alex

Change History (2)

comment:1 Changed 6 years ago by willie

Hi Alex,

You're getting the error message

ERROR: Expected NEMO output files are not all available.
       This may be a UM / OASIS / NEMO start-up problem.
       The ocean.output file may provide more information.

It also gives warnings that there is no NEMO start dump, and it seems that a NEMO namelist may be missing. The ocean.output file is empty too.

I hope that helps a little.

Regards,

Willie

comment:2 Changed 6 years ago by willie

  • Resolution set to fixed
  • Status changed from new to closed