Opened 11 years ago

Closed 11 years ago

#248 closed error (fixed)

MPI_BSEND error

Reported by: ggxjd
Owned by: lois
Component: UM Model
Keywords:
Cc: j.cole@…
Platform:
UM Version:

Description

Hi,
Can anyone shed any light on the following error message? It's a LAM run. I'm aware that this has something to do with the GCOM interface library but have no idea how to sort it out. How can I increase the size of this buffer?

Cheers
Jonny

0 - MPI_BSEND : Insufficent space available in user-defined buffer
[0] [] Aborting Program!
mpiexec: Warning: accept_abort_conn: MPI_Abort from IP 10.142.0.2, rank 321, killing all.
mpiexec: Warning: tasks 0-7 died with signal 15 (Terminated).

120.68s real 0.01s user 0.01s system

diff: /exports/gpfsbig/work/bristol/ggxjd/um/tcqik/tmp/tcqik.xhist: No such file or directory
qsmain: Copying /exports/gpfsbig/work/bristol/ggxjd/um/tcqik/dataw/tcqik.thist to backup thist file /exports/gpfsbig/work/bristol/ggxjd/um/tcqik/dataw/tcqik.thist_keep
tcqik: Run failed

Change History (4)

comment:1 Changed 11 years ago by willie

Hi Jonny,

Issues relating to MPI problems are discussed at http://ncas-cms.nerc.ac.uk/content/view/1412/43/ on the CMS web page and at http://www.hector.ac.uk/support/cse/Resources/faq/#Error_messages_and_debugging5 on the HECToR page.

It is useful to do a consistency check on the model set-up. On the STASH diagnostics page, run a verify (Ctrl+V) and correct any reported problems: there are a few diagnostics that need to be marked as "Not Included".

In the UMUI, go to Sub Model Independent > Post processing > Initialization and Post Processing of mean and standard PP files. Ensure that the "Unpacked, profile 0" button is highlighted and that each entry in the table has packing profile zero. Failure to do this will result in illegal numbers ("Not a Number", or NaN) in the diagnostic STASH output.

Run the UMUI's check setup to make sure that there are no errors reported.

In version 6.1 of the UM we have found it necessary to include the hector_io.mf77 mod set when using large numbers of processors. It may be worth trying this.

I hope this helps.

Regards,

Willie

comment:2 Changed 11 years ago by lois

  • Cc j.cole@… added
  • Owner changed from um_support to lois
  • Status changed from new to assigned

Although Willie is right that you need to check your setup, I think this is really a problem with running UM vn 4.5 at high resolution on the Bristol cluster.

On HECToR we have set up some versions of the GCOM library built with bigger buffers just for these high resolution cases. If you were working there, I would have advised you to create a compile override file that links with these GCOM libraries rather than the standard one.

You need to check with the person who installed UM vn 4.5 on your cluster what GCOM libraries are available. You might also ask if they are able to build another version with a big buffer for your case. It is relatively simple to do, and Jeff Cole could offer advice. Could you also find out what version of GCOM you are using at Bristol, as this will help with giving advice?
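
As a rough illustration of why a bigger buffer helps: in MPI, each buffered send occupies its message size plus a fixed overhead (MPI_BSEND_OVERHEAD) in the attached buffer until it is delivered, so larger messages at high resolution can exhaust a buffer that was adequate at lower resolution. The sketch below only does that arithmetic. The overhead value, message counts, message sizes and buffer size are all hypothetical, not taken from GCOM or from this run.

```python
def bsend_buffer_bytes(message_bytes, overhead_bytes):
    """Total bytes needed if every listed buffered send is pending at once.

    overhead_bytes stands in for MPI_BSEND_OVERHEAD, which is
    implementation-defined; the value used below is an assumption.
    """
    return sum(size + overhead_bytes for size in message_bytes)


def fits_in_buffer(buffer_bytes, message_bytes, overhead_bytes):
    """True if the attached buffer can hold all pending sends."""
    return bsend_buffer_bytes(message_bytes, overhead_bytes) <= buffer_bytes


# Hypothetical numbers: 4 pending messages of 8-byte reals.
ASSUMED_OVERHEAD = 96          # assumed stand-in for MPI_BSEND_OVERHEAD
coarse = [10_000 * 8] * 4      # 10,000 values per message
fine = [40_000 * 8] * 4        # 4x as many values per message
buffer_bytes = 1_000_000       # assumed size of the attached buffer

print(fits_in_buffer(buffer_bytes, coarse, ASSUMED_OVERHEAD))
print(fits_in_buffer(buffer_bytes, fine, ASSUMED_OVERHEAD))
```

With these made-up numbers the coarse case fits and the fine case does not, which is the situation the MPI_BSEND message describes.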

Lois

comment:3 Changed 11 years ago by ggxjd

Thanks guys. A new library with a bigger buffer has been compiled, I have linked to it via the compilation overrides window, and it seems to work fine. I have one question though. On another ticket I saw a response to the same problem, where the compile override file had the following syntax:

@load LCOM_PATH=-L. -L$(UMDIR)/gcom/um5.5/lib -L$(UMDIR)/lib
@load LCOM_LIBS=-lgcom_mpi_buffered_bigbuff -lvect -lmass_9.1 -lmassv

Mine is obviously different (path, filename, etc.). Do I need to include the -lvect, -lmass_9.1 and -lmassv terms, though?

Cheers
Jonny

comment:4 Changed 11 years ago by lois

  • Resolution set to fixed
  • Status changed from assigned to closed

The load options will depend on what computer you are using. I think the load options you refer to are for the UM running on HPCx.
You just need to replicate what is in the load options in the compile_vars file for the Bristol cluster.
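
One way to picture this is as a small sketch: take the existing library flags and swap only the GCOM entry for the big-buffer build, leaving every other flag as the cluster already has it. The helper name, the example flag string and the library names below are hypothetical. The real values have to come from your own compile_vars file.

```python
def swap_gcom_library(link_flags, new_gcom_lib):
    """Replace the -lgcom* entry in a linker flag string; keep the rest."""
    out = []
    replaced = False
    for flag in link_flags.split():
        if flag.startswith("-lgcom"):
            out.append("-l" + new_gcom_lib)
            replaced = True
        else:
            out.append(flag)
    if not replaced:
        raise ValueError("no -lgcom* entry found in: " + link_flags)
    return " ".join(out)


# Hypothetical existing value; take the real one from compile_vars.
existing = "-lgcom_mpi_buffered -lsomelib"
print("@load LCOM_LIBS=" + swap_gcom_library(existing, "gcom_mpi_buffered_bigbuff"))
```

The HPCx-specific terms (-lvect, -lmass_9.1, -lmassv) would only appear in the result if they were already in the flags for your machine.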

If it works that is great, so I will close this query.

Lois
