Opened 2 years ago

Closed 2 years ago

Last modified 18 months ago

#2162 closed help (answered)

Error with coupled model on ARCHER

Reported by:  acc
Owned by:     annette
Component:    Coupled model
Keywords:
Cc:
Platform:     ARCHER
UM Version:   10.6

Description

I seem to have come unstuck since early last week. I had the job almost working, apart from out-of-memory (OOM) problems at the end of the month. All my attempts to increase the number of XIOS processors (which should fix this) have failed. In desperation, I repeated the configuration that previously got to the end of the month, and now that is failing at start-up. I suspect something changed during last week's maintenance session, but even a completely fresh rebuild hits the same problem. Is anyone else having issues?

The error manifests as:

????????????????????????????????????????????????????????????????????????????????
??????????????????????????????      WARNING      ??????????????????????????????
?  Warning code: -1
?  Warning from routine: eg_SISL_setcon
?  Warning message:  Constant gravity enforced
?  Warning from processor: 0
?  Warning number: 16
????????????????????????????????????????????????????????????????????????????????

Unable to mmap hugepage 138412032 bytes
For file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.2.32675.kvs_26644336 err Cannot allocate memory


with a hint further down that the problem is in OASIS:

Rank 1872 [Wed May  3 16:52:11 2017] [c1-2c1s5n1] Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(186)..........................: MPI_Send(buf=0x7ffffffe275c, count=1, MPI_INTEGER, dest=4105, tag=0, comm=0x84000004) failed
MPIDI_EagerContigShortSend(273)........: failure occurred while attempting to send an eager message
MPID_nem_gni_iStartContigMsg(1426).....:
MPID_nem_gni_iSendContig_start(1152)...:
MPID_nem_gni_smsg_cm_send_conn_req(691):
MPID_nem_gni_smsg_cm_progress_req(217).:
MPID_nem_gni_smsg_mbox_alloc(355)......:
MPID_nem_gni_smsg_mbox_block_alloc(236): Out of memory
Application 26644336 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 1872 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  nemo_@nemo.f90:18
  nemo_gcm$nemogcm_@nemogcm.F90:129
  nemo_init$nemogcm_@nemogcm.F90:334
  sbc_init$sbcmod_@sbcmod.F90:298
  sbc_cpl_init$sbccpl_@sbccpl.F90:968
  cpl_define$cpl_oasis3_@cpl_oasis3.F90:285
  oasis_enddef$mod_oasis_method_@mod_oasis_method.F90:710
  oasis_part_setup$mod_oasis_part_@mod_oasis_part.F90:299
  initd_$m_globalsegmap_@m_GlobalSegMap.F90:306
  fc_gather_int$m_fccomms_@m_FcComms.F90:146
  PMPI_SEND@0x2282604
  MPI_Send@0x227f4ef
  MPIR_Err_return_comm@0x2291b2b
  handleFatalError@0x22919b9
  MPID_Abort@0x22a8a21
  abort@abort.c:92
  raise@pt-raise.c:42
ATP Stack walkback for Rank 1872 done
Process died with signal 6: 'Aborted'
Forcing core dumps of ranks 1872, 11, 35, 0, 5977, 1873, 8037

Is it possible that the OASIS library needs to be recompiled?

-Andrew

Change History (4)

comment:1 Changed 2 years ago by annette

I am investigating…

Annette

comment:2 Changed 2 years ago by annette

Hi Andrew,

I have recompiled a coupled model from scratch using the same OASIS build as yours, and it appears to run OK, so I don't know what has happened with your suite. The N512-ORCA025 suite that yours is based on also seems to be running (well, it is failing, but for different reasons).

Have you had the same error more than once? Is it worth re-running? Let me know if you hear anything back from ARCHER. We have had trouble in the past with modules not fully defining the environment, but I thought OASIS was only sensitive to the MPICH version.
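
One quick consistency check, assuming the NEMO executable is dynamically linked (the path below is illustrative), is to compare the MPICH module currently loaded with the MPI library the executable actually links against:

module list 2>&1 | grep -i mpich      # MPICH module loaded in the current environment
ldd /path/to/nemo.exe | grep -i mpi   # MPI library the executable was linked against

If the two disagree, rebuilding against the currently loaded MPICH (OASIS included) would be the usual next step.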

Annette

comment:3 Changed 2 years ago by grenville

  • Resolution set to answered
  • Status changed from new to closed

Closed for lack of activity

comment:4 Changed 18 months ago by annette

For reference, the issue was solved by the Met Office by removing these lines from the suite:

MPICH_GNI_MAX_EAGER_MSG_SIZE=65536
MPICH_GNI_MAX_VSHORT_MSG_SIZE=8192
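
These settings raise the size limits for the Cray GNI eager/vshort (SMSG mailbox) message paths, so more memory is set aside per connection for mailboxes, which appears consistent with the MPID_nem_gni_smsg_mbox_block_alloc "Out of memory" failure in the traceback above; with the lines removed, MPICH falls back to its library defaults. To find where they are set in a local copy of a suite (the suite path below is illustrative), something like this works:

grep -rn "MPICH_GNI_MAX" ~/roses/u-xxxxx/   # locate the two assignments, then delete them and rerun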