#2162 closed help (answered)
Error with coupled model on ARCHER
| Reported by: | acc | Owned by: | annette |
|---|---|---|---|
| Component: | Coupled model | Keywords: | |
| Cc: | | Platform: | ARCHER |
| UM Version: | 10.6 | | |
Description
I seem to have come unstuck since early last week. I had the job almost working apart from OOM problems at the end of the month. All my attempts to increase the number of XIOS processors (which should fix this) have failed. In desperation, I repeated the configuration that previously got to the end of the month and now that is failing at start-up. I suspect something has changed following the maintenance session last week but even a completely fresh rebuild still encounters the same problem. Is anyone else having issues?
The error manifests as:
```
???????????????????????????????????????????????????????????????????????????
????????????????????????????????? WARNING ??????????????????????????????????
? Warning code: -1
? Warning from routine: eg_SISL_setcon
? Warning message: Constant gravity enforced
? Warning from processor: 0
? Warning number: 16
???????????????????????????????????????????????????????????????????????????

Unable to mmap hugepage 138412032 bytes
For file /var/lib/hugetlbfs/global/pagesize-2097152/hugepagefile.MPICH.2.32675.kvs_26644336 err Cannot allocate memory
```

with a hint further down that the problem is in OASIS:

```
Rank 1872 [Wed May 3 16:52:11 2017] [c1-2c1s5n1] Fatal error in MPI_Send: Other MPI error, error stack:
MPI_Send(186)..........................: MPI_Send(buf=0x7ffffffe275c, count=1, MPI_INTEGER, dest=4105, tag=0, comm=0x84000004) failed
MPIDI_EagerContigShortSend(273)........: failure occurred while attempting to send an eager message
MPID_nem_gni_iStartContigMsg(1426).....:
MPID_nem_gni_iSendContig_start(1152)...:
MPID_nem_gni_smsg_cm_send_conn_req(691):
MPID_nem_gni_smsg_cm_progress_req(217).:
MPID_nem_gni_smsg_mbox_alloc(355)......:
MPID_nem_gni_smsg_mbox_block_alloc(236): Out of memory
Application 26644336 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 1872 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  nemo_@nemo.f90:18
  nemo_gcm$nemogcm_@nemogcm.F90:129
  nemo_init$nemogcm_@nemogcm.F90:334
  sbc_init$sbcmod_@sbcmod.F90:298
  sbc_cpl_init$sbccpl_@sbccpl.F90:968
  cpl_define$cpl_oasis3_@cpl_oasis3.F90:285
  oasis_enddef$mod_oasis_method_@mod_oasis_method.F90:710
  oasis_part_setup$mod_oasis_part_@mod_oasis_part.F90:299
  initd_$m_globalsegmap_@m_GlobalSegMap.F90:306
  fc_gather_int$m_fccomms_@m_FcComms.F90:146
  PMPI_SEND@0x2282604
  MPI_Send@0x227f4ef
  MPIR_Err_return_comm@0x2291b2b
  handleFatalError@0x22919b9
  MPID_Abort@0x22a8a21
  abort@abort.c:92
  raise@pt-raise.c:42
ATP Stack walkback for Rank 1872 done
Process died with signal 6: 'Aborted'
Forcing core dumps of ranks 1872, 11, 35, 0, 5977, 1873, 8037
```
Is it possible that the OASIS library needs to be recompiled?
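For what it's worth, a quick way to look at the hugepage side of this (a minimal sketch, assuming the usual ARCHER craype-hugepages setup; the exact checks are illustrative, not something from the job output above):

```sh
# Which craype-hugepages module does the job environment load?
# ('module list' reports to stderr, hence the redirect.)
module list 2>&1 | grep -i hugepages

# How much hugepage memory does the node actually have free?
# (Run this on the compute node, e.g. under aprun, not on a login node.)
grep -i huge /proc/meminfo
```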
-Andrew
Change History (4)
comment:1 Changed 4 years ago by annette
comment:2 Changed 4 years ago by annette
Hi Andrew,
I have recompiled from scratch a coupled model that uses the same OASIS build as yours, and it looks to have run OK, so I don't know what has happened with your suite. The N512-ORCA025 suite yours is based on also seems to be running (well, it is failing, but for different reasons).
Have you had the same error more than once? Is it worth re-running? Let me know if you hear anything back from ARCHER. We have had trouble in the past with modules not exactly defining the environment, but I thought OASIS was only sensitive to the MPICH version.
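(If it helps, a minimal check, assuming the standard ARCHER module environment: compare the cray-mpich loaded by the compile job, the run job, and the one the OASIS library was built against.)

```sh
# Which cray-mpich module is loaded in this environment?
# Run the same check in both the compile and run job output.
module list 2>&1 | grep -i mpich

# What cray-mpich versions does the system offer now
# (maintenance sessions sometimes change the default)?
module avail cray-mpich 2>&1
```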
Annette
comment:3 Changed 4 years ago by grenville
- Resolution set to answered
- Status changed from new to closed
Closed for lack of activity
comment:4 Changed 3 years ago by annette
For reference, the issue was solved by the Met Office by removing these lines from the suite:
```
MPICH_GNI_MAX_EAGER_MSG_SIZE=65536
MPICH_GNI_MAX_VSHORT_MSG_SIZE=8192
```
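For anyone checking their own suite: these overrides end up as ordinary environment variables in the coupled task's job, so a minimal way to confirm they are really gone (assuming you can add a line to the task's script or run it interactively on the job's environment) is:

```sh
# Confirm the GNI eager-message overrides are no longer set at run time;
# if nothing prints, MPICH falls back to its own defaults.
env | grep MPICH_GNI_MAX || echo "no MPICH_GNI_MAX_* overrides set"
```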
I am investigating…
Annette