Opened 3 years ago

Closed 3 years ago

#1937 closed help (answered)

UNRECOVERABLE library error

Reported by: anmcr
Owned by: willie
Component: UM Model
Keywords: unable to request more memory space
Cc:
Platform: MONSooN
UM Version: 10.3

Description

Hello,

My job name is u-ae592, which uses Stu Webster's nesting suite, running on MONSooN. My model is failing when it tries to reconfigure the driving model (glm/glm_um_glm_um_recon) with the following error:

lib-4205 : UNRECOVERABLE library error

The program was unable to request more memory space.

The relevant output file can be found at /home/amworr/cylc-run/u-ae592/log/job/20150323T0000Z/glm_um_recon/04/job.out.

Andy Elvidge at the Met Office had a similar problem and, on the advice of Stu, overcame it by editing the /home/amworr/cylc-run/u-ae592/site/monsoon-cray-xc40/suite-adds.rc file. Specifically, the lines

{% set MPI_TASKS_PER_NODE = (NCPU_PER_NODE * HYPERTHREADS / 6*OMP_NUM_THREADS)|int %}
{% set TASKS_PER_NUMA = (MPI_TASKS_PER_NODE / 2)|int %}
{% set RCF_NPROCY = 2 %}
{% set RCF_NPROCX = 3 %}

and

ROSE_LAUNCHER_PREOPTS = -n {{TOTAL_MPI_TASKS}} -N {{MPI_TASKS_PER_NODE}} -S {{TASKS_PER_NUMA}} -d {{OMP_NUM_THREADS}} -j {{HYPERTHREADS}}

Andy recommended setting the numbers 6, 2 and 3 in the former, and removing '-ss' from the latter. However, after trying this I am still getting the same error.
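
As a rough check of what the quoted template lines evaluate to, the arithmetic can be mirrored in plain Python. This is only a sketch: NCPU_PER_NODE = 32 follows the 32-processors-per-node figure for the Cray XC40 given later in this ticket, while HYPERTHREADS = 1 and OMP_NUM_THREADS = 1 are assumed values, since the suite's real settings are not shown here. The helper names are made up for illustration.

```python
def tasks_per_node(ncpu_per_node, hyperthreads, divisor, omp_num_threads):
    """Mirror (NCPU_PER_NODE * HYPERTHREADS / divisor*OMP_NUM_THREADS)|int."""
    # * and / have equal precedence and are evaluated left to right, so this is
    # ((ncpu_per_node * hyperthreads) / divisor) * omp_num_threads, truncated.
    return int(ncpu_per_node * hyperthreads / divisor * omp_num_threads)


def tasks_per_numa(mpi_tasks_per_node):
    """Mirror (MPI_TASKS_PER_NODE / 2)|int."""
    return int(mpi_tasks_per_node / 2)


# Assumed values; the real suite settings are not quoted in the ticket.
NCPU_PER_NODE = 32
HYPERTHREADS = 1
OMP_NUM_THREADS = 1

per_node = tasks_per_node(NCPU_PER_NODE, HYPERTHREADS, 6, OMP_NUM_THREADS)
per_numa = tasks_per_numa(per_node)
print("MPI_TASKS_PER_NODE =", per_node)
print("TASKS_PER_NUMA =", per_numa)
```

Because the expression is evaluated left to right, it multiplies by OMP_NUM_THREADS rather than dividing by it. With a single OpenMP thread this makes no difference, but with more threads the two readings diverge, so it may be worth checking which one the suite intends.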

Unfortunately Stu Webster is on holiday this week, but I need the model run completed as a priority for a paper resubmission and an international talk.

I would be very grateful for any help.

Thank you.

Andrew

Attachments (1)

for_willie.PNG (100.6 KB) - added by anmcr 3 years ago.
screen shot of rose for u-af572


Change History (11)

comment:1 Changed 3 years ago by anmcr

Sorry, the text in parentheses in the first line should have said '…(glm/glm_um/glm_um_recon) …'

comment:2 Changed 3 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Andrew,

If your start dump is large you could increase the number of processors from 2x3 to 4x8. The Cray XC40 has 32 processors per node.

Regards
Willie


comment:3 Changed 3 years ago by anmcr

Hi Willie,

Thanks for the reply.

I couldn't find where this is changed using ROSE. Could you please advise?

Thanks,

Andrew

comment:4 Changed 3 years ago by willie

Hi Andrew,

It's in suite conf > site > monsoon-cray-xc40 > suite-adds.rc. Increasing the number of processors should give each processor less to do, so there is less risk of memory problems.

Willie
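
The effect of a larger decomposition on the work per processor can be illustrated with a short Python sketch. The grid size (1024 x 768 points) is purely hypothetical, as the real dimensions of the start dump are not given in this ticket, and treating each pair as NPROCX x NPROCY is likewise an assumption. The sketch only counts grid points per task; it does not model any memory use that stays fixed regardless of the task count.

```python
import math


def points_per_task(nx, ny, nprocx, nprocy):
    """Largest number of grid points any one task holds in an
    nprocx x nprocy decomposition of an nx x ny grid."""
    # Ceiling division: the largest sub-domain takes the remainder points.
    return math.ceil(nx / nprocx) * math.ceil(ny / nprocy)


# Hypothetical grid dimensions, chosen only for illustration.
NX, NY = 1024, 768

for nprocx, nprocy in [(3, 2), (8, 4), (8, 8)]:
    tasks = nprocx * nprocy
    points = points_per_task(NX, NY, nprocx, nprocy)
    print(f"{nprocx}x{nprocy}: {tasks} tasks, up to {points} points per task")
```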

comment:5 Changed 3 years ago by anmcr

Hi Willie,

I couldn't find this in rose. See attached screen shot. There was no 'site' option in 'suite conf'. Could you please advise?

Thanks,

Andrew

Changed 3 years ago by anmcr

screen shot of rose for u-af572

comment:6 Changed 3 years ago by willie

Hi Andrew,

Sorry, the above applies to u-ae592. For u-af572, it's in two places under suite conf

  • suite-runtime-dm.rc
  • suite-runtime-lams.rc

I don't know why it's duplicated, so you may need to edit both. It is already 4x8 so you could try 8x8.

Willie

comment:7 Changed 3 years ago by anmcr

Dear Willie,

I tried 4x8, as well as 8x8 and other combinations and the error is still persisting. The job is named u-ag300, and is vn10.4. I would be really grateful if you could look into this further.

(Note that my delay in replying was because I reverted to using a vn10.2 job, which does not have this problem, after trying your fixes. However, I have subsequently realised that I need vn10.4.)

Best wishes,

Andrew

comment:8 Changed 3 years ago by anmcr

Willie,

I've just realised that I needed to have used "rose suite-run --reload". Let me have a second go at your suggestions …

Andrew

comment:9 Changed 3 years ago by anmcr

Dear Willie,

Please close this ticket. I've got the reconfiguration to work. In the end Stu Webster recommended the setup below, which was successful.

Thanks for your help,

Andrew

{% set MPI_TASKS_PER_NODE = (NCPU_PER_NODE * HYPERTHREADS / 8*OMP_NUM_THREADS)|int %}
{% set TASKS_PER_NUMA = (MPI_TASKS_PER_NODE / 2)|int %}
{% set RCF_NPROCY = 2 %}
{% set RCF_NPROCX = 2 %}
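
To compare this working setup with the 6, 2 and 3 values tried at the start of the ticket, the following Python sketch works out how many MPI tasks land on each node in the two cases. It assumes NCPU_PER_NODE = 32, HYPERTHREADS = 1 and OMP_NUM_THREADS = 1, and that the reconfiguration runs NPROCX * NPROCY tasks in total; none of these values is confirmed in the ticket. The idea that fewer tasks per node leaves each task a larger share of the node's memory is offered as a plausible reading, not as the confirmed reason the change worked.

```python
import math


def recon_layout(divisor, nprocx, nprocy,
                 ncpu_per_node=32, hyperthreads=1, omp_num_threads=1):
    """Task layout implied by the suite-adds.rc arithmetic (default values assumed)."""
    # Same left-to-right evaluation and truncation as the template expression.
    tasks_per_node = int(ncpu_per_node * hyperthreads / divisor * omp_num_threads)
    # Assumption: the reconfiguration runs NPROCX * NPROCY MPI tasks.
    total_tasks = nprocx * nprocy
    return {
        "tasks_per_node": tasks_per_node,
        "total_tasks": total_tasks,
        "nodes": math.ceil(total_tasks / tasks_per_node),
        # Share of node memory for a task on a fully populated node.
        "node_memory_share": 1.0 / tasks_per_node,
    }


first_attempt = recon_layout(divisor=6, nprocx=3, nprocy=2)
working_setup = recon_layout(divisor=8, nprocx=2, nprocy=2)

for label, layout in [("6, 2x3", first_attempt), ("8, 2x2", working_setup)]:
    print(label, layout)
```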

comment:10 Changed 3 years ago by grenville

  • Resolution set to answered
  • Status changed from accepted to closed