Opened 7 months ago

Last modified 6 months ago

#3495 reopened help

out-of-memory (OOM) on ARCHER2 N216 run, 1152 cores

Reported by: pmcguire
Owned by: annette
Component: ARCHER2
Keywords: ARCHER2, UM
Cc:
Platform: ARCHER2
UM Version:

Description

Hi CMS Helpdesk:
I got the CSSP China PORCELAIN UM11.5 suite to run for 10-20 minutes or so on ARCHER2 before it crashed because it ran out of memory (N216, 1152 cores, 48x24 decomposition).
I suppose I probably made a mistake somewhere? Or maybe there is some more serious problem?
Any suggestions?

I have tried several times, with different domain decompositions, and 1-2 attempts for each, in different suites:
u-cc629long, u-cc629long2, and u-cc629long3.
Patrick

Change History (15)

comment:1 Changed 7 months ago by annette

Patrick,

We had memory problems with the N1280 model, related to this issue with MPICH:
https://docs.archer2.ac.uk/known-issues/#memory-leak-leads-to-job-fail-by-out-of-memory-oom-error

What worked for me was loading the UCX MPICH at runtime (no need to recompile). In the site/archer2.rc file I have the following pre-script (instead of loading the um modulefile):

    [[HPC]]
        pre-script = """
                     ulimit -s unlimited
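                     # save the current thread count so it can be restored after the module swaps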
                     TOMP_NUM_THREADS=${OMP_NUM_THREADS:-}
                     module load epcc-job-env
                     module load cray-hdf5-parallel
                     module load cray-netcdf-hdf5parallel
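                     # swap the default OFI network and MPICH for the UCX versions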
                     module unload craype-network-ofi
                     module unload cray-mpich
                     module load craype-network-ucx
                     module load cray-mpich-ucx
                     module load libfabric
                     module list 2>&1
                     export OMP_NUM_THREADS=$TOMP_NUM_THREADS
                     ""

You may need to adapt for your suite.
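
Since the pre-script runs module list 2>&1, one quick sanity check that the swap took effect (a sketch; adjust the suite name and cycle point to yours, and note that NN is Cylc's symlink to the most recent attempt) is to grep the task logs for the UCX modules:

    grep -i ucx ~/cylc-run/u-cc629long/log/job/19880901T0000Z/atmos_main/NN/job.out \
                ~/cylc-run/u-cc629long/log/job/19880901T0000Z/atmos_main/NN/job.err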

Annette

comment:2 Changed 7 months ago by pmcguire

Thanks, Annette!

I am trying your nice suggestion now.

Patrick

comment:3 Changed 7 months ago by pmcguire

Hi CMS Helpdesk:
I tried Annette's suggestion.

The new error that I get is as follows. Any suggestions?
Patrick

from ~pmcguire/cylc-run/u-cc629long/log/job/19880901T0000Z/atmos_main/04/job.err :

Unloading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env-profile
Unloading bolt/0.7
Loading cpe-cray
Loading cce/10.0.4
Loading craype/2.7.2
Loading craype-x86-rome
Loading libfabric/1.11.0.0.233
Loading craype-network-ofi
Loading cray-dsmml/0.1.2
Loading perftools-base/20.10.0
Loading xpmem/2.2.35-7.0.1.0_1.9__gd50fabf.shasta
Loading cray-mpich/8.0.16
Loading cray-libsci/20.10.1.2
Loading bolt/0.7
Loading /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
Loading /usr/local/share/epcc-module/epcc-module-loader

Loading epcc-job-env
  Loading requirement: bolt/0.7
    /work/y07/shared/archer2-modules/modulefiles-cse/epcc-setup-env
Unloading craype-network-ofi
  Unloading useless requirement: libfabric/1.11.0.0.233
Loading craype-network-ucx
  Loading requirement: cray-ucx/default
[WARN] file:STASHC: skip missing optional source: namelist:exclude_package(:)
[WARN] file:ATMOSCNTL: skip missing optional source: namelist:jules_urban2t_param
[WARN] file:RECONA: skip missing optional source: namelist:trans(:)
[WARN] file:IDEALISE: skip missing optional source: namelist:idealised
[WARN] file:RECONA: skip missing optional source: namelist:ideal_free_tracer(:)
[WARN] file:IOSCNTL: skip missing optional source: namelist:lustre_control
[WARN] file:IOSCNTL: skip missing optional source: namelist:lustre_control_custom_files
[WARN] file:RECONA: skip missing optional source: namelist:recon_idealised
[WARN] file:SHARED: skip missing optional source: namelist:jules_urban_switches
Wed Mar 17 06:06:05 2021: [PE_896]:inet_connect:socket error state No route to host
Wed Mar 17 06:06:05 2021: [PE_896]:_pmi_inet_setup:inet_connect failed
Wed Mar 17 06:06:08 2021: [PE_896]:inet_connect:socket error state No route to host
Wed Mar 17 06:06:08 2021: [PE_896]:_pmi_inet_setup:inet_connect failed
Wed Mar 17 06:06:12 2021: [PE_896]:inet_connect:socket error state No route to host
Wed Mar 17 06:06:12 2021: [PE_896]:_pmi_inet_setup:inet_connect failed
Wed Mar 17 06:06:12 2021: [PE_896]:_pmi_init:_pmi_inet_setup (full) returned -1
MPICH ERROR [Rank 0] [job id unknown] [Wed Mar 17 06:06:12 2021] [unknown] [nid001383] - Abort(591119) (rank 0 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(632):
MPID_Init(286).......:  PMI2 init failed: 1

aborting job:
Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(632):
MPID_Init(286).......:  PMI2 init failed: 1 

srun: error: nid001383: task 896: Exited with exit code 255
srun: Terminating job step 156839.0
slurmstepd: error: *** STEP 156839.0 ON nid001054 CANCELLED AT 2021-03-17T06:06:12 ***
srun: error: nid001383: tasks 897-1023: Terminated
srun: error: nid001059: tasks 640-767: Terminated
srun: error: nid001055: tasks 128-255: Terminated
srun: error: nid001057: tasks 384-511: Terminated
srun: error: nid001056: tasks 256-383: Terminated
srun: error: nid001054: tasks 0-127: Terminated
srun: error: nid001058: tasks 512-639: Terminated
srun: error: nid001087: tasks 768-895: Terminated
srun: error: nid001386: tasks 1024-1151: Terminated
srun: Force Terminated job step 156839.0
[FAIL] um-atmos # return-code=143
Received signal ERR
cylc (scheduler - 2021-03-17T06:06:13Z): CRITICAL Task job script received signal ERR at 2021-03-17T06:06:13Z
cylc (scheduler - 2021-03-17T06:06:13Z): CRITICAL failed at 2021-03-17T06:06:13Z

comment:4 Changed 7 months ago by annette

Patrick,

These errors look like a network problem. Archer2 had a maintenance session on Thursday 18 March (after your run) to fix some network issues, so I think you should re-submit the run and see if it works now.

Annette

comment:5 Changed 7 months ago by pmcguire

Hi Annette:
After that network problem, I reran the suite with your new modulefile changes for fixing the out-of-memory (OOM) problem, described above. It turns out that it still hits the OOM problem when I rerun it. See:
~pmcguire/cylc-run/u-cc629long/log/job/19880901T0000Z/atmos_main/05/job.err

I have also tried rerunning the whole suite, recompiling from scratch, as a new suite:
~pmcguire/cylc-run/u-cc629long4
But that doesn't seem to have helped.
Any suggestions for how to fix this?
Patrick

comment:6 Changed 6 months ago by annette

  • Owner changed from um_support to annette
  • Status changed from new to assigned

Hi Patrick,

I think you just need to give the model more memory by increasing the number of nodes.

Did you run this model on Archer? If so, you can work out the memory that you used there, and then run something equivalent on Archer2.

For example, a 48x24 MPI decomposition on Archer using 24 cores/node would run on 48 nodes giving a total of 3072 GB. The same decomposition on Archer2 with 128 cores/node runs on 9 nodes for a total of 2304 GB. You got 2.7 GB memory/core on Archer, but only 2 GB memory/core on Archer2.

Another thing to try is OpenMP threads. If you used 2 threads but kept the MPI decomposition the same, you would double the available memory per MPI task.
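
As a rough sketch of that arithmetic (assuming 64 GB per ARCHER node and 256 GB per standard ARCHER2 node, which is what the totals above imply):

    # 48x24 decomposition = 1152 MPI tasks
    NTASKS=$((48 * 24))
    # ARCHER: 24 cores/node, 64 GB/node
    echo "ARCHER:  $((NTASKS / 24)) nodes, $((NTASKS / 24 * 64)) GB total"    # 48 nodes, 3072 GB, ~2.7 GB/task
    # ARCHER2: 128 cores/node, 256 GB/node
    echo "ARCHER2: $((NTASKS / 128)) nodes, $((NTASKS / 128 * 256)) GB total" # 9 nodes, 2304 GB, 2 GB/task
    # With 2 OpenMP threads per task the same decomposition needs twice the cores,
    # i.e. 18 ARCHER2 nodes, doubling the memory available per MPI task to 4 GB.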

Annette

comment:7 Changed 6 months ago by pmcguire

Hi Annette:
Thanks!
I am trying again now. I have changed OMPTHR_ATM and OMPTHR_RCF both from 1 to 2. The atmos_main app for u-cc629long has been retriggered and is now submitted and pending.
Patrick

comment:8 Changed 6 months ago by pmcguire

Hi Annette:
After changing OMPTHR_ATM and OMPTHR_RCF both from 1 to 2, I resubmitted, and it ended with an out-of-memory (OOM) kill again. This time (~pmcguire/cylc-run/u-cc629long/log/job/19880901T0000Z/atmos_main/09/job.out) it got to 1919 timesteps, whereas the furthest it previously went without crashing (with OMPTHR_ATM=1; see ~pmcguire/cylc-run/u-cc629long/log/job/19880901T0000Z/atmos_main/03/job.out) was almost exactly half of that, at 959 timesteps.

If it crashes after twice as many timesteps once OMPTHR_ATM is doubled (which, as you suggest, doubles the available memory), that could be indicative of a memory leak: maybe the memory usage grows linearly with the number of timesteps.

To answer your previous question, yes, I did get this model to run previously on ARCHER. And yes, it was with a 48x24 MPI decomposition on ARCHER.

I thought that your second option, doubling the OpenMP threads, would double the available memory. I guess that depopulating the nodes (your first suggestion?) by running the same MPI decomposition over 18 nodes instead of 9 would also effectively double the memory available per core (with OpenMP threads left at 1). So I guess those two setups might reach an equivalent number of timesteps.

I guess I could try doubling the memory per processor again (quadrupling it in total).
I don't know if it is better or more economical to try 48x24 with (a) 18 nodes and OMPTHR_ATM=2, or (b) 36 nodes and OMPTHR_ATM=1. I will try option (a) first, right now.
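
For reference, a quick back-of-the-envelope comparison of the two options (again assuming 256 GB per ARCHER2 node; this is just the memory arithmetic, not a statement about cost or speed):

    NTASKS=$((48 * 24))                                                    # 1152 MPI tasks
    # (a) 18 nodes, OMPTHR_ATM=2: nodes fully populated
    echo "(a) $((NTASKS / 18)) tasks/node, $((256 * 18 / NTASKS)) GB/task" # 64 tasks/node, 4 GB/task
    # (b) 36 nodes, OMPTHR_ATM=1: nodes half populated
    echo "(b) $((NTASKS / 36)) tasks/node, $((256 * 36 / NTASKS)) GB/task" # 32 tasks/node, 8 GB/task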
Patrick

Last edited 6 months ago by pmcguire

comment:9 Changed 6 months ago by pmcguire

Hi Annette:
Option (a) 18 nodes (instead of 9 nodes) and OMPTHR_ATM=OMPTHR_RCF=2 (instead of 1)
seems to have worked! Thank you!
It ran successfully (without OOM'ing/crashing) through a 1 month test run.
I don't know if this setup is ideal or economical though.
The suite name is u-cc629long.

I am trying a 20.5-year run now (u-cc629longer), with resubmission every 1.5 years (18 months). I don't know if it can make it through an 18-month cycle or not.
Patrick

comment:10 Changed 6 months ago by annette

Hi Patrick,

I'm glad you have something working.

My response before probably wasn't very clear, but I was suggesting you try option a) 48x24 with 2 OMP threads on 18 nodes.

Basically you need to increase the number of nodes to increase the total memory available to the model, but there are different strategies (sketched in Slurm terms after this list):

  • Depopulate the nodes (use fewer cores/node)
  • Use OpenMP threads (this should be more efficient than just de-populating as you are still using the cores for something).
  • Use more MPI tasks.
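
As an illustration only (the suite sets these through Rose/Cylc variables rather than hand-written batch headers, and the memory figures assume 256 GB per 128-core ARCHER2 node), here is roughly what the strategies mean in Slurm terms for the 48x24 = 1152-task decomposition:

    #!/bin/bash
    # Strategy: 2 OpenMP threads per MPI task -> 64 tasks/node on 18 nodes, ~4 GB/task.
    #SBATCH --nodes=18
    #SBATCH --ntasks-per-node=64
    #SBATCH --cpus-per-task=2

    export OMP_NUM_THREADS=2
    # Depopulating instead would keep --cpus-per-task=1 and simply leave half the
    # cores idle (same 18 nodes, same 4 GB/task); using more MPI tasks would mean a
    # larger decomposition (e.g. 48x48) so that each task holds a smaller subdomain.
    srun --hint=nomultithread --distribution=block:block um-atmos.exe  # placeholder executable name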

All you can really do is experiment until you have something that works. It's pretty reasonable to use 2 OpenMP threads though; most configurations are set up like this. And if you want to use the IO server you will need at least 2 threads.

Whether the setup you have now will work for a longer run, I don't know. Did you run 18-month cycles on Archer?

We have had a lot of trouble with OOMs with the N1280 model, on Archer as well as Archer2. It could well be that there are memory inefficiencies in the model but we have never got to the bottom of it.

Annette

comment:11 Changed 6 months ago by pmcguire

Hi Annette:
Thanks! Option a) did indeed also work for an 18-month cycle.
It's on its 2nd 18-month cycle now.

I think it might now be about twice as fast as it was on ARCHER. Using 2 OMP threads is probably largely responsible for that.
Patrick

comment:12 Changed 6 months ago by annette

  • Resolution set to fixed
  • Status changed from assigned to closed

Hi Patrick,

That's great, and suggests you are running an efficient configuration too. I will close the ticket now.

Annette

comment:13 Changed 6 months ago by pmcguire

Hi Annette:
Thanks for all of your help with this.

But maybe this ticket was closed a bit early?

It crashed with an out-of-memory (OOM) kill during the 2nd 18-month cycle, about 3.5 months in, after the 61919th timestep.

I am not sure what was different from the 1st 18-month cycle, which was successful.
Patrick

comment:14 Changed 6 months ago by annette

  • Resolution fixed deleted
  • Status changed from closed to reopened

Hi Patrick,

You can print the memory usage of the model as it runs; however, actually diagnosing and fixing a memory problem could be tricky. So you may decide just to shorten the cycle length, or increase the number of nodes again, so that your runs complete.

There is an option in the model, print_mem_info, which at each timestep reports the node using the most memory and how much memory it is using.

I have a different version of this code in a branch:

branches/dev/annetteosprey/vn11.5_memory_info_all_nodes@96221

This prints the memory usage of all nodes, but only once per model day. You need to have the print_mem_info switch turned on to use this.
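
Independently of the UM's own diagnostics, Slurm's accounting gives a coarse view of the peak memory it recorded for a job step (a sketch; replace <jobid> with the Slurm job ID of the failed atmos_main submission, which appears in the srun/slurmstepd messages in job.err):

    # MaxRSS is the maximum resident set size recorded across the tasks in each step
    sacct -j <jobid> --format=JobID,JobName,NNodes,Elapsed,MaxRSS,State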

Annette

comment:15 Changed 6 months ago by pmcguire

Hi Annette:
With 2 OMP threads and 18 nodes, it seems to crash with OOM at somewhat random times, whether I run 18-month cycles or 3-month cycles. It makes it through more 3-month cycles without OOMing than it does 18-month cycles.

I am now trying 2 OMP threads and 36 nodes, up from 18 nodes.

Thanks for the advice about printing the memory info! I hope to try that sometime soon, especially if 36 nodes doesn't work much more robustly than 18 nodes.
Patrick
