Opened 2 years ago

Closed 2 years ago

#2202 closed help (answered)

OOM killer during reconfiguration at UM10.7

Reported by: toddj
Owned by: um_support
Component: UM Model
Keywords: OOM killer recon lib-4205
Cc:
Platform: ARCHER
UM Version: 10.7

Description

Suite u-an240@43240 fails during reconfiguration, exiting with either an OOM killer message or this:

lib-4205 : UNRECOVERABLE library error 
  The program was unable to request more memory space.

In this suite, I am quadrupling the horizontal domain area of the idealised model relative to a previous suite, which used 24 cores: (1296x1296 -> 2592x2592)x140.
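
For scale, the grid arithmetic can be sketched as below. This is a rough estimate only: the 8 bytes per value (64-bit reals) is an assumption not stated in this ticket, and the helper names are made up for illustration.

```python
# Grid dimensions quoted in the ticket: (nx, ny, levels).
OLD_GRID = (1296, 1296, 140)
NEW_GRID = (2592, 2592, 140)

# Assumption: 8 bytes per value (64-bit reals); not stated in the ticket.
BYTES_PER_VALUE = 8


def grid_points(grid):
    """Total number of points in an (nx, ny, levels) grid."""
    nx, ny, nz = grid
    return nx * ny * nz


def field_gib(grid, bytes_per_value=BYTES_PER_VALUE):
    """Size in GiB of one full 3D field held on the grid."""
    return grid_points(grid) * bytes_per_value / 2**30


area_ratio = (NEW_GRID[0] * NEW_GRID[1]) / (OLD_GRID[0] * OLD_GRID[1])
print(f"horizontal area ratio: {area_ratio:.0f}")
print(f"points in new grid:    {grid_points(NEW_GRID):,}")
print(f"one 3D field, old:     {field_gib(OLD_GRID):.2f} GiB")
print(f"one 3D field, new:     {field_gib(NEW_GRID):.2f} GiB")
```

Under that assumption, a single full 3D field on the new grid is roughly 7 GiB, so any task that has to hold a few whole fields at once would need a large share of a node's memory.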

Using 1 or 4 nodes gives the OOM killer.
Using 8 nodes gives the lib-4205.

In both cases, the .astart file reaches its maximum size (93G, which is approximately what I am expecting) in under 5 minutes. Nothing further is produced, and the recon step fails about 10 minutes later.

Please let me know of any suggested fixes or if more information would be helpful. I look forward to your response.

Change History (3)

comment:1 Changed 2 years ago by grenville

Todd

Try running with a 6x8 or 12x8 decomposition.

Grenville

comment:2 Changed 2 years ago by toddj

Using 2 nodes (6x8) and 4 nodes (12x8 or 8x12): OOM killer

Using 1 large memory node (4x6): OOM killer

However, I did find one configuration that works: 8 nodes with a 12x8 decomposition (96 MPI tasks in total), which gives 12 MPI tasks per node and 6 tasks per NUMA region. With this, the reconfiguration completes successfully in 17.5 minutes.
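
To compare the configurations tried in this comment, here is a small sketch of the per-node and per-task arithmetic. The node figures (2 NUMA regions per node, 64 GB on a standard node, 128 GB on a large memory node) are my assumptions about ARCHER and are not taken from this ticket; `summarise` is a made-up helper.

```python
# Assumed ARCHER node layout (not stated in the ticket; adjust as needed).
NUMA_REGIONS_PER_NODE = 2
STANDARD_NODE_GB = 64
LARGE_NODE_GB = 128

# (label, nodes, GB per node, decomposition, reported outcome), from the ticket.
CONFIGS = [
    ("2 nodes, 6x8", 2, STANDARD_NODE_GB, (6, 8), "OOM killer"),
    ("4 nodes, 12x8", 4, STANDARD_NODE_GB, (12, 8), "OOM killer"),
    ("1 large memory node, 4x6", 1, LARGE_NODE_GB, (4, 6), "OOM killer"),
    ("8 nodes, 12x8", 8, STANDARD_NODE_GB, (12, 8), "completed"),
]


def summarise(nodes, gb_per_node, decomposition):
    """Return task counts and memory shares for one configuration."""
    px, py = decomposition
    tasks = px * py
    tasks_per_node = tasks / nodes
    return {
        "tasks": tasks,
        "tasks_per_node": tasks_per_node,
        "tasks_per_numa": tasks_per_node / NUMA_REGIONS_PER_NODE,
        "gb_per_task": gb_per_node / tasks_per_node,
        "total_gb": nodes * gb_per_node,
    }


for label, nodes, gb, decomp, outcome in CONFIGS:
    s = summarise(nodes, gb, decomp)
    print(f"{label:26s} tasks={s['tasks']:3d} per_node={s['tasks_per_node']:4.0f} "
          f"per_numa={s['tasks_per_numa']:3.0f} GB/task={s['gb_per_task']:5.2f} "
          f"total_GB={s['total_gb']:4d}  -> {outcome}")
```

Under these assumptions, the successful run is the only one listed that combines the largest aggregate memory with half-populated nodes. That is arithmetic on the reported configurations, not a measurement of what the reconfiguration actually needs.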

Does this working configuration tell us anything about the memory needs of this computation, or whether it could be done more efficiently? I'd like to keep doing reconfiguration in the short queue if I can.

comment:3 Changed 2 years ago by grenville

  • Resolution set to answered
  • Status changed from new to closed

Todd

There will be cases where memory requirements force you out of the short queue.

Grenville
