Opened 3 months ago

Closed 3 months ago

#2429 closed help (fixed)

Intermittent segmentation fault in UKV model

Reported by: nx902220
Owned by: willie
Priority: normal
Component: UM Model
Keywords: STASH, segmentation fault
Cc:
Platform: Monsoon2
UM Version: 10.5

Description

I'm trying what you suggested in #2387 comment 13. It is taking a long time because it keeps failing at ukv_um_fcst with the segmentation fault we saw before. I keep re-triggering the task; in the past it would work after 4 re-triggers, but I am now on my sixth. Each time I re-trigger, the job queues all day and does not run until the night, so this has taken me a week so far. Is there a way of making this quicker?

Change History (4)

comment:1 Changed 3 months ago by willie

Just to summarise the error: we're getting a segmentation fault at the 60th time step, after outputting some STASH:

[71] exceptions: An exception was raised:11 (Segmentation fault)
[71] exceptions: the exception reports the extra information: Address not mapped to object.
[71] exceptions: whilst in a serial region
[71] exceptions: Task had pid=1791 on host nid00877
[71] exceptions: Program is "/home/d04/lblunn/cylc-run/u-at199/share/fcm_make/build-atmos/bin/um-atmos.exe"
[71] exceptions: calling registered handler @ 0x20019d80
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[71] exceptions: Done callbacks
[71] exceptions: *** GLIBC ***
[71] exceptions: Data address (si_addr): 0x10013a01000; rip: 0x24630940
[71] exceptions: [backtrace]: has   6 elements:
[71] exceptions: [backtrace]: (  1) : Address: [0x24630940] 
[71] exceptions: [backtrace]: (  1) : __cray_dcopy_HSW (* Cannot Locate *)
[71] exceptions: [backtrace]: (  2) : Address: [0x2001c5ca] 
[71] exceptions: [backtrace]: (  2) : signal_do_backtrace_linux in file /home/d04/lblunn/cylc-run/u-at199/share/fcm_make/preprocess-atmos/src/um/src/control/c_code/exceptions/exceptions-platform/exceptions-linux.c line 78
[71] exceptions: [backtrace]: (  3) : Address: [0x2001a73b] 
[71] exceptions: [backtrace]: (  3) : signal_do_backtrace in file /home/d04/lblunn/cylc-run/u-at199/share/fcm_make/preprocess-atmos/src/um/src/control/c_code/exceptions/exceptions.c line 270
[71] exceptions: [backtrace]: (  4) : Address: [0x2001ae37] 
[71] exceptions: [backtrace]: (  4) : signal_handler in file /home/d04/lblunn/cylc-run/u-at199/share/fcm_make/preprocess-atmos/src/um/src/control/c_code/exceptions/exceptions.c line 672
[71] exceptions: [backtrace]: (  5) : Address: [0x23646e70] 
[71] exceptions: [backtrace]: (  5) : __restore_rt in file sigaction.c line 672
[71] exceptions: [backtrace]: (  6) : Address: [0x24630940] 
[71] exceptions: [backtrace]: (  6) : __cray_dcopy_HSW (* Cannot Locate *)

This has been occurring intermittently, and we've worked around it simply by re-triggering the ukv_um_fcst task.

I had hoped that giving the task more processors (we've gone from a total of 288 to 360) would make this type of error less likely, but that clearly has not happened.
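For anyone triaging similar job logs, the backtrace above can be reduced to a short list of frames. The following is a minimal sketch in Python, assuming the log lines follow exactly the layout shown in this ticket; the function name `parse_backtrace` and the `SAMPLE` variable are illustrative and not part of the UM.

```python
import re

# Matches the symbol lines of the exception handler's backtrace, e.g.
# "[71] exceptions: [backtrace]: (  5) : __restore_rt in file sigaction.c line 672"
# The "Address:" lines and the "has N elements" line are deliberately skipped.
FRAME_RE = re.compile(
    r"^\[(?P<rank>\d+)\] exceptions: \[backtrace\]: \(\s*(?P<depth>\d+)\) : "
    r"(?!Address:)(?P<symbol>\S+)"
    r"(?: in file (?P<file>\S+) line (?P<line>\d+))?"
)


def parse_backtrace(log_text):
    """Return one dict per backtrace frame found in log_text."""
    frames = []
    for raw in log_text.splitlines():
        match = FRAME_RE.match(raw.strip())
        if match is None:
            continue
        frames.append({
            "rank": int(match["rank"]),
            "depth": int(match["depth"]),
            "symbol": match["symbol"],
            # file/line are None when the handler could not locate the source
            "file": match["file"],
            "line": int(match["line"]) if match["line"] else None,
        })
    return frames


# A few lines copied from the log in this ticket.
SAMPLE = """
[71] exceptions: [backtrace]: has   6 elements:
[71] exceptions: [backtrace]: (  1) : Address: [0x24630940]
[71] exceptions: [backtrace]: (  1) : __cray_dcopy_HSW (* Cannot Locate *)
[71] exceptions: [backtrace]: (  4) : Address: [0x2001ae37]
[71] exceptions: [backtrace]: (  4) : signal_handler in file /home/d04/lblunn/cylc-run/u-at199/share/fcm_make/preprocess-atmos/src/um/src/control/c_code/exceptions/exceptions.c line 672
[71] exceptions: [backtrace]: (  5) : __restore_rt in file sigaction.c line 672
"""

if __name__ == "__main__":
    for frame in parse_backtrace(SAMPLE):
        print(frame["depth"], frame["symbol"], frame["file"], frame["line"])
```

In this log, frames 2 to 5 belong to the signal handler itself, so the informative entry is the faulting symbol `__cray_dcopy_HSW`, which appears as frames 1 and 6.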

Last edited 3 months ago by willie

comment:2 Changed 3 months ago by willie

Hi Lewis,

I tried the 'safe' and 'debug' optimisation levels and the same error appeared in each case. The debug build did provide helpful information after the segmentation fault at 120 time steps.

The problem appears to occur in diagnostics.F90 (line 336), when copying STASH section 13, item 192 (SMAG: S, the shear term), while space is being allocated for diagnostics in atm_step_4A.F90.

I tried switching this STASH item off and the run completed first time, both at 'debug' level and at 'high' optimisation.

So if you can do without this diagnostic, it would make running easier.
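To find the request before switching it off, something like the following sketch could be used. It assumes the STASH requests live in the app's Rose configuration as `namelist:umstash_streq(...)` sections with `isec` and `item` keys; that layout, the function name `find_stash_requests` and the sample text are assumptions for illustration, so check them against your own suite.

```python
def find_stash_requests(conf_text, section, item):
    """Return the headers of config sections whose isec/item match."""
    sections = []  # list of (header, {key: value}) pairs in file order
    for raw in conf_text.splitlines():
        line = raw.strip()
        if line.startswith("[") and line.endswith("]"):
            # New section header, e.g. [namelist:umstash_streq(...)]
            sections.append((line[1:-1], {}))
        elif "=" in line and sections:
            key, _, value = line.partition("=")
            sections[-1][1][key.strip()] = value.strip()
    return [
        header
        for header, values in sections
        if values.get("isec") == str(section) and values.get("item") == str(item)
    ]


# Hypothetical configuration text, not taken from the suite in this ticket.
SAMPLE_CONF = """
[namelist:umstash_streq(example_a)]
dom_name='DIAG'
isec=13
item=192

[namelist:umstash_streq(example_b)]
dom_name='DIAG'
isec=0
item=2
"""

if __name__ == "__main__":
    for header in find_stash_requests(SAMPLE_CONF, 13, 192):
        print(header)
```

Any section it reports is a candidate to disable through the usual STASH panel or configuration editor; the sketch only reads the text and changes nothing.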

To go further than this and fully identify the problem will take a lot more effort.

Regards
Willie

comment:3 Changed 3 months ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to assigned
  • UM Version set to 10.5

comment:4 Changed 3 months ago by willie

  • Keywords STASH, segmentation fault added
  • Resolution set to fixed
  • Status changed from assigned to closed