Opened 3 months ago

Closed 2 months ago

#3269 closed help (fixed)

Address not mapped to object ERROR - is it the STASH?

Reported by: mvguarino
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform:
UM Version:

Description

Hello,

My suite u-bt694, running on MONSOON, is failing with an error (below) that I have never encountered before.

The suite is a copy of the CMIP6 Hist member 1 (HadGEM3-GC3.1) suite u-bg466, with the following two differences:
- The source code for the UM is different. A few weeks ago I tested the code changes by running a 1-month simulation on ARCHER with the same model, so I have ruled this out as the cause (fcm_make_um also runs fine).
- The STASH is different: I have added a few more STASH items from section 6.

I have looked through previous tickets with possibly similar problems, and I have convinced myself that the new STASH items might be the cause (see, for example, http://cms.ncas.ac.uk/ticket/25300).

However, I could not identify what is wrong with my STASH, or whether there is indeed a mismatch between time_profiles and variable availability (I have selected variables that are available on all time steps).

As I am stumbling around in the dark with this, any guidance on how to proceed from here would be much appreciated.

Thank you,

Vittoria

[973] exceptions: An exception was raised:11 (Segmentation fault)
[973] exceptions: the exception reports the extra information: Address not mapped to object.
[973] exceptions: whilst in a serial region
[973] exceptions: Task had pid=58998 on host nid04001
[973] exceptions: Program is "./atmos.exe"
[1005] exceptions: An exception was raised:11 (Segmentation fault)
[1005] exceptions: the exception reports the extra information: Sent by the kernel.
[1005] exceptions: whilst in a serial region
[1005] exceptions: Task had pid=41550 on host nid04045
[1005] exceptions: Program is "./atmos.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[973] exceptions: Data address (si_addr): 0xffffffffffffffff; rip: 0x030b9a13
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[1005] exceptions: Data address (si_addr): 0x00000000; rip: 0x030b55aa
[973] exceptions: [backtrace]: has  19 elements:
[973] exceptions: [backtrace]: (  1) : Address: [0x030b9a13] 
[973] exceptions: [backtrace]: (  1) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: (  2) : Address: [0x004199ca] 
[973] exceptions: [backtrace]: (  2) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: (  3) : Address: [0x00417a3b] 
[973] exceptions: [backtrace]: (  3) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: (  4) : Address: [0x00418137] 
[973] exceptions: [backtrace]: (  4) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: (  5) : Address: [0x026acc70] 
[973] exceptions: [backtrace]: (  5) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: (  6) : Address: [0x030b9a13] 
[973] exceptions: [backtrace]: (  6) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: (  7) : Address: [0x030b9c80] 
[973] exceptions: [backtrace]: (  7) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: (  8) : Address: [0x03155bf2] 
[973] exceptions: [backtrace]: (  8) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: (  9) : Address: [0x026f8172] 
[973] exceptions: [backtrace]: (  9) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: ( 10) : Address: [0x0197a299] 
[973] exceptions: [backtrace]: ( 10) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: ( 11) : Address: [0x019640e7] 
[973] exceptions: [backtrace]: ( 11) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: ( 12) : Address: [0x01587463] 
[973] exceptions: [backtrace]: ( 12) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: ( 13) : Address: [0x00b6ccf6] 
[973] exceptions: [backtrace]: ( 13) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: ( 14) : Address: [0x004515b7] 
[973] exceptions: [backtrace]: ( 14) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: ( 15) : Address: [0x00411388] 
[973] exceptions: [backtrace]: ( 15) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: ( 16) : Address: [0x004084c8] 
[973] exceptions: [backtrace]: ( 16) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: ( 17) : Address: [0x004084c8] 
[973] exceptions: [backtrace]: ( 17) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: ( 18) : Address: [0x030bdce1] 
[973] exceptions: [backtrace]: ( 18) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: [backtrace]: ( 19) : Address: [0x00408229] 
[973] exceptions: [backtrace]: ( 19) : line information unavailable error code: 4 (Program name & path is not absolute)
[973] exceptions: 
[973] exceptions: To find the source line for an entry in the backtrace;
[973] exceptions: run addr2line --exe=</path/too/executable> <address>
[973] exceptions: where address is given as [0x<address>] above
[973] exceptions: 
[1005] exceptions: [backtrace]: has  23 elements:
[1005] exceptions: [backtrace]: (  1) : Address: [0x030b55aa] 
[1005] exceptions: [backtrace]: (  1) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: (  2) : Address: [0x004199ca] 
[1005] exceptions: [backtrace]: (  2) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: (  3) : Address: [0x00417a3b] 
[1005] exceptions: [backtrace]: (  3) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: (  4) : Address: [0x00418137] 
[1005] exceptions: [backtrace]: (  4) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: (  5) : Address: [0x026acc70] 
[1005] exceptions: [backtrace]: (  5) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: (  6) : Address: [0x030b55aa] 
[1005] exceptions: [backtrace]: (  6) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: (  7) : Address: [0x030b9890] 
[1005] exceptions: [backtrace]: (  7) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: (  8) : Address: [0x0315595e] 
[1005] exceptions: [backtrace]: (  8) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: (  9) : Address: [0x02cde439] 
[1005] exceptions: [backtrace]: (  9) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: ( 10) : Address: [0x02cf7795] 
[1005] exceptions: [backtrace]: ( 10) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: ( 11) : Address: [0x02c46c92] 
[1005] exceptions: [backtrace]: ( 11) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: ( 12) : Address: [0x02c5bf5d] 
[1005] exceptions: [backtrace]: ( 12) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: ( 13) : Address: [0x02c5c584] 
[1005] exceptions: [backtrace]: ( 13) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: ( 14) : Address: [0x02c5e545] 
[1005] exceptions: [backtrace]: ( 14) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: ( 15) : Address: [0x02c47597] 
[1005] exceptions: [backtrace]: ( 15) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: ( 16) : Address: [0x02c111b1] 
[1005] exceptions: [backtrace]: ( 16) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: ( 17) : Address: [0x02c115a8] 
[1005] exceptions: [backtrace]: ( 17) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: ( 18) : Address: [0x02bd891b] 
[1005] exceptions: [backtrace]: ( 18) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: ( 19) : Address: [0x023dc1e3] 
[1005] exceptions: [backtrace]: ( 19) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: ( 20) : Address: [0x00600711] 
[1005] exceptions: [backtrace]: ( 20) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: ( 21) : Address: [0x005edf84] 
[1005] exceptions: [backtrace]: ( 21) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: ( 22) : Address: [0x005eda47] 
[1005] exceptions: [backtrace]: ( 22) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: [backtrace]: ( 23) : Address: [0x7fffff5c3e60] 
[1005] exceptions: [backtrace]: ( 23) : line information unavailable error code: 4 (Program name & path is not absolute)
[1005] exceptions: 
[1005] exceptions: To find the source line for an entry in the backtrace;
[1005] exceptions: run addr2line --exe=</path/too/executable> <address>
[1005] exceptions: where address is given as [0x<address>] above
[1005] exceptions: 
[NID 04001] 2020-05-18 14:00:30 Apid 105947214: initiated application termination
[FAIL] run_model # return-code=137
2020-05-18T14:00:35Z CRITICAL - failed/EXIT
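
The backtrace above lists only raw addresses, apparently because the executable was launched as ./atmos.exe ("Program name & path is not absolute"). As the log itself suggests, addr2line can translate them, provided the executable was built with debugging information. The following is a minimal Python sketch, assuming the job output has been saved as text. The helper names and the path /path/to/atmos.exe are placeholders for illustration and are not part of the UM or the suite.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [973] exceptions: [backtrace]: (  1) : Address: [0x030b9a13]
FRAME_RE = re.compile(
    r"^\[(\d+)\] exceptions: \[backtrace\]: \(\s*\d+\) : Address: \[(0x[0-9a-fA-F]+)\]"
)


def backtrace_addresses(log_text):
    """Return {rank: [address, ...]} in frame order, taken from the log text."""
    frames = defaultdict(list)
    for line in log_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            rank, address = match.groups()
            frames[int(rank)].append(address)
    return dict(frames)


def addr2line_command(executable, addresses):
    """Build one addr2line invocation (as an argument list) for all addresses."""
    return ["addr2line", "--exe=" + executable, *addresses]


# A few lines copied from the log above.
sample = "\n".join([
    "[973] exceptions: [backtrace]: (  1) : Address: [0x030b9a13]",
    "[973] exceptions: [backtrace]: (  1) : line information unavailable error code: 4",
    "[973] exceptions: [backtrace]: (  2) : Address: [0x004199ca]",
    "[1005] exceptions: [backtrace]: (  1) : Address: [0x030b55aa]",
])
addresses = backtrace_addresses(sample)

# Placeholder path: substitute the absolute path of the real executable.
print(" ".join(addr2line_command("/path/to/atmos.exe", addresses[973])))
```

The resulting argument list can be passed to subprocess.run on a machine where the executable and addr2line are both available.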

Change History (10)

comment:1 Changed 3 months ago by jeff

Hi Vittoria

I ran your suite with PRINT_STATUS set to "Extra diagnostic messages" and there was a lot of output like this:

EasyAerosol level 1: (max, min, mean) Band: *          -Infinity          Infinity               NaN

I'm assuming this isn't normal behaviour. Does this help with narrowing down the problem?

Jeff.

comment:2 Changed 3 months ago by mvguarino

Hi Jeff,

Thanks for taking a look.
I think this tells me that the error is not caused by my STASH modifications, but I have not touched anything else (and in particular I have never dealt with the aerosol scheme) so I am still clueless.

Could this be related to the ozone task? I had some difficulty getting that running, but in the end I managed.

I am also re-running now with extra diagnostic messages; I hadn't thought of that.

Vittoria

comment:3 Changed 3 months ago by jeff

Hi Vittoria

It looks like the problem with NaNs isn't the reason the job crashes. It is a small bug which causes the UM to print out rubbish data, but the actual data is OK.

If you want to fix this, the problem is in your branch vn10.7_OGWD_seaice_CMIP6, in file easyaerosol_read_input_mod.F90 at line number 1555. Change

          DO k = 1, dimsize(4)

to

          DO k = 1, dimsize(3)

The search for the actual problem goes on.

Jeff.

comment:4 Changed 3 months ago by mvguarino

Hi Jeff,

Thank you, I have fixed the bug in my code version.

I have run a copy of the suite without my STASH modifications (u-bu689) and that ran fine…
So the problem seems to be the STASH indeed, but I still cannot see what is wrong with it.

Vittoria

comment:5 Changed 3 months ago by mvguarino

Hi Jeff,

I have identified the variables that caused the model to crash:
STASH item 235 (X-component of surface SSO stress)
STASH item 236 (Y-component of surface SSO stress)
both from section 6.

I am still not clear what the problem was. I tried different combinations of domain and time profiles, but nothing worked. Eventually I decided to leave these items out since, according to the variable description, the output of this diagnostic is identical to STASH items 201/202 at model level 1.

The simulation ran fine for the first few cycles. It is now failing because of a mismatch between the files that RETRIEVE_OZONE.19510101T0000Z expects to find in MASS and what postproc_atmos archived at the end of the previous cycle (1950100101T0000Z):

[ERROR] Only 11 file[s] available for 1950.  Expected 12

Indeed, postproc archived .pp files only from jan1950 to nov1950:

mvguar@xcslc0:~> moo ls :/crum/u-bt694/ap4.pp
moose:/crum/u-bt694/ap4.pp/bt694a.p41950apr.pp
moose:/crum/u-bt694/ap4.pp/bt694a.p41950aug.pp
moose:/crum/u-bt694/ap4.pp/bt694a.p41950feb.pp
moose:/crum/u-bt694/ap4.pp/bt694a.p41950jan.pp
moose:/crum/u-bt694/ap4.pp/bt694a.p41950jul.pp
moose:/crum/u-bt694/ap4.pp/bt694a.p41950jun.pp
moose:/crum/u-bt694/ap4.pp/bt694a.p41950mar.pp
moose:/crum/u-bt694/ap4.pp/bt694a.p41950may.pp
moose:/crum/u-bt694/ap4.pp/bt694a.p41950nov.pp
moose:/crum/u-bt694/ap4.pp/bt694a.p41950oct.pp
moose:/crum/u-bt694/ap4.pp/bt694a.p41950sep.pp

leaving out dec1950, although the December data is available in share/data/History_Data.
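
The gap in the listing above can also be found mechanically. This is a minimal Python sketch that assumes the file names end in a four-digit year, a three-letter month and ".pp", as in the moo ls output shown; the function name missing_months is made up for illustration and is not part of postproc or MOOSE.

```python
import re

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Matches the trailing "<year><mon>.pp" of names such as bt694a.p41950apr.pp
NAME_RE = re.compile(r"(\d{4})([a-z]{3})\.pp$")


def missing_months(listing, year):
    """Return the month abbreviations with no .pp file for `year` in the listing."""
    found = set()
    for line in listing.splitlines():
        match = NAME_RE.search(line.strip())
        if match and int(match.group(1)) == year and match.group(2) in MONTHS:
            found.add(match.group(2))
    return [month for month in MONTHS if month not in found]


# The eleven files reported by "moo ls" above.
listing = "\n".join(
    "moose:/crum/u-bt694/ap4.pp/bt694a.p41950%s.pp" % month
    for month in ["apr", "aug", "feb", "jan", "jul", "jun",
                  "mar", "may", "nov", "oct", "sep"]
)
print(missing_months(listing, 1950))
```

For the listing above this reports December as the only month without an archived file.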

How can I fix this mismatch so that all 12 monthly files are on MASS when the next cycle starts?

Thank you,

Vittoria

comment:6 Changed 3 months ago by mvguarino

Hello,

Could someone advise on the issue reported above with postproc?
Let me know if I have to open a separate ticket for it.

Many thanks,

Vittoria

comment:7 Changed 3 months ago by jeff

Hi Vittoria

Concerning the original problem, I think I have tracked down what is wrong: the data array for STASH item 6,236 is passed to another subroutine with incorrect bounds, which causes memory problems and the segmentation fault. I need to check whether this is still a problem in the latest version and let the Met Office know if it is.

Concerning the new problem, the UM won't archive the dec file until it knows that writing to it has finished, i.e. when the .arch file is written. This isn't done until the jan file has been opened, so the dec file does not get archived until the next cycle.
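
The archiving rule described above can be sketched as follows. This is only an illustration in Python: the function name, the file names and the assumption that the marker for a file is named "<file>.arch" are all hypothetical, and the real UM and postproc naming may differ.

```python
def ready_to_archive(filenames, marker_suffix=".arch"):
    """Return the data files that have a matching completion marker.

    Assumption for illustration only: the marker for "x" is named "x.arch".
    """
    names = set(filenames)
    return sorted(
        name for name in names
        if not name.endswith(marker_suffix) and name + marker_suffix in names
    )


# Hypothetical directory contents at the end of a cycle.
history_data = [
    "bt694a.p41950nov", "bt694a.p41950nov.arch",
    "bt694a.p41950dec",  # no marker yet: the jan file has not been opened
]
print(ready_to_archive(history_data))
```

In this sketch the nov file is reported as ready and the dec file is held back, which matches the behaviour of one month lagging behind until the next cycle.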

It looks like the retrieve_ozone app knows this and symlinks the dec file from the share/data/History_Data directory so that it has all the data for the year. If you look in cylc-run/u-bt694/work/19510101T0000Z/retrieve_ozone you can see a symlink, but it isn't correct. I think the /* shouldn't be there. This part of the path comes from the variable REMOTE_SUITE_LINK defined in roses/u-bt694/site/monsoon.rc. Try changing this so it doesn't have the /* and rerun.
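
A quick way to confirm that a symlink in a work directory is dangling, and to see where it points, is sketched below in Python. The function names are made up for illustration, and strip_glob_suffix only shows the kind of change suggested for REMOTE_SUITE_LINK; the actual value of that variable is not shown in this ticket.

```python
import os


def broken_symlinks(directory):
    """Return (link_name, target) pairs for symlinks whose target does not exist."""
    broken = []
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        # os.path.exists follows the link, so it is False for a dangling symlink.
        if os.path.islink(path) and not os.path.exists(path):
            broken.append((entry, os.readlink(path)))
    return broken


def strip_glob_suffix(path):
    """Drop a trailing '/*' from a path, leaving other paths unchanged."""
    return path[:-2] if path.endswith("/*") else path


# Hypothetical value, for illustration only.
print(strip_glob_suffix("/some/suite/dir/*"))
```

Running broken_symlinks on the retrieve_ozone work directory would list any link whose target cannot be resolved, together with the stored target path.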

I can't test this as I don't have access to MASS, but hopefully it will work.

Jeff.

comment:8 Changed 2 months ago by mvguarino

Hi Jeff,

Okay, thank you very much!
I will try correcting the wrong symlink as soon as the current issues with the SSL certificates on MONSOON are resolved, and I will let you know so that this ticket can be closed.

Glad I helped find a (potential) code bug :)

Vittoria

comment:9 Changed 2 months ago by mvguarino

Hello,

Suite appears to be running fine.
Thanks again for your help,

Vittoria

comment:10 Changed 2 months ago by ros

  • Resolution set to fixed
  • Status changed from new to closed