Opened 4 weeks ago

Last modified 3 days ago

#2530 accepted help

segmentation fault

Reported by: amenon
Owned by: willie
Priority: normal
Component: UM Model
Keywords: segmentation fault
Cc:
Platform: ARCHER
UM Version: 10.9

Description

Hi,
I am getting a segmentation fault in the forecast step of my suite u-ay368. The error goes like this:

[5] exceptions: An exception was raised:11 (Segmentation fault)
[7] exceptions: An exception was raised:11 (Segmentation fault)
[5] exceptions: the exception reports the extra information: Address not mapped to object.
[7] exceptions: the exception reports the extra information: Address not mapped to object.
[5] exceptions: whilst in a serial region
[7] exceptions: whilst in a serial region
[5] exceptions: Task had pid=25557 on host nid00032
[7] exceptions: Task had pid=25559 on host nid00032
[5] exceptions: Program is "/work/n02/n02/amenon/cylc-run/u-ay368/share/fcm_make/build-atmos/bin/um-atmos.exe"
[7] exceptions: Program is "/work/n02/n02/amenon/cylc-run/u-ay368/share/fcm_make/build-atmos/bin/um-atmos.exe"
[5] exceptions: calling registered handler @ 0x004177a0
[7] exceptions: calling registered handler @ 0x004177a0
[672] exceptions: An exception was raised:11 (Segmentation fault)
[672] exceptions: the exception reports the extra information: Address not mapped to object.
[672] exceptions: whilst in a serial region
[672] exceptions: Task had pid=26321 on host nid00499 

I tried changing the module load lines

module load cray-netcdf/4.4.1.1
module load cray-hdf5/1.10.0.1

in the suite-adds.rc file, as suggested in ticket #2251, but it still doesn't work. Could you please help? Many thanks.
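For reference, the fragment of suite-adds.rc in question would look something like the sketch below; only the two module lines come from this ticket, and the surrounding section and script name are assumptions about how such files are usually laid out:

```
# suite-adds.rc (sketch -- the enclosing section is illustrative,
# not copied from u-ay368)
    pre-script = """
                 module load cray-netcdf/4.4.1.1
                 module load cray-hdf5/1.10.0.1
                 """
```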

Arathy

Change History (13)

comment:1 Changed 3 weeks ago by grenville

Arathy

While the reconfiguration claimed to have worked, it has produced bad surface fields; look at surface temperature, for example.

Not sure what's happening here.

Grenville

comment:2 Changed 3 weeks ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Arathy,

Tickets #2428, #2490, #2500, and this one all have the same fundamental issue, namely that the reconfiguration has failed to work. I think we should focus on getting the reconfiguration in u-ay368 to work.

I will look at this.

Regards
Willie

comment:3 Changed 3 weeks ago by amenon

Thanks Willie. Looking forward to getting past this issue.

comment:4 Changed 3 weeks ago by grenville

Arathy

Are you sure about the values for rg01_centre (20, -75) in suite u-ay368?
u-ai540 has values (20, 75).

I used (20, 75) in my copy of u-ay368 and it created a start file which looks OK, and it ran perfectly well with spiral search (because the number of unresolved points was O(100) times fewer than with (20, -75)).

(20,-75) appears to put you on the wrong part of the planet.

Grenville

comment:5 Changed 3 weeks ago by amenon

That's terrible. It is (20, 75). Thanks Grenville. That '-' sign must have crept in accidentally while I was typing. I will restart the suite now and see how it goes.

comment:6 Changed 3 weeks ago by amenon

Hi Grenville,

By changing the domain centre to (20, 75) and setting coast_adj_method back to 'spiral' from 'standard', the reconfiguration job succeeded. But I still get a segmentation fault in the forecast job.

comment:7 Changed 2 weeks ago by amenon

Hi Grenville,
I checked the start dump files and they look fine. I will restart the suite after switching on more diagnostics and will get back to you.

comment:8 Changed 2 weeks ago by grenville

Hi Arathy

The problem is in STASH - I've switched off a lot of diagnostics and the model runs out to >850 time steps.

Have you added STASH requests not present in the Monsoon equivalent?

I'll experiment to try to identify the offending diagnostic, but it's a potentially slow process.

Grenville

comment:9 Changed 2 weeks ago by amenon

Hi Grenville,

Thanks. I added a lot of new STASH requests to this run, such as the theta and PV tracers from a new branch (0578-0608), 4111-4118, and many others.

comment:10 Changed 8 days ago by simon

I've managed to get the model to complete by changing a number of entries in the STASH UI. There were a number of problems, mainly that the time profiles were configured to output data before it was available. I've fixed those, and also changed TS_MIN and TS_MAX to be daily mins and maxes, but I don't know if these are what is actually required. I also added three new fields, 30203, 30204 and 30207; these are required to produce the products you asked for. Finally, I removed packing from the accumulated increments to make them work.

The changes from your config are:

Time Profile:
T15MN_MN
sampling period to Timesteps
istr to 15

T1DY_MN
sampling period to Timesteps
istr to 1

T1HR
istr to 1

T1HR_MN
sampling period to Timesteps

T3HR
istr to 3

T3HR_MN
sampling period to Timesteps
istr to 3

TACC6HR
istr to 6

TS_MAX
Don't know what is required: for daily max:
unt1 Days
isam 1
unt3 Days
iopt Regular
istr 1
iend -1
ifre 1

TS_MIN
Don't know what is required: for daily min:
unt1 Days
isam 1
unt3 Days
iopt Regular
istr 1
iend -1
ifre 1

STASH Requests:
Extra required:
30203 PLEVS T3HR_MN 66_DIAGS
30204 PLEVS T3HR_MN 66_DIAGS
30207 PLEVS T3HR_MN 66_DIAGS
These are clones of the 30201 T3HR_MN

Model Output Streams:
pp10 packing 'Unpacked'

After running, please look at the output files to check that the diagnostics are as expected.

If you copy /home/simon/roses/u-ay368-am/app/um/rose-app.conf to the equivalent directory under your account, and restart the rose edit, the changes should be picked up.
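For anyone making these edits by hand rather than copying the file, a STASH time profile in rose-app.conf looks roughly like this sketch. The namelist index, unit codes, and exact variable set are assumptions (check them against rose edit); the values mirror the TS_MAX list above:

```
# rose-app.conf sketch -- namelist index and unit/option codes are
# assumptions, not copied from u-ay368
[namelist:umstash_time(ts_max_0001)]
tim_name='TS_MAX'
unt1='DA'
isam=1
unt3='DA'
iopt=1
istr=1
iend=-1
ifre=1
```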

comment:11 Changed 6 days ago by amenon

Hi Simon,

Many thanks for sorting this out. I could complete the suite after copying these changes. These are some of the issues I found after checking the diagnostics:

  1. It has not output any diagnostics from the tracers branch that I added. Those diagnostics are directed to output stream PJ. The output stream is created, but it is empty (e.g. /nerc/n02/n02/amenon/u-ay368/field.pp/20160701T0000Z_INCOMPASS_km4p4_RA1T_pj000.pp). Do I need to make any additional changes in the suite to get those diagnostics out?
  2. With time profile T15MN_MN I intend to produce output every 15 minutes by averaging over all the time steps in that 15-minute period. But with the current setup of that time profile, I am getting output at varying intervals such as 20 minutes, 27 minutes, etc.
  3. With the TS_MAX and TS_MIN profiles I am trying to get the maximum and minimum values over all time steps in an hour. I assume that if I change unt1 and unt3 from 'Days' to 'Hours' I should get these values. Is that right?
  4. Some diagnostics are not created even though their profiles look right, e.g. 3287, 5185, 5186.
  5. Time profile TACC6HR is used to output accumulated data every 6 hours (accumulation over 6 hours). 79_DIAGS, which uses this time profile, produces 4 outputs a day (as expected for a 6-hourly accumulation), but the time stamps on those outputs (the time attribute shown when opening the file in xconv) are 3Z, 6Z, 12Z and 18Z. Do you know why that is?
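On the varying-frequency point (item 2): when a time profile's sampling period and frequency are expressed in timesteps, the wall-clock interval between outputs is the frequency multiplied by the model timestep length, not a number of minutes. A quick sanity check of that arithmetic (the 100 s timestep below is a hypothetical value for illustration, not taken from u-ay368):

```python
def output_interval_minutes(freq_timesteps: int, dt_seconds: float) -> float:
    """Wall-clock minutes between outputs when the STASH output
    frequency is expressed in model timesteps of length dt_seconds."""
    return freq_timesteps * dt_seconds / 60.0

# With a hypothetical 100 s timestep, a frequency of 15 *timesteps*
# gives output every 25 minutes, not every 15 minutes:
print(output_interval_minutes(15, 100.0))  # 25.0
```

This is why switching the units from timesteps to minutes (as suggested later in the thread) makes the interval come out as intended regardless of the model timestep.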

Sorry for so many questions. Please let me know if anything is unclear.

Thanks.

comment:12 Changed 6 days ago by simon

Hi,

  1. I think all of the variables written to the j stream are unavailable for the model version you are using. If you look at the log file, there are messages of this type:

????????????????????????????????????????????????????????????????????????????????
?????????????????????????????? WARNING ??????????????????????????????
? Warning code: -30
? Warning from routine: PRELIM
? Warning message:
? Field - Section:0, Item:581 request denied.
? Unavailable to this model version.
? Warning from processor: 0
? Warning number: 23
????????????????????????????????????????????????????????????????????????????????

  2. This was my error: change unt3 from Timesteps to Minutes.
  3. This should work.
  4. Again, these diagnostics are unavailable for your model version (along with a number of others).
  5. I've looked at the header information of the files with pumf, and they look OK. I don't know how xconv is calculating the validity time for the fields; xconv can sometimes give spurious information about data times, and it's always best to use another method, such as pumf, to check these.

comment:13 Changed 3 days ago by simon

Hi,

You can write some of the j stream diagnostics by turning on l_pv_tracer and its subsections in Section 33. Unfortunately, turning on the diabatic tracer causes the model to fail. You also need to set the reserved_header value to 10000 in the pp9 Model Output Stream section. The PV diagnostics appear to be written to the output file.
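As a rough sketch of where those two settings live in rose-app.conf; the namelist names and exact variable spellings below are assumptions, so verify them against the rose edit panels rather than pasting blindly:

```
# Sketch only -- namelist names and variable spellings are assumptions,
# not confirmed by this ticket
[namelist:run_free_tracers]
l_pv_tracer=.true.

[namelist:nlstcall_pp(pp9)]
reserved_headers=10000
```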

Note: See TracTickets for help on using tickets.