Opened 2 years ago

Closed 2 years ago

#2251 closed help (fixed)

Segmentation fault

Reported by: jfgu
Owned by: annette
Component: UM Model
Keywords: modules gcom
Cc:
Platform: ARCHER
UM Version: 10.6

Description

Hi CMS helpdesk,

I am having trouble running the UM. My suite u-ap518 is a copy of another suite, u-an747, which ran OK this morning. I didn't make any changes to u-ap518. I submitted a job with this suite, and it stopped right at the start of the run. The err file says:

[11] exceptions: An exception was raised:11 (Segmentation fault)
[11] exceptions: the exception reports the extra information: Address not mapped to object.
[11] exceptions: whilst in a serial region
[11] exceptions: Task had pid=10622 on host nid00010
[11] exceptions: Program is "/work/n02/n02/jfgu/cylc-run/u-ap518/share/fcm_make/build-atmos/bin/um-atmos.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked

The job.err also pointed to the lines of code causing the fault. Most of them come from control/top_level and don't look unusual. However, there is a line saying:
[65] exceptions: [backtrace]: ( 10) : mpl_waitall_ in file /home/n02/n02/annette/cylc-run/vn6.1_gcom_trunk/share/archer_xc30_cce_mpp/preprocess/src/gcom/mpl/mpl_waitall.F90 line 46

When I tried to read the Fortran file, I found that the directory named "vn6.1_gcom_trunk" does not exist. So I guess this is not a problem with my suite. Has something in the setup under this path been changed?

Please could you help me with this? Thank you very much.

Jian-Feng

Change History (9)

comment:1 Changed 2 years ago by annette

  • Owner changed from um_support to annette
  • Status changed from new to accepted

Jian-Feng,

I am looking into this for you.

Annette

comment:2 Changed 2 years ago by annette

  • Keywords modules gcom added
  • Platform set to ARCHER
  • Status changed from accepted to pending
  • UM Version changed from <select version> to 10.6

Hi Jian-Feng,

ARCHER updated the default modules recently, and it looks like your suite needs to go back to the previous versions. In your suite.rc file, replace these lines:

module load cray-netcdf/4.4.1.1
module load cray-hdf5/1.10.0.1

with these lines:

module load cdt/15.11
module load cray-netcdf/4.3.2
module load cray-hdf5/1.8.13

You also have a compiler override to set -hlist=ad. Is this something you added? It makes the compiler list the optimisations applied, which will probably slow the compilation down, so unless you need those listings I would remove it.

You will need to do a full rebuild with --new as you were doing before. I have tested this and the suite runs for me.

Other UM suites we have tested work fine with the new modules, so I am not sure why yours is sensitive to this.

Annette

comment:3 Changed 2 years ago by jfgu

Hi Annette,

Thanks a lot. I didn't add "-hlist=ad" to the suite.rc. What I changed is the modules. My original suite was copied from Todd's. The modules loaded in that suite have been deleted on ARCHER. I followed the suggestions in another ticket to set the modules, and the suite ran OK at that time.

I will follow your suggestions to load the modules you listed, remove "-hlist=ad" and rebuild the model. ARCHER is under maintenance today. I will let you know whether it works once the compute nodes on ARCHER are available.

Jian-Feng

comment:4 Changed 2 years ago by annette

Hi Jian-Feng,

If it ran before the upgrade, then you can probably still use the new netcdf/hdf5 modules and just back-track the other modules, so:

module load cdt/15.11
module load cray-netcdf/4.4.1.1
module load cray-hdf5/1.10.0.1

It is important to load cdt before the other modules, so do not change the order.
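As a rough illustration of the ordering rule above, the following Python sketch checks that a `cdt/...` module is loaded before any `cray-*` module in a block of `module load` lines. The helper names (`loaded_modules`, `cdt_loaded_first`) are hypothetical and not part of Rose, cylc or the UM; the sketch only inspects text and does not load any modules.

```python
import re

# Hypothetical helper, not part of Rose, cylc or the UM.
# Matches lines of the form "module load <name>/<version>".
MODULE_LOAD = re.compile(r"^\s*module\s+load\s+(\S+)")


def loaded_modules(script_text):
    """Return the module names in the order they are loaded."""
    modules = []
    for line in script_text.splitlines():
        match = MODULE_LOAD.match(line)
        if match:
            modules.append(match.group(1))
    return modules


def cdt_loaded_first(script_text):
    """Return True if a cdt/... module is loaded before every cray-* module."""
    modules = loaded_modules(script_text)
    cdt_positions = [i for i, m in enumerate(modules) if m.startswith("cdt/")]
    cray_positions = [i for i, m in enumerate(modules) if m.startswith("cray-")]
    if not cdt_positions:
        # No cdt module at all, so the ordering requirement is not met.
        return False
    return all(cdt_positions[0] < i for i in cray_positions)


# The module lines suggested in this comment, assumed to be copied
# verbatim from the suite.rc file.
example = """
module load cdt/15.11
module load cray-netcdf/4.4.1.1
module load cray-hdf5/1.10.0.1
"""

print(cdt_loaded_first(example))
```

Pasting the relevant lines from your own suite.rc into `example` gives a quick sanity check before starting a full rebuild.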

If you use the rose edit GUI, you can find the compiler override under fcm_make → env → Advanced compilation.

Annette

comment:5 Changed 2 years ago by jfgu

Hi Annette,

Thanks. I will remove -hlist=ad in the fcm_make section. I didn't load module cdt/15.11 in the previous suite. Could this be the reason for the failure?

Jian-Feng

comment:6 Changed 2 years ago by annette

Hi Jian-Feng,

The cdt/15.11 line sets your modules back to what they were before ARCHER changed them. So you should end up with the same setup you had when it was working previously.

Annette

comment:7 Changed 2 years ago by jfgu

Hi Annette,

OK, I will let you know if there is still a problem when Archer is back tomorrow. Thank you.

Jian-Feng

comment:8 Changed 2 years ago by jfgu

Hi Annette,

The suites work well now. Thanks for your suggestions.

Jian-Feng

comment:9 Changed 2 years ago by annette

  • Resolution set to fixed
  • Status changed from pending to closed