Opened 5 months ago

Last modified 5 months ago

#3539 new help

CAP crashing in singularity container

Reported by: pmcguire Owned by: um_support
Component: UM Tools Keywords: ANTS, CAP, singularity, container, JASMIN
Cc: Platform: JASMIN
UM Version:

Description

Hi Simon:
I have been trying to track down the CAP segmentation fault that I get when I use your ANTS/CAP Singularity container with my JASMIN suite ~pmcguire/roses/u-bv358try7. I can now reproduce the error without the Rose/Cylc suite, from a singularity shell in this container, by setting the right environment variables with these two Linux commands on JASMIN:

1)
cd /home/users/pmcguire/cylc-run/u-bv358try7/work/1/ancilMask

2)
env MODEL="n216e/orca025" env ANCIL_PRG_EXEC="central_ancillary.exe" \
env "HARDWARE=LINUX" \
env "ROSE_DATA=/home/users/pmcguire/cylc-run/u-bv358try7/share/data" \
env SINGULARITYENV_PREPEND_PATH="/opt/CAP/bin" \
singularity shell -B /gws/nopw/j04/rdf_migrate_vol1 \
-B /work/scratch-pw/pmcguire ~siwilson/ANTS/test/ants_test.sif

and then this command at the Singularity prompt:

Singularity> AncilScr_RoseMask

Do you have any suggestions for how to fix this? I will post the seg fault error message in the comments on this ticket, along with the equivalent command that seems to work without the CAP/ANTS container.
Patrick

Change History (8)

comment:1 Changed 5 months ago by pmcguire

Singularity> AncilScr_RoseMask
+ date
+ date +%Y%m%d%H%M00
+ echo ‘%Script AncilScr_RoseMask starting at Tue May 18 00:50:37 BST 2021 - (20210518005000)’
%Script AncilScr_RoseMask starting at Tue May 18 00:50:37 BST 2021 - (20210518005000)
+ . AncilScr_RoseSetup
+ set -x
+ uname -a
+ HARDWARE=‘Linux sci1.jasmin.ac.uk 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 x86_64 GNU/Linux’
+ export HARDWARE
+ date
+ date +%Y%m%d%H%M00
+ echo ‘%Script AncilScr_RoseMask starting at Tue May 18 00:50:37 BST 2021 - (20210518005000) on ’ Linux sci1.jasmin.ac.uk 3.10.0-1160.15.2.el7.x86_64 ‘#1’ SMP Wed Feb 3 15:06:38 UTC 2021 x86_64 GNU/Linux
%Script AncilScr_RoseMask starting at Tue May 18 00:50:37 BST 2021 - (20210518005000) on Linux sci1.jasmin.ac.uk 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 x86_64 GNU/Linux
+ ANCIL_DIR=/home/users/pmcguire/cylc-run/u-bv358try7/share/data/n216e/orca025
+ export ANCIL_DIR
+ ANCIL_RUN_DIR=/home/users/pmcguire/cylc-run/u-bv358try7/share/data/n216e/orca025
+ export ANCIL_RUN_DIR
+ ANCIL_GRID=grid.nl
+ export ANCIL_GRID
+ ANCIL_SEARCH=search.nl
+ export ANCIL_SEARCH
+ ANCIL_VERTLEVS=verlevs.nl
+ export ANCIL_VERTLEVS
+ ANCIL_HORIZGRID=horizgrid.nl
+ export ANCIL_HORIZGRID
+ UNIT22=/vn/ctldata/stashmaster
+ export UNIT22
+ [ -n ‘’ ]
+ VERT_NAME_LIST=/home/users/pmcguire/cylc-run/u-bv358try7/share/data/etc/n216e/orca025/vertlevs.nl
+ export VERT_NAME_LIST
+ [[ ‘’ == .true. ]]
+ VARIABLE=F
+ export VARIABLE
+ MODEL=n216e/orca025
+ export MODEL
+ NAMECTRL=namectrl.nl
+ export NAMECTRL
+ whence central_ancillary.exe
+ ANCIL_EXEC_FULLPATH=/opt/CAP/bin/central_ancillary.exe
+ export ANCIL_EXEC_FULLPATH
+ IPACK=‘’
+ export IPACK
+ MASKIN=‘’
+ export MASKIN
+ set +e
+ rm fort.15 fort.38 fort.50 fort.56 fort.7 fort.4
rm: cannot remove ‘fort.15’: No such file or directory
+ set -e
+ ln -s fort.38
+ ln -s fort.50
+ ln -s README fort.7
+ rm -f namectrl.nl
+ [[ F == true ]]
+ [[ F == T ]]
+ cat /home/users/pmcguire/cylc-run/u-bv358try7/share/data/etc/n216e/orca025/search.nl /home/users/pmcguire/cylc-run/u-bv358try7/share/data/etc/n216e/orca025/grid.nl /home/users/pmcguire/cylc-run/u-bv358try7/share/data/etc/n216e/orca025/vertlevs.nl mask.nl lakes.nl vegfrac.nl veg_polygons.nl ancill.nl
+ 1>> namectrl.nl
+ ln -s namectrl.nl fort.4
+ [[ ‘Linux sci1.jasmin.ac.uk 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 x86_64 GNU/Linux’ == +(cray) ]]
+ ln -s fort.56
+ [[ ‘’ == nci ]]
+ [[ ‘Linux sci1.jasmin.ac.uk 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 x86_64 GNU/Linux’ == +(cray) ]]
+ [[ ‘Linux sci1.jasmin.ac.uk 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 x86_64 GNU/Linux’ == +(Ubuntu) ]]
+ [[ ‘Linux sci1.jasmin.ac.uk 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 x86_64 GNU/Linux’ == +(Linux) ]]
+ /opt/CAP/bin/central_ancillary.exe
 =====================================================
 GCOM Version 4.2
 MPP
 Using precision : 32bit INTEGERs and 32bit REALs
 Built at vn5.0
 =====================================================
 Output for PE           0
 No of PEs used in this run           1
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x7fcb3987783f in ???
#1 0x7fcb39cbb262 in format_hash
	at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638345833/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgfortran/io/format.c:128
#2 0x7fcb39cbb262 in find_parsed_format
	at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638345833/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgfortran/io/format.c:163
#3 0x7fcb39cc6449 in data_transfer_init
	at /home/conda/feedstock_root/build_artifacts/ctng-compilers_1578638345833/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libgfortran/io/transfer.c:2793
#4 0x55ad2f4e7c5a in ???
#5 0x55ad2f4e793e in ???
#6 0x7fcb3986409a in ???
#7 0x55ad2f4e7979 in ???
#8 0xffffffffffffffff in ???
/opt/CAP/bin/AncilScr_RoseMask: line 67: 14799: Memory fault
Segmentation fault

comment:2 Changed 5 months ago by pmcguire

Here are the four commands for running CAP with the same input parameters without the container. I had to run module load gcc before the final command.

1)
export PATH=~pmcguire/CAP9.1/build/bin:$PATH

2)
cd /home/users/pmcguire/cylc-run/u-bv358try7/work/1/ancilMask

3)
module load gcc

4)
env MODEL="n216e/orca025" env ANCIL_PRG_EXEC="central_ancillary.exe" \
env "HARDWARE=LINUX" env "ROSE_DATA=/home/users/pmcguire/cylc-run/u-bv358try7/share/data" \
AncilScr_RoseMask

comment:3 Changed 5 months ago by pmcguire

Here is the output of that container-free command. It gets a lot further than the run inside the container:

+ date
+ date +%Y%m%d%H%M00
+ echo ‘%Script AncilScr_RoseMask starting at Tue 18 May 17:53:01 BST 2021 - (20210518175300)’
%Script AncilScr_RoseMask starting at Tue 18 May 17:53:01 BST 2021 - (20210518175300)
+ . AncilScr_RoseSetup
+ set -x
+ uname -a
+ HARDWARE=‘Linux host595.jc.rl.ac.uk 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux’
+ export HARDWARE
+ date
+ date +%Y%m%d%H%M00
+ echo ‘%Script AncilScr_RoseMask starting at Tue 18 May 17:53:01 BST 2021 - (20210518175300) on ’ Linux host595.jc.rl.ac.uk 3.10.0-1160.15.2.el7.x86_64 ‘#1’ SMP Wed Feb 3 15:06:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
%Script AncilScr_RoseMask starting at Tue 18 May 17:53:01 BST 2021 - (20210518175300) on Linux host595.jc.rl.ac.uk 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
+ ANCIL_DIR=/home/users/pmcguire/cylc-run/u-bv358try7/share/data/n216e/orca025
+ export ANCIL_DIR
+ ANCIL_RUN_DIR=/home/users/pmcguire/cylc-run/u-bv358try7/share/data/n216e/orca025
+ export ANCIL_RUN_DIR
+ ANCIL_GRID=grid.nl
+ export ANCIL_GRID
+ ANCIL_SEARCH=search.nl
+ export ANCIL_SEARCH
+ ANCIL_VERTLEVS=verlevs.nl
+ export ANCIL_VERTLEVS
+ ANCIL_HORIZGRID=horizgrid.nl
+ export ANCIL_HORIZGRID
+ UNIT22=/vn/ctldata/stashmaster
+ export UNIT22
+ [ -n ‘’ ]
+ VERT_NAME_LIST=/home/users/pmcguire/cylc-run/u-bv358try7/share/data/etc/n216e/orca025/vertlevs.nl
+ export VERT_NAME_LIST
+ [[ ‘’ == .true. ]]
+ VARIABLE=F
+ export VARIABLE
+ MODEL=n216e/orca025
+ export MODEL
+ NAMECTRL=namectrl.nl
+ export NAMECTRL
+ whence central_ancillary.exe
+ ANCIL_EXEC_FULLPATH=/home/users/pmcguire/CAP9.1/build/bin/central_ancillary.exe
+ export ANCIL_EXEC_FULLPATH
+ IPACK=‘’
+ export IPACK
+ MASKIN=‘’
+ export MASKIN
+ set +e
+ rm fort.15 fort.38 fort.50 fort.56 fort.7 fort.4
rm: cannot remove ‘fort.15’: No such file or directory
+ set -e
+ ln -s fort.38
+ ln -s fort.50
+ ln -s README fort.7
+ rm -f namectrl.nl
+ [[ F == true ]]
+ [[ F == T ]]
+ cat /home/users/pmcguire/cylc-run/u-bv358try7/share/data/etc/n216e/orca025/search.nl /home/users/pmcguire/cylc-run/u-bv358try7/share/data/etc/n216e/orca025/grid.nl /home/users/pmcguire/cylc-run/u-bv358try7/share/data/etc/n216e/orca025/vertlevs.nl mask.nl lakes.nl vegfrac.nl veg_polygons.nl ancill.nl
+ 1>> namectrl.nl
+ ln -s namectrl.nl fort.4
+ [[ ‘Linux host595.jc.rl.ac.uk 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux’ == +(cray) ]]
+ ln -s fort.56
+ [[ ‘Linux host595.jc.rl.ac.uk 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux’ == +(cray) ]]
+ [[ ‘Linux host595.jc.rl.ac.uk 3.10.0-1160.15.2.el7.x86_64 #1 SMP Wed Feb 3 15:06:38 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux’ == +(Linux) ]]
+ echo ‘running CAP’
running CAP
+ echo ANCIL_EXEC_FULLPATH=/home/users/pmcguire/CAP9.1/build/bin/central_ancillary.exe
ANCIL_EXEC_FULLPATH=/home/users/pmcguire/CAP9.1/build/bin/central_ancillary.exe
+ /home/users/pmcguire/CAP9.1/build/bin/central_ancillary.exe
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
 Local host:      host595
 Local device:     mlx5_bond_0
 Local port:      1
 CPCs attempted:    rdmacm, udcm
--------------------------------------------------------------------------
 =====================================================
 GCOM Version 4.2
 MPP
 Using precision : 64bit INTEGERs and 64bit REALs
 Built at vn5.1
 =====================================================
 Output for PE           0
 No of PEs used in this run           1
**************************************************************
 *                              *
 * CENTRAL ANCILLARY FILE CREATION PROGRAM          *
 * Interpolates source ancillary file data to other grids  *
 *                              *
 *************************************************************
 FOLLOWING DATASETS REQUESTED
 Land sea mask
 REQUIRED GRID DEFINED AS FOLLOWS
 Number of columns          432
 Number of rows          324
 Which gives        139968 points
 Target grid is ENDgame grid
 Grid covers entire globe
 Grid pole at geographical North Pole  90.000000000000000    0.0000000000000000   
 Longitude (NW) origin is  0.41666666666666669   
 Latitude (NW) origin is  89.722222222222229   
 Longitude resolution supplied as  0.83333333333333337   
 Latitude resolution supplied as  0.55555555555555558   
 USING         2048 AS UM_SECTOR_SIZE
 INVALID IDAY_YEAR          365
 MUST BE EITHER 1, 2, 10 or 20
 PROGRAM ABORTS
gc_abort (Processor   0): ANCIL_EXIT
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 9.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
(base) [pmcguire@sci3.jasmin.ac.uk ancilMask]$ [host595.jc.rl.ac.uk:19846] PMIX ERROR: NO-PERMISSIONS in file gds_dstore.c at line 704
[host595.jc.rl.ac.uk:19846] PMIX ERROR: NO-PERMISSIONS in file gds_dstore.c at line 713
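
One visible difference between the two logs is the GCOM banner: the container build reports 32bit INTEGERs and REALs (built at vn5.0), while the container-free build reports 64bit (built at vn5.1). A minimal sketch for pulling that line out of a captured log, so the two builds can be compared quickly. The function name and the log file names are hypothetical; only the banner text is taken from the output above.

```shell
#!/usr/bin/env bash
# Print the GCOM "Using precision" setting from a CAP log read on stdin.
# gcom_precision is a hypothetical helper name, not part of CAP or GCOM.
gcom_precision() {
    # Strip the leading banner label and keep only the first match.
    sed -n 's/^[[:space:]]*Using precision : //p' | head -n 1
}

# Usage (cap_container.log and cap_native.log are hypothetical capture files):
#   gcom_precision < cap_container.log
#   gcom_precision < cap_native.log
```

If the two commands print different values, the executables were linked against differently configured GCOM libraries.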

comment:4 Changed 5 months ago by simon

Hi Patrick,

Firstly, thanks for the detailed bug report; it's been very useful.

I think I've fixed the issue, and a further potential one. I've managed to get the container to progress past the failure point, but I've been unable to replicate your JASMIN environment well enough to test it fully, so could you have a go? The new container version is in /home/users/siwilson/ANTS/test/ants_test_new.sif

The issue was that I was compiling the CAP in the ANTS conda env when building the container. This caused issues with inconsistent include files and run-time libraries, so the executable was failing during a simple write to stdout. I'm now compiling it outside the conda env. The other potential issue was that 32-bit GCOM libraries were used rather than 64-bit.

Let me know how you get on.

Simon.
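
The run-time library mix-up described in comment:4 can be looked for from inside the container by checking where the executable's shared libraries resolve from. This is a rough sketch, assuming `ldd` is available in the container; the function name and the `/opt/conda` prefix are hypothetical and should be replaced with wherever the conda env actually lives.

```shell
#!/usr/bin/env bash
# List shared libraries that resolve from under a given directory prefix.
# Reads `ldd` output on stdin and prints the matching library paths.
# libs_under_prefix is a hypothetical helper name.
libs_under_prefix() {
    local prefix=$1
    # ldd lines look like: "libname.so => /path/to/libname.so (0x...)"
    awk -v p="$prefix" '$2 == "=>" && index($3, p) == 1 { print $3 }'
}

# Usage inside the container shell (the conda prefix is an assumption):
#   ldd /opt/CAP/bin/central_ancillary.exe | libs_under_prefix /opt/conda
```

Any libgfortran entry printed here would suggest the executable is picking up the conda run-time rather than the one it was compiled against.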

comment:5 Changed 5 months ago by pmcguire

Hi Simon:
Many thanks!
I am trying the new container now.
I will let you know how it goes.
Patrick

comment:6 Changed 5 months ago by pmcguire

Hi Simon:
The ancilMask CAP app has now finished, apparently successfully, for the Rose/Cylc suite ~pmcguire/roses/u-bv358try8. Thank you!

The log files have an ASCII representation in them of the land/sea mask, which is output by CAP. See:
~pmcguire/cylc-run/u-bv358try8/log/job/1/ancilMask/01/job.out.
Here is the east-central Atlantic and UK region part of that mask:

000000000000000000000000000000000000000011
000000000000000000000000000000000000000011
000000000000000000000000000000000000000000
000000000000000000000000000000000000011000
000000000000000000000000000000000011111000
000000000000000000000000000000000111111000
000000000000000000000000000000001111111100
000000000000000000000000000000000011111100
000000000000000000000000000000000011111100
000000000000000000000000000000000011111110
000000000000000000000000000000011111111110
000000000000000000000000000000011111111111
000000000000000000000000000000111111111111
000000000000000000000000000001111110111111
000000000000000000000000000001111110111111
000000000000000000000000000000111110111111
000000000000000000000000000001111111111111
000000000000000000000000000000110000111111
000000000000000000000000000000000001111111
000000000000000000000000000000000001110100
000000000000000000000000000000000000000110
000000000000000000000000000000000000000111
000000000000000000000000000000000000111111
000000000000000000000000000000000000111111

Currently, the ancilOrog CAP app is running.
Patrick

comment:7 Changed 5 months ago by pmcguire

Hi Simon:
The ancilOrog CAP app seemed to work fine when the Rose/Cylc suite ~pmcguire/roses/u-bv358try8 uses your /home/users/siwilson/ANTS/test/ants_test_new.sif container. Thank you.

But the next CAP app, ancilVegfrac (maybe the last one for now), doesn't work with the container.
I get this error message in
~pmcguire/cylc-run/u-bv358try8/log/job/1/ancilVegfrac/01/job.err:

At line 181 of file /opt/CAP_build/preprocess/src/um/src/utility/qxreconf/box_sum.F90

Fortran runtime error: Index '181' of dimension 2 of array 'source' above upper bound of 180

I haven't been able to reproduce this error yet without the container, i.e. with the CAP binary in ~pmcguire/CAP9.1 (copied from your copy some time ago), mainly because I don't think the run gets that far. I am using ~pmcguire/roses/u-bv358noCAPContainer1 to try to reproduce it without the container.

The problem without the container is the one mentioned in Slack: with the CAP9.1 code, which has no OPEN(n) statements before the READ(n,*) statements, the Fortran code cannot read fort.n files that are softlinks to files on the rdf_migrate_vol1 GWS.

As I figured out previously, this softlink/fort.n issue seems to be OK with ifort but not with gfortran.
Is the CAP part of the ANTS/CAP container compiled with ifort, mpif90 or gfortran?
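
Before blaming the compiler, one cause worth ruling out is that a fort.n softlink points at a target that is not reachable from where CAP runs (for example a GWS path that is not mounted). A minimal sketch of that check follows; the function name is hypothetical, and it only detects dangling links, not the gfortran versus ifort difference itself.

```shell
#!/usr/bin/env bash
# Report fort.N entries in a directory that are symlinks with an unreachable target.
# dangling_fort_links is a hypothetical helper name; prints one line per dangling link.
dangling_fort_links() {
    local dir=${1:-.} f
    for f in "$dir"/fort.*; do
        # -L: the entry is a symlink; ! -e: its target cannot be resolved from here.
        if [ -L "$f" ] && [ ! -e "$f" ]; then
            printf '%s -> %s\n' "$f" "$(readlink "$f")"
        fi
    done
}

# Usage from the task work directory:
#   dangling_fort_links /home/users/pmcguire/cylc-run/u-bv358try7/work/1/ancilMask
```

No output means every fort.n link resolves, in which case the read failure has to come from somewhere else.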

I tried to recompile my copy (~pmcguire/CAP9.1b) of your compiled CAP9.1 code, but I haven't got very far. I can't work out which modules I need to load so that the default compiler (mpif90, set in the fcm-make.cfg file) is found. I wanted to try switching the compiler there to ifort, or something similar.

Is there source code for your ANTS/CAP container somewhere where I could look at it?
Patrick

comment:8 Changed 5 months ago by simon

Hi Patrick,

Have you seen the README.txt in /home/users/siwilson/CAP9.1? I'll reproduce it here:
To build on JASMIN:

module load eb/OpenMPI/gcc/4.0.0
module load gcc/8.2.0
rm -fr build/
fcm make -f fcm-make.cfg

The CAP inside the container is built with gcc 8.3, and the source code is CAP 9.2 (revision 5766), described here:
https://code.metoffice.gov.uk/trac/ancil/wiki/WikiStart?version=106#CentralAncillaryProgramCAP
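
Since the earlier build attempt failed because mpif90 could not be found, it may help to confirm that the README's two module load lines have put the needed tools on PATH before starting the build. This is only a sketch: the function name is hypothetical, and the list of commands to check (mpif90, gfortran, fcm) is an assumption based on the README and fcm-make.cfg mentioned above.

```shell
#!/usr/bin/env bash
# Print the name of each required command that is not found on PATH.
# missing_cmds is a hypothetical helper name.
missing_cmds() {
    local c
    for c in "$@"; do
        command -v "$c" >/dev/null 2>&1 || printf '%s\n' "$c"
    done
}

# Usage after the two `module load` lines from the README:
#   missing=$(missing_cmds mpif90 gfortran fcm)
#   [ -z "$missing" ] && fcm make -f fcm-make.cfg
```

If anything is printed, the corresponding module has probably not been loaded in the current shell.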

Simon.
