Opened 5 months ago

Closed 5 months ago

#3384 closed help (fixed)

Transitioning rose/cylc suites to SLURM

Reported by: mtodt Owned by: jules_support
Component: JULES Keywords: SLURM, cylc, JASMIN
Cc: Platform: JASMIN
UM Version:

Description

Hi

I'm trying to transition suites that I have been running, and am still running, to SLURM, starting with u-bv464. I've set up login to cylc1.jasmin. In suite.rc I've changed the job submission method from lsf to slurm and updated the job specifications for fcm_make and jules according to the JASMIN help page. The submission itself seems to work, but fcm_make fails with the following error message:

[FAIL] mpif90 -oo/water_constants_mod.o -c -DSCMA -DBL_DIAG_HACK -DCOMPILER_INTEL -I./include -I/gws/nopw/j04/jules/jules_build/libs/include -heap-arrays -fp-model precise -traceback -I/gws/nopw/j04/jules/jules_build/libs/include -ip -no-prec-div -static-intel -lz -lm /home/users/mtodt/cylc-run/u-bv464/share/fcm_make/preprocess/src/jules/src/params/standalone/water_constants_mod_jls.F90: command not found
[FAIL] compile    0.0 ! water_constants_mod.o <- jules/src/params/standalone/water_constants_mod_jls.F90
...
[FAIL] ! water_constants_mod.mod: depends on failed target: water_constants_mod.o
[FAIL] ! water_constants_mod.o: update task failed
[FAIL] fcm make -f /work/scratch-pw/mtodt/cylc-run/u-bv464/work/1/fcm_make/fcm-make.cfg -C /home/users/mtodt/cylc-run/u-bv464/share/fcm_make -j 4 # return-code=255

The error messages pop up for lots of modules, not just water_constants_mod. I assume there's something I've overlooked while changing to SLURM? Something like a link to a library that has to be changed as well? Thanks a lot for your help!

Cheers
Markus
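
The "command not found" at the end of the first [FAIL] line means the shell could not locate mpif90 on PATH when fcm_make ran. A minimal sketch for checking this from inside the job environment, assuming a POSIX shell; check_cmd is a hypothetical helper, not part of Rose, Cylc or FCM:

```shell
# Report whether a command is visible on PATH in the current environment.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1 -> $(command -v "$1")"
    else
        echo "missing: $1"
        return 1
    fi
}

# mpif90 is the compiler wrapper named in the [FAIL] line above.
check_cmd mpif90 || echo "mpif90 is not on PATH; the compiler/MPI modules may not be loaded"
```

Running the same check in the task's environment setup, before fcm make starts, would show whether the modules loaded there actually provide the wrapper.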

Change History (16)

comment:1 Changed 5 months ago by pmcguire

  • Cc mcguirepatr@… removed
  • Component changed from Rose/Cylc to JULES
  • Keywords SLURM, cylc, added; SLURM cylc removed
  • Owner changed from um_support to jules_support

comment:2 Changed 5 months ago by pmcguire

Hi Markus:
You probably need to update your MPI NETCDF libraries.
We have been doing that in the way outlined in this CMS Helpdesk ticket:
http://cms.ncas.ac.uk/ticket/3377#comment:3
You can look at the Rose/Cylc suite history of the suite (u-bx723) mentioned in that ticket to see the changes necessary to get a gridded JULES suite ported from LSF to SLURM.
More details will be sent to the jules-users email list soon, I hope.
Patrick

comment:3 Changed 5 months ago by mtodt

Hi Patrick

Thanks a lot for the link!
I couldn't find any related changes in u-bx723 or its predecessor u-bx722, though. But I noticed a difference from my suite in suite.rc under JASMIN -> env-script, so I added those lines to my suite. The previous error no longer occurs, but fcm_make still fails:

[FAIL] mpif90 -obin/jules.exe o/jules.o -L/tmp/6I7zFCLAYF -ljules -L/gws/nopw/j04/jules/jules_build/libs/lib -lnetcdff -lnetcdf -lhdf5_hl -lhdf5 -lcurl -heap-arrays -fp-model precise -traceback -L/gws/nopw/j04/jules/jules_build/libs/lib -lnetcdff -lnetcdf -lhdf5_hl -lhdf5 -lz -lm # rc=1
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libnetcdf.a(libdispatch_la-dfile.o): in function `NC_open':
[FAIL] dfile.c:(.text+0x1ac6): undefined reference to `hpmp_comm_world'
[FAIL] ld: dfile.c:(.text+0x1aff): undefined reference to `hpmp_char'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libnetcdf.a(libnetcdf4_la-nc4file.o): in function `NC4_create':
[FAIL] nc4file.c:(.text+0x125): undefined reference to `hpmp_comm_world'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libnetcdf.a(libnetcdf4_la-nc4file.o): in function `NC4_open':
[FAIL] nc4file.c:(.text+0x3bb9): undefined reference to `hpmp_comm_world'
[FAIL] ld: nc4file.c:(.text+0x3c5a): undefined reference to `hpmp_char'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libnetcdf.a(libnetcdf4_la-nc4hdf.o): in function `nc4_put_vara':
[FAIL] nc4hdf.c:(.text+0x9fa): undefined reference to `hpmp_unsigned'
[FAIL] ld: nc4hdf.c:(.text+0xa09): undefined reference to `hpmp_max'
[FAIL] ld: nc4hdf.c:(.text+0xa10): undefined reference to `hpmp_f_mpi_in_place'
[FAIL] ld: nc4hdf.c:(.text+0xa7d): undefined reference to `hpmp_int'
[FAIL] ld: nc4hdf.c:(.text+0xa83): undefined reference to `hpmp_bor'
[FAIL] ld: nc4hdf.c:(.text+0xa8a): undefined reference to `hpmp_f_mpi_in_place'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5AC.o): in function `H5AC_receive_and_apply_clean_list':
[FAIL] H5AC.c:(.text+0x3507): undefined reference to `hpmp_int'
[FAIL] ld: H5AC.c:(.text+0x355f): undefined reference to `hpmp_byte'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5AC.o): in function `H5AC_broadcast_clean_list':
[FAIL] H5AC.c:(.text+0x3e98): undefined reference to `hpmp_int'
[FAIL] ld: H5AC.c:(.text+0x3fd4): undefined reference to `hpmp_byte'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5AC.o): in function `H5AC_propagate_and_apply_candidate_list':
[FAIL] H5AC.c:(.text+0x495a): undefined reference to `hpmp_int'
[FAIL] ld: H5AC.c:(.text+0x4ab5): undefined reference to `hpmp_byte'
[FAIL] ld: H5AC.c:(.text+0x4b5e): undefined reference to `hpmp_int'
[FAIL] ld: H5AC.c:(.text+0x4bcf): undefined reference to `hpmp_byte'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5Dmpio.o): in function `H5D__mpio_opt_possible':
[FAIL] H5Dmpio.c:(.text+0xf7): undefined reference to `hpmp_int'
[FAIL] ld: H5Dmpio.c:(.text+0x101): undefined reference to `hpmp_bor'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5Dmpio.o): in function `H5D__contig_collective_read':
[FAIL] H5Dmpio.c:(.text+0x448): undefined reference to `hpmp_byte'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5Dmpio.o): in function `H5D__contig_collective_write':
[FAIL] H5Dmpio.c:(.text+0x8f8): undefined reference to `hpmp_byte'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5Dmpio.o): in function `H5D__multi_chunk_collective_io':
[FAIL] H5Dmpio.c:(.text+0xf35): undefined reference to `hpmp_byte'
[FAIL] ld: H5Dmpio.c:(.text+0xffe): undefined reference to `hpmp_byte'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5Dmpio.o): in function `H5D__inter_collective_io':
[FAIL] H5Dmpio.c:(.text+0x1ed7): undefined reference to `hpmp_byte'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5Dmpio.o):H5Dmpio.c:(.text+0x25ea): more undefined references to `hpmp_byte' follow
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5Dmpio.o): in function `H5D__link_chunk_collective_io':
[FAIL] H5Dmpio.c:(.text+0x3ac5): undefined reference to `hpmp_int'
[FAIL] ld: H5Dmpio.c:(.text+0x3acb): undefined reference to `hpmp_sum'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5Dmpio.o): in function `H5D__chunk_collective_read':
[FAIL] H5Dmpio.c:(.text+0x3ca5): undefined reference to `hpmp_int'
[FAIL] ld: H5Dmpio.c:(.text+0x3cb0): undefined reference to `hpmp_sum'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5Dmpio.o): in function `H5D__chunk_collective_write':
[FAIL] H5Dmpio.c:(.text+0x3fe5): undefined reference to `hpmp_int'
[FAIL] ld: H5Dmpio.c:(.text+0x3ff0): undefined reference to `hpmp_sum'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5Eint.o): in function `H5E_walk1_cb':
[FAIL] H5Eint.c:(.text+0x1c6): undefined reference to `hpmp_comm_world'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5Eint.o): in function `H5E_walk2_cb':
[FAIL] H5Eint.c:(.text+0x3e8): undefined reference to `hpmp_comm_world'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5FDmpio.o): in function `H5FD_mpio_open':
[FAIL] H5FDmpio.c:(.text+0x146c): undefined reference to `hpmp_comm_self'
[FAIL] ld: H5FDmpio.c:(.text+0x1578): undefined reference to `hpmp_byte'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5FDmpio.o): in function `H5FD_mpio_read':
[FAIL] H5FDmpio.c:(.text+0x1cad): undefined reference to `hpmp_byte'
[FAIL] ld: H5FDmpio.c:(.text+0x1d1c): undefined reference to `hpmp_byte'
[FAIL] ld: H5FDmpio.c:(.text+0x1fd8): undefined reference to `hpmp_byte'
[FAIL] ld: H5FDmpio.c:(.text+0x2069): undefined reference to `hpmp_byte'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5FDmpio.o):H5FDmpio.c:(.text+0x238d): more undefined references to `hpmp_byte' follow
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5FDmpiposix.o): in function `H5FD_mpiposix_open':
[FAIL] H5FDmpiposix.c:(.text+0x8ae): undefined reference to `hpmp_comm_self'
[FAIL] ld: H5FDmpiposix.c:(.text+0x977): undefined reference to `hpmp_byte'
[FAIL] ld: H5FDmpiposix.c:(.text+0x9f4): undefined reference to `hpmp_byte'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5Smpio.o): in function `H5S_mpio_space_type':
[FAIL] H5Smpio.c:(.text+0x9dd): undefined reference to `hpmp_byte'
[FAIL] ld: H5Smpio.c:(.text+0xea9): undefined reference to `hpmp_byte'
[FAIL] ld: H5Smpio.c:(.text+0x1a30): undefined reference to `hpmp_byte'
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5Smpio.o):H5Smpio.c:(.text+0x1be6): more undefined references to `hpmp_byte' follow
[FAIL] ld: /gws/nopw/j04/jules/jules_build/libs/lib/libhdf5.a(H5Smpio.o): in function `H5S_mpio_space_type':
[FAIL] H5Smpio.c:(.text+0x1d08): undefined reference to `hpmp_lb'
[FAIL] ld: H5Smpio.c:(.text+0x1d14): undefined reference to `hpmp_ub'
[FAIL] ld: H5Smpio.c:(.text+0x21a1): undefined reference to `hpmp_byte'
[FAIL] link      33.5 ! jules.exe            <- jules/src/control/standalone/jules.F90
[FAIL] ! jules.exe           : update task failed

[FAIL] fcm make -f /work/scratch-pw/mtodt/cylc-run/u-bv464/work/1/fcm_make/fcm-make.cfg -C /home/users/mtodt/cylc-run/u-bv464/share/fcm_make -j 4 # return-code=255

Can you be more specific about what exactly I have to add to my suite that's not in the help pages and updated tutorials? Thanks a lot!

Cheers
Markus
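
The undefined hpmp_* symbols in the log above are all MPI handles (hpmp_comm_world, hpmp_int, hpmp_byte and so on) referenced from the static libnetcdf.a and libhdf5.a under /gws/nopw/j04/jules/jules_build/libs/lib. That suggests those archives were built against a different MPI implementation from the one mpif90 now links against, although this reading is an assumption and is not confirmed in the ticket. A rough sketch for condensing such a log into one line per missing symbol; summarise_undefined is a hypothetical helper and the log file name is whatever job.err the task produced:

```shell
# Print each undefined symbol once, with the number of times ld reported it.
# Lines of the form "more undefined references to ... follow" are not counted.
summarise_undefined() {
    grep -o "undefined reference to \`[^']*'" "$1" \
        | sed "s/.*\`\([^']*\)'/\1/" \
        | sort | uniq -c | sort -rn
}

# Usage (path is an example):
#   summarise_undefined job.err
```

If every missing symbol shares one prefix, as here, the problem is more likely a library mismatch than a missing -l flag.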

comment:4 Changed 5 months ago by pmcguire

Hi Markus
Yes, those changes to the JASMIN -> env-script are the main thing I was talking about.
Did you do a rose suite-run --new?
That should set you up from scratch.
Patrick

comment:5 Changed 5 months ago by mtodt

Oh, ok. I've resubmitted the suite with rose suite-run --new, let's see what happens once it's actually running. Thanks a lot!

Cheers
Markus

comment:6 Changed 5 months ago by pmcguire

I'm testing it too with your suite.
I did change the JULES FORTRAN source code about 10-15 minutes ago in /gws/nopw/j04/jules/pmcguire/jules_build/jules-vn4.9_positiverain/etc/fcm-make/platform/jasmin-lotus-intel.cfg to point to the new & correct jules GWS for the curl library.
If your fcm_make hasn't started running yet, it should pick up the change.
Patrick

comment:7 Changed 5 months ago by pmcguire

Hi Markus:
It's also possible that your fcm_make app for the GL6R WFDEI irrigation u-bv464 suite is configured differently than the fcm_make app is for the GL7 N96 SLURM suite u-bx723. It's possible/probable that the settings you changed in the suite.rc file for u-bv464 to match u-bx723 get overwritten in the app/fcm_make directory of u-bv464.
Patrick

comment:8 Changed 5 months ago by pmcguire

Hi Markus:
I tried to run your suite a couple of times. But it fails with a submit_failed during fcm_make. It doesn't even try to compile.
Do you get the fcm_make app to start running in the short-serial queue?
I think the GL7 suite u-bx723 compiles as a background task instead of in the short-serial queue, but I don't know if that matters.
Patrick

comment:9 Changed 5 months ago by mtodt

Hi Patrick

Thanks! Yes, it got submitted this morning (after submitting yesterday evening) but failed with what seems to be the same error message.

I see what you mean regarding app/fcm_make/rose-app.conf, but when I compare its content

[env]
JULES_BUILD=normal
JULES_COMPILER=intel
#JULES_FFLAGS_EXTRA=-I/gws/nopw/j04/jules/jules_build/libs/include -O3 -xHost -ip -no-prec-div -static-intel -lz -lm
JULES_FFLAGS_EXTRA=-I/gws/nopw/j04/jules/jules_build/libs/include -ip -no-prec-div -static-intel -lz -lm
JULES_LDFLAGS_EXTRA=-L/gws/nopw/j04/jules/jules_build/libs/lib -lnetcdff -lnetcdf -lhdf5_hl -lhdf5 -lz -lm
JULES_MPI=mpi
JULES_NETCDF=netcdf
JULES_NETCDF_INC_PATH=/gws/nopw/j04/jules/jules_build/libs/include
JULES_NETCDF_LIB_PATH=/gws/nopw/j04/jules/jules_build/libs/lib
JULES_NETCDF_PATH=/gws/nopw/j04/jules/jules_build/libs
JULES_OMP=noomp
!!JULES_PLATFORM=jasmin-lotus-intel
!!JULES_REMOTE=local
!!JULES_REMOTE_HOST=localhost
#JULES_SOURCE=/gws/nopw/j04/jules/albmar/jules_build/jules-vn4.5/trunk
JULES_SOURCE=/gws/nopw/j04/jules/pmcguire/jules_build/jules-vn4.9_positiverain

with what I added to suite.rc

                eval $(rose task-env)
                export PATH=/apps/jasmin/metomi/bin:$PATH
                module load intel/19.0.0
#                module load contrib/gnu/gcc/8.2.0
                module load contrib/gnu/gcc/7.3.0
                module load eb/OpenMPI/intel/3.1.1
#                module add parallel-netcdf/intel
                module list 2>&1
                env | grep LD_LIBRARY_PATH
                export NETCDF_FORTRAN_ROOT=/home/users/siwilson/netcdf_par/3.1.1/intel.19.0.0/
                export NETCDF_ROOT=/home/users/siwilson/netcdf_par/3.1.1/intel.19.0.0/
                export HDF5_LIBDIR=/home/users/siwilson/netcdf_par/3.1.1/intel.19.0.0/lib
#                module load intel/19.0.0
                export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
                export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HDF5_LIBDIR
                env | grep LD_LIBRARY_PATH

there's no overlap. So I don't see what would get overwritten.

Looking at the job.err file, though, the problem seems to stem from, or be related to, JULES_NETCDF_PATH and JULES_LDFLAGS_EXTRA in app/fcm_make/rose-app.conf. How were those set in u-bx723 if not in app/fcm_make/rose-app.conf? Should I replace them or comment them out in my suite?

Cheers
Markus

comment:10 Changed 5 months ago by mtodt

Addendum:

I see in suite u-bx723 that app/fcm_make/file/fcm-make.cfg additionally contains

build.prop{fc.include-paths} =  /home/users/siwilson/netcdf.openmpi/include
build.prop{fc.lib-paths} =  /home/users/siwilson/netcdf.openmpi/lib /gws/nopw/j04/jules/admin/curl/curl-lotus-parallel-intel/lib

That would be the updates for JULES_NETCDF_INC_PATH and JULES_NETCDF_LIB_PATH (although I don't know how to prescribe the two lib paths in my file/format). Should I just replace them in my app/fcm_make/rose-app.conf? But there are more env variables for which I don't have updates.

Cheers
Markus

comment:11 Changed 5 months ago by pmcguire

Hi Markus:
Yes, I think that is what you can try. Just replace JULES_NETCDF_INC_PATH and JULES_NETCDF_LIB_PATH in your GL6R_irrigation u-bv464 /app/fcm_make/rose-app.conf with the values from GL7 u-bx723/app/fcm_make/file/fcm-make.cfg. I think those might be the only updated values.

You can compare to the values that are normally supplied in:
/gws/nopw/j04/jules/pmcguire/jules_build/jules-vn4.9_positiverain/etc/fcm-make/platform/jasmin-lotus-intel.cfg
and
/gws/nopw/j04/jules/jules_build/vn5.8/etc/fcm-make/platform/jasmin-lotus-intel.cfg.

Patrick
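
The replacement described above can be sketched as a small in-place edit of app/fcm_make/rose-app.conf. This is only an illustration: set_conf_value is a hypothetical helper, the paths in the usage comments are the ones quoted from u-bx723 earlier in this ticket, and how to pass the second (curl) library path through JULES_NETCDF_LIB_PATH is left open, since the thread does not settle it.

```shell
# Usage: set_conf_value FILE KEY VALUE
# Rewrites every "KEY=..." line in FILE and keeps the original as FILE.bak.
# Assumes VALUE contains no "|" or "&" characters (both are special to sed here).
set_conf_value() {
    sed -i.bak "s|^$2=.*|$2=$3|" "$1"
}

# Usage (run from the suite directory; values copied from u-bx723's fcm-make.cfg):
#   set_conf_value app/fcm_make/rose-app.conf JULES_NETCDF_INC_PATH /home/users/siwilson/netcdf.openmpi/include
#   set_conf_value app/fcm_make/rose-app.conf JULES_NETCDF_LIB_PATH /home/users/siwilson/netcdf.openmpi/lib
```

Commented-out lines such as "#JULES_SOURCE=..." are left untouched because the pattern is anchored at the start of the line.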

comment:12 Changed 5 months ago by mtodt

Thanks! I've given that a try and submitted the suite again.

Cheers
Markus

comment:13 Changed 5 months ago by mtodt

Hi Patrick

My submission failed with the following error message:

cpu-bind=MASK - host142, task  0  0 [19495]: mask 0xffff set FAILED
slurmstepd: error: Failed to invoke task plugins: task_p_pre_launch error

That's the same error I got for a post-processing script, for which I have opened a separate ticket. I assume that's a JASMIN issue and not an error I made?

Cheers
Markus

comment:14 Changed 5 months ago by pmcguire

Hi Markus:
Is it working now?
I sometimes get those slurmstepd, task_p_pre_launch errors too right now for various different jobs. Sometimes just rerunning the job allows the job to start and finish properly.
Patrick
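
For jobs submitted by hand, such as a post-processing script, the "just rerun it" workaround can be sketched as a small retry loop. This assumes a POSIX shell; retry and RETRY_DELAY are hypothetical names, and the script name in the usage comment is a placeholder.

```shell
# Usage: retry MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# RETRY_DELAY (seconds, default 30) sets the pause between attempts.
retry() {
    max=$1
    shift
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-30}"
    done
}

# Usage (sbatch --wait blocks until the job ends and returns its exit status):
#   retry 3 sbatch --wait my_postprocessing.sh
```

A loop like this only hides transient start-up failures; a job that fails for a real reason will simply fail the same way on every attempt.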

comment:15 Changed 5 months ago by mtodt

Hi Patrick

Yes, it's working now. As you said, it seems like starting a job again solves the latest problem. Sorry for only closing the other ticket, and thanks a lot for helping me with this! I understand there's been a lot of work for you JASMIN and CMS people. Really appreciate it!

Cheers
Markus

comment:16 Changed 5 months ago by mtodt

  • Resolution set to fixed
  • Status changed from new to closed