Opened 5 months ago

Closed 3 months ago

#3390 closed help (fixed)

Running JULES on SLURM on Jasmin

Reported by: sarahchadburn
Owned by: jules_support
Component: JULES
Keywords: JULES, SLURM, MPI, NETCDF, JASMIN
Cc:
Platform: JASMIN
UM Version:

Description

Dear Patrick,

Since the lotus queues are closing down and none of my jobs will run any more, I am attempting to set up my suite to run on SLURM. I am stuck with an error (pasted below the email) relating to netcdf, and when I google "slurm netcdf jasmin", your name appears! So I wondered if you could help me…?

Do you have a suite that is running on slurm that I could have a look at to see what I might be doing wrong?

Thank you very much in advance!
All the best,
Sarah

[FAIL] /home/users/schadburn/cylc-run/slurm_test1/share/fcm_make/preprocess/src/jules/src/io/file_handling/core/drivers/ncdf/driver_ncdf_mod.F90(139): error 7002: Error in opening the compiled module file. Check INCLUDE paths. [NETCDF]
[FAIL] USE netcdf, ONLY: &
[FAIL] ——

Change History (28)

comment:1 Changed 5 months ago by pmcguire

Hi Sarah:
Can you try to make the changes as summarized in the changeset in this comment of the CMS Helpdesk ticket below?
http://cms.ncas.ac.uk/ticket/3376#comment:21
That changeset has the changes needed to make it work on SLURM.
Those changes still point to the MPI NETCDF libraries at their old locations, before they were copied to the jules GWS, but they should still work. I will try to check in changes to use the jules GWS copies of the MPI NETCDF libraries.
Can I make a new ticket for you for this and put it on the NCAS CMS Helpdesk?
Do you already have a CMS Helpdesk account?
Patrick

comment:2 Changed 5 months ago by pmcguire

Hi Patrick,

Thanks a lot, this thread contained what I was looking for: a suite with the updated "env-script" (bx723). I copied it from this file
https://code.metoffice.gov.uk/trac/roses-u/browser/b/x/7/2/3/trunk/suite.rc

And now it seems to be working! At least, JULES has compiled, and it's submitted the runtime jobs but not started running, probably because the queues are clogged up at the minute.

I don't have a CMS Helpdesk account as far as I know.

Thanks again!
Sarah

comment:3 Changed 5 months ago by pmcguire

Hi Sarah:
I am glad it worked so far!
Yes, that env script is where Simon Wilson and Dave Case updated the MPI NETCDF libraries. I will try to update that script with the jules GWS copies soon.
I don't know exactly what your suite is doing, but there might be other changes besides the env-script that could be needed to get things working efficiently on JASMIN.

Can I request a CMS Helpdesk account be made for you?

Patrick

comment:4 Changed 5 months ago by pmcguire

Hi Patrick,

Thanks, yes please do request a helpdesk account!

In case it tells you anything, the JASMIN-specific part of my suite now looks like this (pasted below); the fcm part runs instantly and the jules part hasn't started.

Best,
Sarah

    {% if LOCATION=='jasmin' %}
        [[linux]]
            env-script = """
                eval $(rose task-env)
                export PATH=/apps/jasmin/metomi/bin:$PATH
                module load intel/19.0.0
#                module load contrib/gnu/gcc/8.2.0
                module load contrib/gnu/gcc/7.3.0
                module load eb/OpenMPI/intel/3.1.1
#                module add parallel-netcdf/intel
                module list 2>&1
                env | grep LD_LIBRARY_PATH
                export NETCDF_FORTRAN_ROOT=/home/users/siwilson/netcdf_par/3.1.1/intel.19.0.0/
                export NETCDF_ROOT=/home/users/siwilson/netcdf_par/3.1.1/intel.19.0.0/
                export HDF5_LIBDIR=/home/users/siwilson/netcdf_par/3.1.1/intel.19.0.0/lib
#                module load intel/19.0.0
                export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
                export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HDF5_LIBDIR
                env | grep LD_LIBRARY_PATH
                         """
            [[[job]]]
                submission polling intervals = PT1M
                execution polling intervals = PT1M
            [[[environment]]]
                JULES_FCM      = {{ JULES_FCM_JAS }}

        [[fcm_make]]
            inherit = None, linux
            [[[job]]]
                batch system = background
                execution time limit = PT10M
            [[[directives]]]
                -p = short-serial
                --time = 00:30
                -n = 1
            [[[environment]]]
                JULES_BUILD      = normal
                JULES_OMP        = noomp
                JULES_SOURCE     = $JULES_FCM$AT_JULES_REVISION
                JULES_PLATFORM = jasmin-lotus-intel

        [[jules]]
            script = "rose task-run --path=share/fcm_make/build/bin"
            inherit = None, linux
            [[[job]]]
                batch system = slurm
            [[[environment]]]
                MPI_NUM_TASKS   = 1
                OMP_NUM_THREADS = 1
                NPROC           = $MPI_NUM_TASKS
                ROSE_TASK_APP   = jules
                ROSE_LAUNCHER   = mpirun.lotus
            [[[directives]]]
                --partition = short-serial
                --time = 23:50
                --ntasks = 1
    {% endif %}

comment:5 Changed 5 months ago by pmcguire

Hi Sarah:
Did the JULES runs finish on SLURM? How does the output look?

You could, though, maybe change the wallclock time from 23:50 to 24:00; the SLURM scheduler should still accept that despite the 24:00 limit.

And for the background fcm_make, I don't think it gets submitted to SLURM, so maybe the reference to the short-serial queue is not needed. But I am not 100% sure about that.
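For example (just a sketch to illustrate the point, not a required change): with batch system = background, the task simply runs as a process on the host it is submitted from, and the [[[directives]]] are only read when a real batch system such as slurm is named:

        [[fcm_make]]
            [[[job]]]
                batch system = background    # runs on the submission host
            # [[[directives]]] such as -p = short-serial are passed to the
            # batch scheduler, so with "background" they have no effect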

And maybe you can use mpirun instead of mpirun.lotus? That is a change that DaveC made in the suite, which maybe you didn't notice.
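That is, a one-line change in the [[jules]] [[[environment]]] section (a sketch of the idea):

                ROSE_LAUNCHER   = mpirun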

Patrick

comment:6 Changed 4 months ago by sarahchadburn

Hi Patrick,

Thanks a lot for following up on this, I've been busy with other things so I've only just got around to it again…
And now I actually can't log in to Jasmin at all (see below)! Any idea about this one?

Thanks again,
Sarah

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: POSSIBLE DNS SPOOFING DETECTED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The RSA host key for login.jasmin.ac.uk has changed,
and the key for the corresponding IP address 130.246.130.165
is unknown. This could either mean that
DNS SPOOFING is happening or the IP address for the host
and its host key have changed at the same time.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.
The fingerprint for the RSA key sent by the remote host is
be:da:07:5d:72:6c:a2:2e:54:06:b1:3a:e0:99:ae:e7.
Please contact your system administrator.
Add correct host key in /home/links/sec234/.ssh/known_hosts to get rid of this message.
Offending key in /home/links/sec234/.ssh/known_hosts:17
RSA host key for login.jasmin.ac.uk has changed and you have requested strict checking.
Host key verification failed.

comment:7 Changed 4 months ago by pmcguire

Hi Sarah:
I don't think login.jasmin.ac.uk exists. It should be login1.jasmin.ac.uk.
Does that help?
Patrick

comment:8 Changed 4 months ago by pmcguire

Hi Sarah:
It looks like login.jasmin.ac.uk is an alias for login1.jasmin.ac.uk.
I tried the former with ssh, and I was able to log in, and it logged me in to the latter address.
So I am not sure why it's not working.

Maybe you can try to delete your entries for login.jasmin.ac.uk and login1.jasmin.ac.uk in your known_hosts file and try again?
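For example, one way to clear the old entries (a sketch, assuming the default ~/.ssh/known_hosts location) is:

    ssh-keygen -R login.jasmin.ac.uk
    ssh-keygen -R login1.jasmin.ac.uk

ssh will then offer the new host key for confirmation the next time you connect.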

Patrick

comment:9 Changed 4 months ago by sarahchadburn

Thanks Patrick. It actually just stopped doing that of its own accord, so perhaps someone really was "doing something nasty"!

Anyway, now that I am back online, I tried changing the mpirun.lotus to mpirun, but either way I get this identical error message from the JULES runs, which are now failing immediately (apologies, I didn't pick this up before because they were queueing):

"Please verify that both the operating system and the processor support Intel® X87, CMOV, MMX, FXSAVE, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2 and POPCNT instructions.

[FAIL] rose-jules-run <<'STDIN'
[FAIL]
[FAIL] 'STDIN' # return-code=1
2020-10-20T10:00:12Z CRITICAL - failed/EXIT"

comment:10 Changed 4 months ago by sarahchadburn

PS I have followed all the instructions on this page: https://code.metoffice.gov.uk/trac/jules/wiki/RoseJULESonJASMIN

and it still doesn't ask me for my MOSRS password when I log into cylc1.jasmin.ac.uk. I'm not sure what I'm doing wrong. Any help appreciated.

Thanks a lot,
Sarah

comment:11 Changed 4 months ago by pmcguire

Hi Sarah:
It looks like you have an extra line in your .bash_profile file with:

[[ $- != *i* ]] && return # Stop here if not running interactively

Does it help to get rid of that line?
Patrick

comment:12 Changed 4 months ago by sarahchadburn

Thanks for the suggestion. I have tried removing this from both .bash_profile and .bashrc (in turn), which does not seem to change anything. I presume this is in response to my comment about the MOSRS password and I wouldn't expect this to help with the JULES runs?

Thanks again!
Sarah

comment:13 Changed 4 months ago by pmcguire

Hi Sarah:
Yes, that is for the MOSRS password. You need that line in the .bashrc file and not the .bash_profile file, like in the docs.
You can compare to my ~pmcguire/.bashrc and ~pmcguire/.bash_profile if you want.
Patrick

comment:14 Changed 4 months ago by pmcguire

Hi Sarah:
To mention again, for the JULES runs: for the "background" fcm_make, I don't think it gets submitted to SLURM, so maybe the reference to the short-serial queue is not needed. But I am not 100% sure about that. Your file u-an231_slurm/include/jasmin/suite.rc suggests that you're trying to compile on the short-serial LOTUS queue.

Also, the short-serial queue might still be overloaded at the moment. The JASMIN folks recently created a short-serial-4hr queue (for jobs of less than 4 hours), which might have a much quicker turnaround.

Patrick

comment:15 Changed 4 months ago by sarahchadburn

Hi Patrick,

Thanks for the reply! The fcm_make runs fine (but I will remove that reference anyway, just to be sure), and the JULES jobs are no longer queueing, so I don't think short-serial is overloaded. They start straight away but just give the error message in my post above!

Cheers,
Sarah

comment:16 Changed 4 months ago by pmcguire

Hi Sarah:
The fcm_make might appear to work fine, but the executable it creates may not be compatible with LOTUS & SLURM if it is compiled on the short-serial queue. JASMIN has a heterogeneous architecture, and jobs can run on different processor types unless one is specified. If JULES is compiled on one processor type and you try to run that executable on a different processor type, then you might get errors like the ones you show, especially if some sort of compiler optimization is enabled.
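As a quick check (a hedged aside, not something from the suite): /proc/cpuinfo lists the instruction-set flags a node advertises, so you can compare the node that built JULES with the node that ran it, e.g.

    grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -E 'sse4_2|avx|popcnt'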
Patrick

comment:17 Changed 4 months ago by sarahchadburn

Ah, interesting! It is running now without the short-serial reference so I'll let you know how it goes.

Thanks,
Sarah

comment:18 Changed 4 months ago by pmcguire

Hi Sarah:
Excellent!
I also tried to run a copy of your u-an231_slurm suite. And I get the same errors.

You might want to edit your u-an231_slurm/include/jasmin/suite.rc file so that it uses the short-serial-4hr queue instead of the short-serial queue. The short-serial has been taking 12-24 hours or more to start running jobs lately, and the short-serial-4hr queue is much quicker now.
Patrick

comment:19 Changed 4 months ago by sarahchadburn

Hi Patrick,

Unfortunately I am getting exactly the same error.
So I actually tried compiling on short-serial, thinking that it might then be consistent with *running* on short-serial. It gives me the same error messages in fcm_make, but more of them. For example:

"[FAIL] mpif90 -oo/water_constants_mod.o -c -DSCMA -DBL_DIAG_HACK -DINTEL_FORTRAN -I./include -I/home/users/siwilson/netcdf_par/3.1.1/intel.19.0.0include -heap-arrays -fp-model precise -traceback /home/users/schadburn/cylc-run/slurm_test4/share/fcm_make/preprocess/src/jules/src/params/standalone/water_constants_mod_jls.F90 # rc=1
[FAIL]
[FAIL] Please verify that both the operating system and the processor support Intel® X87, CMOV, MMX, FXSAVE, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2 and POPCNT instructions."

I wonder if mpif90 is somehow the issue here…? Apologies I don't have a good understanding of such things.

Thank you for all the help.
Sarah

comment:20 Changed 4 months ago by pmcguire

Hi Sarah:
I also see that the libraries are not defined for fcm_make:
These two lines need to be added to the end of your u-an231_slurm/app/fcm_make/file/fcm-make.cfg file:

build.prop{fc.include-paths} = /home/users/siwilson/netcdf.openmpi/include

build.prop{fc.lib-paths} = /home/users/siwilson/netcdf.openmpi/lib /gws/nopw/j04/jules/admin/curl/curl-lotus-parallel-intel/lib

You also need to have the NETCDF libraries defined in your file u-an231_slurm/include/jasmin/suite.rc:

        [[fcm_make]]
            inherit = None, linux
            [[[job]]]
                batch system = background
                execution time limit = PT10M
            [[[directives]]]
                --time = 00:30
                -n = 1
            [[[environment]]]
                JULES_BUILD      = normal
                JULES_OMP        = noomp
                JULES_SOURCE     = $JULES_FCM$AT_JULES_REVISION
                JULES_PLATFORM = jasmin-lotus-intel
                NETCDF_FORTRAN_ROOT=/home/users/siwilson/netcdf.openmpi/
                NETCDF_ROOT=/home/users/siwilson/netcdf.openmpi/

Does that help?
Patrick

comment:21 Changed 4 months ago by sarahchadburn

Thanks for these, I've put them in, but sadly I still get exactly the same errors when I compile on 'background'.

I am trying again to get fcm_make to run on short-serial, thinking that this might help with consistency. However it keeps timing out after a couple of minutes, which is weird given that I've asked for 30 minutes (and I assume that the PT10M thing means 10 minutes?)

"cpu-bind=MASK - host624, task 0 0 [1441]: mask 0x20 set
slurmstepd: error: * JOB 19984948 ON host624 CANCELLED AT 2020-10-20T16:28:51 DUE TO TIME LIMIT *
2020-10-20T15:28:51Z CRITICAL - failed/EXIT"

Any thoughts?

Thanks a lot,
Sarah

comment:22 Changed 4 months ago by pmcguire

Hi Sarah:
It looks like all the processors/nodes in the short-serial-4hr queue are of the epyctwo1024G AMD type.

The ones in the short-serial queue are of more varied types, including ivybridge, haswell, and broadwell.
The ivybridge for example is an Intel type of processor.
But we need to queue longer to get running in the short-serial queue.

You can see the processor types for different LOTUS/SLURM nodes with this command:
sinfo -l -N | grep short-serial

The sci2 virtual machine (VM) uses Intel processors, as you can see with :
more /proc/cpuinfo
when you are logged in to sci2.

Since we can't currently specify the processor type to be Intel for the short-serial-4hr queue and since the MPI libraries were compiled with the Intel compiler, one suggestion I have is that you can use the short-serial queue and the ivybridge128G processor type both for fcm_make and for jules. But the queueing time might be a while.
I made some other changes as well:

        [[fcm_make]]
            inherit = None, linux
            [[[job]]]
                batch system = slurm
                execution time limit = PT30M
            [[[directives]]]
                --partition = short-serial
                --constraint = ivybridge128G
                --time = 00:30
                --ntasks = 1
            [[[environment]]]
                JULES_BUILD      = normal
                JULES_OMP        = omp
                JULES_SOURCE     = $JULES_FCM$AT_JULES_REVISION
                JULES_PLATFORM = jasmin-lotus-intel
                NETCDF_FORTRAN_ROOT=/home/users/siwilson/netcdf.openmpi/
                NETCDF_ROOT=/home/users/siwilson/netcdf.openmpi/

        [[jules]]
            script = "rose task-run --path=share/fcm_make/build/bin"
            inherit = None, linux
            [[[job]]]
                batch system = slurm
            [[[environment]]]
                MPI_NUM_TASKS   = 1
                OMP_NUM_THREADS = 1
                NPROC           = $MPI_NUM_TASKS
                ROSE_TASK_APP   = jules
                ROSE_LAUNCHER   = mpirun
                ROSE_LAUNCHER_PREOPTS = -n $MPI_NUM_TASKS
                ROSE_LAUNCHER_ULIMIT_OPTS = -s unlimited -c unlimited
            [[[directives]]]
                --partition = short-serial
                --constraint = ivybridge128G
                --time = 00:30
                --ntasks = 1

Yes, you probably want to switch to PT30M.

I am trying this now.
Does it work for you?
Patrick

comment:23 Changed 4 months ago by sarahchadburn

Hi Patrick,

Thanks for all the info! I've tried this (copied what you have here) and I still get the weird timeout error, after 1 minute.
"slurmstepd: error: * JOB 20020923 ON host223 CANCELLED AT 2020-10-20T20:26:37 DUE TO TIME LIMIT"

Very strange…

comment:24 Changed 4 months ago by sarahchadburn

Aha, I have figured out this part! 00:30 means 30 seconds according to slurm (while it meant 30 minutes for lotus). So I've changed those to 00:30:00 and now I don't get the timeout error, and fcm_make successfully runs on short-serial.
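For reference (this is just the sbatch convention, not anything suite-specific): SLURM's --time accepts "minutes", "minutes:seconds", "hours:minutes:seconds", "days-hours", "days-hours:minutes" and "days-hours:minutes:seconds", so for example:

    --time = 00:30       # minutes:seconds, i.e. 30 seconds
    --time = 00:30:00    # hours:minutes:seconds, i.e. 30 minutes
    --time = 23:50:00    # just under the 24-hour partition limit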

comment:25 Changed 4 months ago by pmcguire

Hi Sarah:
Yes, I just figured that out too! I was just logging in to tell you!

If short-serial takes too long to start running your queued jobs, then you might be able to try the test queue instead of short-serial. That is working for me right now. But it only allows 8 jobs per user.
Patrick

comment:26 Changed 4 months ago by sarahchadburn

And now JULES is running too!! Fantastic. It didn't like the two lines you added:

ROSE_LAUNCHER_PREOPTS = -n $MPI_NUM_TASKS
ROSE_LAUNCHER_ULIMIT_OPTS = -s unlimited -c unlimited

I think this is because it doesn't seem to understand the short options like "-s" and "-n"; it wants them written out in full (e.g. --ntasks=).
But I just removed those two lines and it ran, which is a huge relief.

Thank you for all the help today!
Sarah

comment:27 Changed 4 months ago by pmcguire

You're welcome, Sarah!
I had commented out the ROSE_LAUNCHER_ULIMIT_OPTS = -s unlimited -c unlimited line too, and it ran.

Do you know about rose suite-run --reload? That's a way to reload things after making changes.
It can help to skip the compiling for example.
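For example (a sketch, run from the suite directory on the cylc server):

    rose suite-run --reload

re-installs the suite files and reloads the definition into the already-running suite, rather than restarting it and re-running tasks like fcm_make from scratch.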
Patrick

comment:28 Changed 3 months ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed