Opened 8 months ago

Last modified 6 months ago

#3475 new help

JULES and SLURM queueing time on JASMIN

Reported by: heatherrumbold
Owned by: um_support
Component: JASMIN
Keywords: JULES, SLURM, JASMIN
Cc:
Platform: JASMIN
UM Version:

Description

Hi Patrick,

I hope this email finds you safe and well. I have a question about running JULES on JASMIN. I am trying to run a suite on JASMIN with lots of serial single-site JULES tasks. I have found that the individual sites run extremely slowly, or just submit and never actually run within the wallclock time. This same suite runs all 170 sites on SPICE at the Met Office in under an hour. I realise that there may be issues with JASMIN as per the email received from the help desk today, but are there any specific settings I should be using? Is this just a feature of running in serial, or do you know of anything else I can try in order to speed things up?

The suite I’m trying to set up is the prototype for a possible future JULES benchmarking suite (u-bx465). … I have been using your instructions and example suites to set up the suite.rc, and I have something that works, but I’m not sure I have the optimal settings for SLURM or that the datasets are in the right place.

Many thanks,
Heather

Change History (23)

comment:1 Changed 8 months ago by pmcguire

Hi Heather:
Do you have an NCAS CMS Helpdesk account? Can you make a ticket on the CMS Helpdesk about this?

What queue are you using? short-serial or short-serial-4hr?
Have you looked to see how many jobs are in the queues?
What wallclock time are you requesting?
Are you using the new SLURM NETCDF libraries? (see u-al752 for example).

Patrick

comment:2 Changed 8 months ago by pmcguire

Hi Patrick,

Thanks for your quick reply. I don’t have a Helpdesk account but I have cc’d the Helpdesk email address so hopefully someone is able to create one for me please 😊

What queue are you using? short-serial or short-serial-4hr?

– I’ve tried both

Have you looked to see how many jobs are in the queues?

– How can I check this?

What wallclock time are you requesting?

--time = 2:00:00

Are you using the new SLURM NETCDF libraries? (see u-al752 for example).

– I have the same paths as this suite.


This is what I have in my suite.rc…

    [[JASMIN]]
        script = " rose task-run --quiet --path=share/fcm_make/build/bin "
        env-script = """
                eval $(rose task-env)
                export PATH=/apps/jasmin/metomi/bin:$PATH
                module load intel/19.0.0
#                module load contrib/gnu/gcc/8.2.0
                module load contrib/gnu/gcc/7.3.0
                module load eb/OpenMPI/intel/3.1.1
#                module add parallel-netcdf/intel
                module list 2>&1
                env | grep LD_LIBRARY_PATH
                export NETCDF_FORTRAN_ROOT=/home/users/siwilson/netcdf_par/3.1.1/intel.19.0.0/
                export NETCDF_ROOT=/home/users/siwilson/netcdf_par/3.1.1/intel.19.0.0/
                export HDF5_LIBDIR=/home/users/siwilson/netcdf_par/3.1.1/intel.19.0.0/lib
#                module load intel/19.0.0
                export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
                export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HDF5_LIBDIR
                env | grep LD_LIBRARY_PATH
                """
        [[[job]]]
            submission polling intervals = PT1M
            execution polling intervals = PT1M
 
    [[JASMIN_LOTUS]]
        inherit = None, JASMIN
        [[[directives]]]
            --partition = short-serial
            --constraint = "ivybridge128G"
        [[[job]]]
            batch system = slurm
 
    [[JASMIN_BACKGROUND]]
        inherit = None, JASMIN 
        [[[job]]]
            batch system = background
 
    [[FCM_MAKE_JASMIN]]
        inherit = None, JASMIN_BACKGROUND
        [[[environment]]]   
            JULES_PATH = {{ JULES_PATH }}
            NETCDF_FORTRAN_ROOT=/home/users/siwilson/netcdf.openmpi/
            NETCDF_ROOT=/home/users/siwilson/netcdf.openmpi/
 
 
    [[JULES_JASMIN]]
        inherit = None, JASMIN_LOTUS
        [[[directives]]]
            --time = 2:00:00
            --ntasks = 1
        [[[environment]]]
            JULES_DATA_DIR = {{ JULES_DATA_DIR }}
            EXPT = {{ EXPT }}
            JULES_PATH = {{ JULES_PATH }}
            DATA_DIREC = "/gws/nopw/j04/jules/data/PLUMBER2/2020-11.01/"
            ANCIL_DIREC = "/gws/nopw/j04/jules/data/PLUMBER2/2020-11.01/"
            OUTPUT_FOLDER = "{{ JULES_DATA_DIR }}/output_{{ EXPT }}/$SITE"
        [[[parameter environment templates]]]
            SITE = '%(site)s'


Many thanks,
Heather

comment:3 Changed 8 months ago by pmcguire

Hi Heather
Have you tried running recently? I ran a set of 36 SLURM array jobs on short-serial-4hr today, and they only queued for a few seconds before running.
You can find the jobs in the queue with: sprio -n -p short-serial-4hr.
If you locate your jobID in the queue, then you can see how many other jobs have a higher priority than yours.
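For example, something like this (just a sketch; the exact flags can vary between SLURM versions, but these are standard ones):

    # list the pending jobs and their (normalised) priorities in a partition
    sprio -n -p short-serial-4hr

    # roughly how many jobs are queued there (skip the header line)
    sprio -n -p short-serial-4hr | tail -n +2 | wc -l

    # your own jobs, with job id, partition, state, priority and pending reason
    squeue -u hashton001 -o "%A %P %T %Q %r"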
Patrick

comment:4 Changed 8 months ago by pmcguire

Hi Patrick,

I’m just trying it now. I was just reading this so hopefully this has sorted it!
https://www.ceda.ac.uk/blog/problem-with-lotus-batch-scheduler-slurm/

How can I find out what my job id is?

Thanks,

Heather

comment:5 Changed 8 months ago by pmcguire

Hi Heather
There are many ways to figure out what your job id is.
One way is to use:
squeue -u hashton001
Patrick

comment:6 Changed 8 months ago by pmcguire

Hi Patrick,

Thanks, I think I’ve identified it now as it appears in the cylc gui.
I have 25 sites submitted and waiting (40356707 - 40356736) in the short-serial queue, apparently queued behind 6 higher-priority tasks. Would this be enough to stop my tasks from running? My tasks were submitted at 10:50 and are still waiting. Any thoughts before I terminate this and try again on short-serial-4hr?

I think I have an account on the help desk now, would you like me to transfer this query over there?


Many thanks,
Heather

comment:7 Changed 8 months ago by pmcguire

Hi Heather:
I will try to transfer this email chain over to a ticket on your behalf.

I see that you switched to the short-serial-4hr queue, and that your jobs have already started running.

Were there really only 6 higher-priority tasks than yours in the short-serial queue? There are several thousand jobs running right now, with a few hundred waiting in the queue.
I was able to start 36 jobs in 3 different job-arrays on short-serial maybe 30 minutes ago, and they immediately started running.
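For reference, this is roughly how I check the load on a partition (a sketch with standard squeue options):

    # how many jobs are currently running / pending on short-serial
    squeue -p short-serial -t RUNNING -h | wc -l
    squeue -p short-serial -t PENDING -h | wc -l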

Patrick

comment:8 Changed 8 months ago by pmcguire

  • Reporter changed from pmcguire to heatherrumbold
  • Summary changed from heatherrumbold to JULES and SLURM queuing time on JASMIN

comment:9 Changed 8 months ago by pmcguire

  • Summary changed from JULES and SLURM queuing time on JASMIN to JULES and SLURM queueing time on JASMIN

comment:10 Changed 8 months ago by heatherrumbold

Thank you for opening the ticket Patrick!

They are running on short-serial-4hr, but very slowly! These runs complete in minutes at the Met Office, so I’m a bit suspicious as they’ve now been running for over 45 minutes!

I also can’t see anything on the short-serial-4hr queue:
[hashton001@cylc1 u-bx465]$ sprio -n -p short-serial-4hr

JOBID PARTITION PRIORITY AGE ASSOC FAIRSHARE TRES

There were definitely only 6 tasks with higher priority than mine in the short-serial queue; they looked like they had been there a while, and mine didn’t submit in the few hours they were waiting.

comment:11 Changed 8 months ago by pmcguire

Hi Heather:
Have you ever tried to run the u-al752 JULES FLUXNET suite on JASMIN?
There's a tutorial for it here:
https://research.reading.ac.uk/landsurfaceprocesses/software-examples/tutorial-rose-cylc-jules-on-jasmin/

That u-al752 suite does a lot of JULES runs for FLUXNET sites, and the runs take no more than an hour or so, if I remember right, when everything is working.
That suite uses the same SLURM libraries that you are using on JASMIN. My colleagues set up those libraries.

Normally, if people need help with their suites and we can't figure it out in email or ticket posts, we ask the reporter to give read permissions (usually for all JASMIN users) on their home directory and their roses and cylc-run directories, so that we can better decipher what's going on. Can you set your permissions that way? If you have any private or confidential items in your home directory, you should/could put them in a non-world-readable subdirectory.
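Something along these lines should do it (a sketch only; adjust the paths if your roses and cylc-run directories live somewhere else):

    # make the home directory and the suite directories world-readable
    chmod o+rx $HOME
    chmod -R o+rX $HOME/roses $HOME/cylc-run

    # keep anything private in a locked-down subdirectory
    mkdir -p $HOME/private
    chmod o-rwx $HOME/private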
Patrick

comment:12 Changed 8 months ago by heatherrumbold

I haven't tried running u-al752 yet but have been using the suite.rc as a guide for my suite:
suite.rc.CEDA_JASMIN

Permissions have been changed, so you should have the required access now.
Cheers,
Heather

comment:13 Changed 8 months ago by heatherrumbold

Hi Patrick,
I just wondered whether you have had a chance to try running this yourself yet? I have had a further look at u-al752 but can't spot anything obvious that I haven't tried already. The only thing I noticed was that in one of Karina's comments she mentions the data needing to be in the right location for it to run quickly. Karina has the data for each site contained in the suite itself, whereas I have stored mine on JASMIN (driving data and ancillaries: /gws/nopw/j04/jules/data/PLUMBER2/2020-11.01/). Would the location of the driving data be likely to cause any issues here? Also, my output is going to /home/users/hashton001/jules_output/u-bx465/output_GL8p1/$SITE_NAME - is this the best place for it to go?
Many thanks,
Heather

comment:14 Changed 8 months ago by pmcguire

Hi Heather:
I have been studying your u-bx723 suite (in subdirectories of your roses and cylc-run directories). Thanks for changing the permissions.
It does look like your JULES runs go for the full 2 hours before they hit their wall-clock limit. If you can tolerate runs longer than two hours, you might consider increasing the wall-clock time from 2 hours.

One thought comes to mind about the driving data that you ask about. Your driving data and ancillaries are on the NOPW drive ("no parallel write"). Since you're reading and not writing the data, that's OK. But it is a slower drive, and if you want higher performance, especially if you're reading files from the same directories from different JULES runs on different SLURM batch processors, then you might consider putting the driving data either in your home directory or in the scratch-pw space. There might be another high-performance PW GWS that you can get access to, but you might try your home directory or the scratch-pw drive first for the driving data.

You might also try running your suite for only one FLUXNET site instead of all of them, and see if it is still slow running the JULES.

I should also note that the short-serial-4hr queue picked up (some of) the JULES jobs from this suite (my copy of your roses/u-bx465) very quickly when I started running it. Some of the jobs have already failed after 4 minutes or so. It might be because you commented out the --constraint = "ivybridge128G" processor-type directive for the batch processing, so for some of the jobs SLURM might be trying to run them on a different processor type than the one JULES was compiled on. There are some AMD processors in there, if I recall correctly. The other submitted jobs are still running, and there are other JULES jobs still waiting to be submitted.
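In suite.rc terms, what I mean is something like this (a sketch only; the 4-hour limit is just an example, and the ivybridge128G constraint is the one you had commented out):

    [[JULES_JASMIN]]
        inherit = None, JASMIN_LOTUS
        [[[directives]]]
            --time = 4:00:00                  # raise the wall-clock limit beyond 2 hours
            --ntasks = 1
            --constraint = "ivybridge128G"    # keep the runs on the processor type JULES was compiled for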
Patrick

comment:15 Changed 8 months ago by heatherrumbold

Hi Patrick,
Many thanks for taking a look. I have managed to get it running and have noted a few things based on your comments:

  • I have had to use the short-serial queue, as --constraint = "ivybridge128G" doesn't appear to be available on the short-serial-4hr queue.
  • I am running 1 site at a time for now, but I don't think it makes any difference to how long each site takes to run.
  • It runs through the spin-up very quickly (order of minutes) and then takes considerably longer to do the main run. I don't understand this, because the spin-up is basically doing the whole of the main run! This possibly suggests that it's the writing out of the output that is taking the time, as this is the only thing the spin-up and main run do differently.
  • Setting the output directory to /work/scratch-pw/hashton001/jules_output/u-bx465/ instead of my home directory has decreased the runtime from several hours to < 6 minutes!
  • I realise this is a temporary drive, so in the long term I might need to get permissions to write to /gws/pw/j05/. Do you know if there are plans to have a jules workspace there? (Or I could just leave it up to the user to copy the data to somewhere more permanent.)
  • Moving the driving data to my home directory made very little difference to run time.

So in conclusion it seems it's the location of the output directory that is important here, and a parallel-write drive is essential for a quicker runtime!
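In suite.rc terms the change just amounts to pointing the OUTPUT_FOLDER variable from the suite.rc above at scratch-pw rather than somewhere under home, e.g. (a sketch; the scratch path is the one quoted above):

    [[JULES_JASMIN]]
        [[[environment]]]
            # write model output to the parallel-write scratch area rather than $HOME
            OUTPUT_FOLDER = "/work/scratch-pw/hashton001/jules_output/u-bx465/output_{{ EXPT }}/$SITE"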

I still need to increase the number of sites to be run at once, and hopefully I can get something comparable to what we run on MO SPICE. I will be in touch shortly via email about potentially getting yourself or Pier Luigi to test this as a prototype JULES benchmarking suite - would that be ok?

Many thanks for all your help!
Heather

comment:16 Changed 8 months ago by heatherrumbold

Quick update - I've upped the queue limit to 50; most sites are running and 83 (out of 170) have succeeded. However, a few fail halfway through with this error:

slurmstepd: error: * JOB 42293497 ON host096 CANCELLED AT 2021-03-03T15:22:19 DUE TO NODE FAILURE, SEE SLURMCTLD LOG FOR DETAILS *
2021-03-03T15:22:27Z CRITICAL - failed/EXIT

Both incidents so far have been on host096 - is there a known issue with this node?

comment:17 Changed 8 months ago by pmcguire

Hi Heather:
I do think that short-serial-4hr has 1-2 nodes of its allocation with ivybridge128G or another ivybridge type of processor. Previously, the rest of them were AMD type or something, I think. I had previously requested to JASMIN support that more ivybridge nodes be added to short-serial-4hr. I am not sure whether any have been added or not.

That's a superb demonstration of the speedup from using scratch-pw for non-parallel output over the home directory. I think the home directory also has parallel-write capability, but it might be lower performance than scratch-pw.

I don't think there is a jules GWS (or plans for one) on the parallel-write partition /gws/pw/j05.

The parallel-write disks are much more expensive than the non-parallel-write disks.

Historically, I have run JULES on JASMIN with much of my output going to scratch. This has been for either the gridded JULES runs or the multiple FLUXNET single-point JULES runs. I later copied that output from scratch to a GWS (it could be a nopw GWS). I don't think it's unreasonable to ask this of the users.
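The copy-back step is easy enough to script at the end of a run, something like this (a sketch; the GWS path here is hypothetical, substitute one you have write access to):

    # copy finished output from scratch-pw to a more permanent (nopw) GWS,
    # then tidy up the scratch copy
    SRC=/work/scratch-pw/hashton001/jules_output/u-bx465
    DEST=/gws/nopw/j04/some_gws/hashton001/jules_output/u-bx465   # hypothetical destination
    rsync -av "$SRC/" "$DEST/" && rm -rf "$SRC"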

Do you have GWS access on a nopw partition besides the jules nopw GWS? The jules GWS on the nopw partition is not really the best place to put model output (especially large amounts of it), due to limited space.

I have heard about or seen node failures in the past, but I didn't know about host096 in particular. Maybe you can report that to the JASMIN Helpdesk?

Yes, please do get in contact with me and Pier Luigi about the possibility of testing your suite.
Patrick

comment:18 Changed 7 months ago by heatherrumbold

Hi Patrick,
I seem to be having issues with queuing times again. I can't get any task from my suite beyond the queuing stage on the short-serial queue (it times out after 24 hours), and short-serial-4hr starts fine (i.e. a short queuing time) but then fails with the following error:

Please verify that both the operating system and the processor support Intel(R) X87, CMOV, MMX, FXSAVE, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2 and POPCNT instructions.

[host642.jc.rl.ac.uk:47987] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 532
[host642.jc.rl.ac.uk:47987] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 166
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[host642.jc.rl.ac.uk:47987] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[FAIL] rose-jules-run <<'__STDIN__'
[FAIL] 
[FAIL] '__STDIN__' # return-code=1
2021-03-24T11:24:55Z CRITICAL - failed/EXIT 

Do you have any ideas what might be going on here, before I report it to the Jasmin help desk?

There seems to be an enormous number of tasks queuing on short-serial right now. Is there any way I can increase the priority of my suite? I seem to always be at the bottom of the queue and barely moving up.

Aside from these issues on JASMIN (which may not be anything to do with my suite specifically!), I have the first official version ready for you to trial and feed back on, if you are still happy to do so? I will be in touch via email shortly.

Many thanks,
Heather

comment:19 Changed 7 months ago by pmcguire

Hi Heather:
I will try to write more when I can.
But the short-serial-4hr queue uses a lot of AMD processors. There are only a few, if any, ivybridge processors there. You need to make sure you compile JULES on the same node type as you run it on.
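One way to do that (a sketch only, based on the families in your suite.rc, and keeping FCM_MAKE_JASMIN's existing [[[environment]]] block) is to submit the fcm_make build to LOTUS with the same constraint as the runs, rather than building in the background on the cylc machine:

    [[FCM_MAKE_JASMIN]]
        # build on LOTUS with the same processor constraint as the JULES runs
        inherit = None, JASMIN_LOTUS
        [[[directives]]]
            --time = 1:00:00
            --ntasks = 1
            --constraint = "ivybridge128G"

    [[JULES_JASMIN]]
        inherit = None, JASMIN_LOTUS
        [[[directives]]]
            # same constraint here, so build and run land on the same node type
            --constraint = "ivybridge128G"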
Patrick

comment:20 Changed 7 months ago by heatherrumbold

Hi Patrick,
I thought I was, but I'll run rose suite-run --new just in case Rose has got itself in a muddle. I'm not having much luck running on either queue at the moment, so I will try again later. However, I've just tried the test queue for a dozen or so sites and they run fine, so at least I can confirm there is nothing wrong with the JULES part of the suite.
Many thanks,
Heather

comment:21 Changed 7 months ago by pmcguire

Hi Heather:
Are things working better now for you?
Patrick

comment:22 Changed 6 months ago by heatherrumbold

Hi Patrick,

Both my JASMIN suites are running now, but I've had to resort to using the test queue in both cases in order to check that they work, as all other queues were taking too long and the suite was timing out. Most recently I have been testing my GL9 standard suite (CRU-NCEP N96), which ran quickly when I first set it up back in November last year but now spends over 24 hours queuing. Have you or any other users experienced the same queuing times, or is it just me?! Is it just that there has been a massive increase in the number of users on JASMIN, or has the Met Office been bumped to the bottom of the queue?! I plan to push both of these suites out to the JULES community soon, but I'm not sure that will be sensible if the queuing takes this long. I'm hoping it's just me and the way I've set things up! If you have any thoughts please let me know!

Many thanks,
Heather

comment:23 Changed 6 months ago by pmcguire

Hi Heather
I am glad that both of your JASMIN suites are running now.

Yes, long queueing times on JASMIN have been rather common since the SLURM upgrade last fall. But it's quite possible to sometimes get through the queues fairly quickly, depending on the load.

I guess your suite starts 170 short-serial jobs, since that's how many FLUXNET sites are in your list. That number of sites might take a while to get through the short-serial queue. But there are other people who are submitting more than 170 jobs at once to the short-serial queue.

Probably, if you can do it, as I suggested before, you can use the short-serial-4hr queue, provided each job takes less than 4 hours. But then it might be more difficult to make sure JULES is compiled on the same processor type as it runs on, since there are fewer Intel processors in the short-serial-4hr queue. The queueing time for short-serial-4hr is often less than for short-serial. If you're careful about which processor type JULES is compiled on, then you can use --partition = short-serial-4hr,short-serial or something like that, and the job will go into the first queue available.
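In the [[[directives]]] section that would look something like this (a sketch; SLURM starts the job in whichever of the listed partitions becomes available first):

    [[JULES_JASMIN]]
        [[[directives]]]
            # try both partitions; the job runs in whichever can start it first
            --partition = short-serial-4hr,short-serial
            --time = 2:00:00
            --ntasks = 1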
Patrick
