Opened 6 months ago

Closed 2 months ago

#3391 closed help (answered)

WFDEI JULES suites on SLURM on JASMIN?

Reported by: rfu Owned by: jules_support
Component: JULES Keywords: SLURM, JULES, JASMIN, WFDEI
Cc: Platform: JASMIN
UM Version:

Description

Hi Patrick,

[Can I get some help with converting my JULES suites to SLURM on JASMIN?]
I hadn’t even started looking at how to get things running with SLURM. I had been relying on the old LSF system, which still seemed to be running the last time I tried, but I don’t know whether it is still available.

Thanks,

Becky

Change History (11)

comment:1 Changed 6 months ago by pmcguire

Hi Becky:
Can you try to make the changes as summarized in the changeset in this comment of the CMS Helpdesk ticket below?
http://cms.ncas.ac.uk/ticket/3376#comment:21
Those have the changes to make it work on SLURM.
Those use the links to the MPI NETCDF libraries from before they were copied to the jules GWS, but they should still work. I will try to check in the changes to use the jules GWS version of the MPI NETCDF libraries.
Can I make a new ticket for you for this and put it on the NCAS CMS Helpdesk?
Patrick

comment:2 Changed 6 months ago by pmcguire

Thanks Patrick,
I will give this a go. Yes, please make a ticket for it.
Becky

comment:3 Changed 6 months ago by pmcguire

Hi Becky:
Were you able to try this out?
Patrick

comment:4 Changed 6 months ago by rfu

Hi Patrick,

I am trying it out right now! The suite compiles successfully on SLURM and then tries to submit the first part of the job, but it just sits doing nothing, so I am unsure whether it is all working OK at the moment. How long should it take from compiling to the job being submitted? Is it usual on SLURM for the job to sit doing nothing for ages (so far, all morning)?

Thanks,
Becky

comment:5 Changed 6 months ago by pmcguire

Hi Becky:

Are you using the par-multi queue? I am not sure how long the waiting times are in the par-multi queue with SLURM. How many processors are you requesting? Are you asking for exclusive access to a node?

The short-serial queue has had long wait times since the SLURM changeover, sometimes a day or more. But the JASMIN folks have been trying to shorten those queue waiting times.

Patrick

comment:6 Changed 6 months ago by rfu

Hi Patrick,

I think the wait time yesterday was a JASMIN thing. Today the suite is submitting OK without waiting around for too long and appears to be running. Thank you!! Can I ask, though: I am submitting to the par-multi queue, but what should I set MPI_NUM_TASKS and OMP_NUM_THREADS to? Currently I have 10 and 1 respectively. Is that right?

Thanks,
Becky

comment:7 Changed 6 months ago by pmcguire

Hi Becky:
I am glad the queuing time has gone down. Hopefully it's running OK.

MPI_NUM_TASKS=10 and OMP_NUM_THREADS=1 is what was working before the SLURM upgrade for the GL7 suite. When they were trying to run it with MPI_NUM_TASKS=16 with par-single, the queueing time was extremely long. I think when we were running the GL6R suite with par-multi and MPI_NUM_TASKS=32 before SLURM, it was working OK and the queueing time was reasonable.

After the SLURM upgrade, MarkusT was able to get the GL6R suite for the China region running 3x faster (35 hrs vs 100 hrs) with 16 processors in par-multi. It may be important to compile and run with the --constraint = ivybridge128G.
Patrick
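
For readers converting a suite, the settings discussed in this comment can be sketched as SLURM batch directives. This is only an illustration under stated assumptions, not the suite's actual configuration: MPI_NUM_TASKS is assumed to map to `--ntasks` and OMP_NUM_THREADS to `--cpus-per-task`, and the helper name `slurm_directives` is hypothetical.

```python
def slurm_directives(mpi_num_tasks, omp_num_threads,
                     partition="par-multi", constraint=None):
    """Return '#SBATCH' header lines for the given settings.

    Assumption: MPI_NUM_TASKS -> --ntasks and
    OMP_NUM_THREADS -> --cpus-per-task. Check this against how the
    suite actually passes these values to the scheduler.
    """
    if mpi_num_tasks < 1 or omp_num_threads < 1:
        raise ValueError("task and thread counts must be positive")
    lines = [
        f"#SBATCH --partition={partition}",
        f"#SBATCH --ntasks={mpi_num_tasks}",
        f"#SBATCH --cpus-per-task={omp_num_threads}",
    ]
    # The node-type constraint is optional; add it only when requested.
    if constraint:
        lines.append(f"#SBATCH --constraint={constraint}")
    return lines


if __name__ == "__main__":
    # Values quoted in the ticket: 10 MPI tasks, 1 OpenMP thread.
    print("\n".join(slurm_directives(10, 1, constraint="ivybridge128G")))
```

Printing the header before submitting makes it easy to compare a 10-task and a 16-task request side by side.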

comment:8 Changed 6 months ago by rfu

Hi Patrick,

Hope you are well. I noticed you have some things running on the cylc1 machine. Is it behaving for you? Thanks to your help I have managed to change some of my JULES rose suites to run with SLURM. At least they run and generate output, which I take as a good sign. The problem is that today and yesterday the runs have been failing for no apparent reason: there are no errors, they just get kicked off after completing a few years of the run. It is driving me insane!! Is this just me or are you having issues too?

Thanks,

Becky

comment:9 Changed 6 months ago by pmcguire

Dear Becky:
Are your jobs still failing on cylc1?
What messages did you get when your jobs were kicked off? Anything?
You said there were no error messages (which I assume means none in the .err files),
but maybe there was something in the .out, job-activity.log or .status files?
Patrick
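
One way to run the check suggested above is to scan the job log files for lines that hint at why a job was stopped. The sketch below is a rough aid, not an established tool: the function name `scan_job_logs` is hypothetical, and the keyword list is a guess at typical scheduler messages, so adjust both to match what your logs actually contain.

```python
import re
import sys
from pathlib import Path

# File names taken from the question above: .err, .out, .status files
# and job-activity.log.
SUFFIXES = (".err", ".out", ".status", "job-activity.log")

# Assumption: illustrative keywords only, not an exhaustive list of
# messages that SLURM or cylc may write.
PATTERN = re.compile(r"cancel|time limit|killed|out of memory|signal",
                     re.IGNORECASE)


def scan_job_logs(log_dir):
    """Return (path, line_number, text) for each suspicious log line."""
    hits = []
    for path in sorted(Path(log_dir).rglob("*")):
        if not path.is_file() or not path.name.endswith(SUFFIXES):
            continue
        # errors="replace" avoids crashing on stray non-UTF-8 bytes.
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if PATTERN.search(line):
                    hits.append((str(path), number, line.rstrip()))
    return hits


if __name__ == "__main__":
    # Pass the job log directory of the failed task as the only argument.
    for path, number, text in scan_job_logs(sys.argv[1]):
        print(f"{path}:{number}: {text}")
```

An empty result does not prove that nothing went wrong. It only means none of the guessed keywords appeared in those files.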

comment:10 Changed 4 months ago by grenville

  • Status changed from new to pending

comment:11 Changed 2 months ago by ros

  • Resolution set to answered
  • Status changed from pending to closed