Opened 3 years ago

Closed 3 years ago

#2455 closed help (fixed)

slow job throughput

Reported by: marcus
Owned by: um_support
Component: NEXCS
Keywords: queueing time
Cc: luke.abraham@…
Platform: NEXCS
UM Version: 10.6


Hi, I am currently running two hindcast simulations (u-aw103 and u-aw881) on NEXCS.

Since yesterday they have spent most of their time queueing, whereas before they started running almost instantly. Monsoon told me that they are not aware of any HPC issues and that the fairshare allocation looks good, so they referred me to NCAS to check whether there are any problems.

Restarting the jobs has made no difference. They were first 'held' in gcylc, and since I triggered them to run they have been in the queue for four hours. Is this a sudden increase in demand?

Many thanks,

Change History (8)

comment:1 Changed 3 years ago by marcus

Quick update: the queueing time for my jobs has suddenly risen to between 8 and 20 hours. Am I the only one experiencing this increase?

comment:2 Changed 3 years ago by willie

Hi Marcus,

There are no problems that we're aware of. The queuing time for jobs depends on the resources requested. Asking for a large number of nodes for a long time is likely to result in a long queue time. Obviously it depends on what other users are doing too.


comment:3 Changed 3 years ago by marcus

Hi Willie,

Thank you. Is there a way for me to see what other users are doing? For the past 3 weeks this job (and all previous jobs) ran with virtually no queueing time, model cycle after cycle, on NEXCS. Now, suddenly, I am waiting up to 20 hours between cycles. Is it realistic that there has been such a dramatic increase in HPC activity? That seems hard to believe. What could I check to make sure it's not my job?

Many thanks,

comment:4 Changed 3 years ago by marcus

If this persists we will not meet our deadlines. Until now the model ran at approximately 1 model year per day; now it runs at one model month per day. At that rate my 60-year hindcast simulation will take forever. There must be a problem, I just don't know what to check.
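To put the slowdown in numbers, here is a quick back-of-the-envelope sketch. It only uses the approximate rates quoted above; the function name is made up for illustration.

```python
def days_to_finish(model_years, model_months_per_day):
    """Wall-clock days needed to simulate model_years at the given rate."""
    return model_years * 12 / model_months_per_day


hindcast_years = 60

# Approximate rates from this ticket: about 1 model year (12 model months)
# per day before the slowdown, about 1 model month per day now.
before = days_to_finish(hindcast_years, 12)
after = days_to_finish(hindcast_years, 1)

print(f"before: {before:.0f} days, now: {after:.0f} days")
```

At the earlier rate the 60-year hindcast needs roughly 60 days of wall-clock time; at the current rate it needs roughly 720 days.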

comment:5 Changed 3 years ago by ros

Hi Marcus,

qstat is saying that the collaboration trust zone is currently running at full capacity. I can see that nexcs-n02 is currently receiving a negative 3-hour fairshare boost, which will also impact things. It is also possible that usage has increased in the "research" trust zone (Met Office internal), which NEXCS/Monsoon jobs can be scheduled to run on if there is space available. If we find out anything else we'll let you know.


comment:6 Changed 3 years ago by ros

Hi Marcus,

We suggest that you try changing your suite to run in the 24-hour queue (long24) rather than the 4-hour (normal) one. You will obviously need to adjust the wallclock time requested and the cycling frequency of the suite accordingly.
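As a rough illustration of why longer cycles can help when queue waits dominate, the sketch below estimates throughput for jobs that run one after another, each waiting in the queue before it starts. All numbers and names here are hypothetical, and it assumes the queue wait is the same in both queues, which may not hold in practice.

```python
def model_months_per_day(months_per_job, run_hours, queue_hours):
    """Model months completed per wall-clock day when jobs run serially and
    each job queues for queue_hours before running for run_hours."""
    return months_per_job * 24.0 / (run_hours + queue_hours)


# Hypothetical short cycle: 1 model month in 2 hours of run time.
short_jobs = model_months_per_day(months_per_job=1, run_hours=2, queue_hours=20)

# Hypothetical long cycle: 6 model months in 12 hours of run time.
long_jobs = model_months_per_day(months_per_job=6, run_hours=12, queue_hours=20)

print(f"short cycles: {short_jobs:.2f} model months/day")
print(f"long cycles:  {long_jobs:.2f} model months/day")
```

With the same 20-hour wait per job, the longer cycle pays the queueing cost once per six model months instead of once per model month, so under these assumptions more of each day is spent running.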


comment:7 Changed 3 years ago by marcus

OK, thank you Ros, I will try this.

comment:8 Changed 3 years ago by willie

  • Resolution set to fixed
  • Status changed from new to closed