#3050 closed help (fixed)

Slow progress in NEXCS queue

Reported by: apm
Owned by: um_support
Component: NEMO/CICE
Keywords: queue, latency, CMIP6
Cc: julien.palmieri@…, axy@…, colin.jones@…
Platform: NEXCS
UM Version:

Description

I am running the ¼° OMIP contribution to CMIP6 on NEXCS under the nexus-n05 account as Rose suite u-bf847. Unfortunately I am getting extremely slow turnaround: once a monthly cycle of the model has started, it only takes 45 minutes, but in practice throughput is only a half or a third of this because the job tasks (both model and postprocessing) can sit in the queue for several hours. I moved this suite over from the Monsoon platform a month or so ago for this very reason, but I have not found much of an improvement.

This is a high-priority model integration, but at this rate it will not be complete by the CMIP6 deadline at the end of this year. Is there any way to speed up its progress through the system? Could it be submitted to a higher priority queue?

Thank you,

Alex

Change History (9)

comment:1 Changed 11 months ago by willie

Hi Alex,

You could change the cycling frequency from 1 month to 3 months. This would reduce the amount of queuing by a factor of three. The three-month cycle would take roughly 3 x 45 minutes, which is within the queue limit.

Willie

comment:2 Changed 11 months ago by apm

Thanks, Willie.

I did try running with three-month cycles a while ago, but it crashed at the first restart - if I remember correctly, the timestepping had somehow got confused. I'll try again, and let you know what happens.

Regards,

Alex

comment:3 Changed 11 months ago by willie

Hi Alex,

With the current configuration, 213 years in one-month cycles, each taking 45 minutes, amounts to 86.6 days of solid computing. If started now it would finish on 18/Jan/2020, assuming no queuing at all. Allowing a one-hour queue per cycle would add a further delay of 213 x 12 hours, or about 107 days. Currently the queuing is about 12 hours. So there is no way the result can be achieved in the desired time. Even with my suggested tripling of the cycle length, the deadline could not be met.
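
The estimate above can be reproduced with a rough sketch like the following. It assumes only the figures quoted in this ticket (213 years, 45 minutes of run time per model month, one hour of queuing per cycle); the function name and parameters are illustrative and are not part of the suite. With these inputs the compute time comes out at about 79.9 days, a little lower than the 86.6 days quoted, which may include overheads not stated here. The queuing delay is 106.5 days for one-month cycles and 35.5 days for three-month cycles.

```python
def estimate_days(years, months_per_cycle,
                  run_minutes_per_month=45.0,
                  queue_hours_per_cycle=1.0):
    """Return (compute_days, queue_days) for a run split into fixed-length cycles.

    All inputs are assumptions taken from the figures quoted in the ticket.
    """
    total_months = years * 12
    cycles = total_months / months_per_cycle
    # Run time depends on the total simulated months, not on the cycle length.
    compute_days = total_months * run_minutes_per_month / (60 * 24)
    # Queuing is paid once per cycle, so longer cycles mean fewer waits.
    queue_days = cycles * queue_hours_per_cycle / 24
    return compute_days, queue_days


if __name__ == "__main__":
    for months_per_cycle in (1, 3):
        compute, queue = estimate_days(213, months_per_cycle)
        print(f"{months_per_cycle}-month cycles: "
              f"{compute:.1f} days computing, {queue:.1f} days queuing")
```

Passing queue_hours_per_cycle=12 shows the effect of the queuing times currently being observed.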

There are two problems here. One is that the account is overdrawn by a considerable margin, which is one reason why the queuing is so long. The other is that the model is using 52 nodes. This may not be the optimum configuration, but some work would be needed to find the optimum.

Willie

comment:4 Changed 11 months ago by apm

Thanks again, Willie.

The job appears to be running fine now with three-month cycles, although it's still too early to verify any consistent improvement. As I said, I did try this before, but the failures I encountered on that occasion convinced me that changing the cycle basis during a run was not a trivial thing, though I was clearly wrong. I have had a quick look at the archived files, and the annual means appear to be being generated correctly.

What do you mean that the account is overdrawn? Which account is this - do you mean nexcs-05 or omip? What timescale is the charging made on? I will need to speak to the appropriate person about the possibility of changing the account the run is charged to.

Alex

Last edited 11 months ago by apm

comment:5 Changed 11 months ago by willie

Hi Alex,

Apparently, the account nexcs-05 has had more than its "fair share" - see https://collab.metoffice.gov.uk/twiki/bin/view/Support/MONSooNFAQs#What%20affects%20the%20prioritisation

Willie

comment:6 Changed 11 months ago by apm

I don't see the nexcs-05 account in this list in the document you linked to:

https://collab.metoffice.gov.uk/twiki/bin/viewfile/Static/SystemMonitoring/Reports/fairshare_boost.monsoon2.txt

Am I looking in the wrong place?

I have used nexcs-n01 in the past - would it be worth switching to this account?

Alex

comment:7 Changed 11 months ago by willie

Alex,

The NEXCS document is at https://collab.metoffice.gov.uk/twiki/bin/viewfile/Static/SystemMonitoring/Reports/fairshare_boost.nexcs.txt

nexcs-n01 is even more negative, so no.

Willie

comment:8 Changed 11 months ago by apm

OK, thanks.

The run with three-month cycling seems to have quite a low latency right now, with the model restarting within half an hour of submission in the last four jobs. If I have persistent trouble again I will investigate the possibility of running under a different account.

Regards,

Alex

comment:9 Changed 11 months ago by willie

  • Resolution set to fixed
  • Status changed from new to closed