Opened 9 months ago

Closed 9 months ago

#2727 closed help (answered)

Recon taking hours on the queue

Reported by: cbellisario
Owned by: um_support
Component: UM Model
Keywords: queue
Cc:
Platform: NEXCS
UM Version:

Description

Dear NCAS team,

I was wondering about the queuing time of the model on NEXCS (2), which has increased a lot recently.
Since early December, short runs (5-day runs to check code modifications) that used to take 30 minutes to 1 hour from start to finish now wait at least 4 hours before "recon" starts running.
(Suites u-bd497 and u-be515 are currently running, for example.)

Is this normal and due to a larger demand for computing time?
Or has something changed?

If so, would it be possible to know the peak queuing times so that I can arrange my daily schedule accordingly?

Thank you in advance,

Best regards,

Christophe

Change History (8)

comment:1 Changed 9 months ago by grenville

Christophe

You can run the reconfiguration in the shared queue - please try changing the RCF_RESOURCE section to

    [[RCF_RESOURCE]]
        inherit = HPC
        [[[directives]]]
            -l select={{NODE_RCF}}:ncpus={{APPN}}:coretype={{CORE}}
            -q=shared
        [[[environment]]]
            ROSE_LAUNCHER_PREOPTS = -ss -n {{TASKS_RCF}} -N {{2*(TPNUMA_RCF|int)}} -S {{TPNUMA_RCF}} -d {{OMPTHR_RCF}} -j {{HYPTHR_RCF}}
            ROSE_LAUNCHER_ULIMIT_OPTS = -s unlimited -c unlimited
            ROSE_LAUNCHER = aprun
        [[[job]]]
            execution time limit = PT15M

This will only work if your reconfiguration fits on one node, which is the case here.
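
As a rough illustration of how the ROSE_LAUNCHER_PREOPTS template expands, and of one way to reason about the one-node condition, here is a minimal Python sketch. The function names and the numeric values are hypothetical and are not taken from either suite. The one-node test assumes that the value passed to -N (tasks per node) is the limit that matters, which may not hold for every configuration.

```python
def aprun_preopts(tasks_rcf, tpnuma_rcf, ompthr_rcf, hypthr_rcf):
    """Expand the ROSE_LAUNCHER_PREOPTS template for the given values."""
    tasks_per_node = 2 * int(tpnuma_rcf)  # mirrors {{2*(TPNUMA_RCF|int)}}
    return (
        f"-ss -n {tasks_rcf} -N {tasks_per_node} "
        f"-S {tpnuma_rcf} -d {ompthr_rcf} -j {hypthr_rcf}"
    )


def fits_on_one_node(tasks_rcf, tpnuma_rcf):
    """Assumption: one node is enough when total tasks <= tasks per node."""
    return int(tasks_rcf) <= 2 * int(tpnuma_rcf)


# Hypothetical values, not taken from u-be515 or u-bd497.
print(aprun_preopts(8, 4, 1, 1))
print(fits_on_one_node(8, 4))
```

If fits_on_one_node returns False for the values in your suite, the shared queue is probably not an option for that task.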

Grenville


comment:2 Changed 9 months ago by cbellisario

Dear Grenville,

I applied the changes (to suites u-be515 and u-bd497), but the reconfiguration still queues for a few hours before it runs.

Christophe

comment:3 Changed 9 months ago by cbellisario

Hi Grenville,

this is Simon & Christophe…

Given that Christophe has already reconfigured the initial file, we think he can speed things up by not rerunning the reconfiguration. In the days of the UMUI you would do this by turning off reconfiguration and setting the start file to the .astart/.ostart file produced by the reconfiguration.

With ROSIE neither of us can see how to do that. We can see where AINITIAL is defined, but there does not appear to be a .astart file in the HISTORY_DATA directory or a way of specifying the initial file for the atmosphere model. We can see how to turn off reconfiguration.

The suite we are working with is: u-be515
Two questions:
1) Where is the .astart file (the output from reconfiguration) after the model has run?
2) How do we specify the initial file for the atmosphere model? We think it is at um→namelist→Reconfig and ..→ General technical … (at least, that is where AINITIAL appears to be used).

A more general question: reconfiguration is a very quick process, taking < 1 min, but the queue time is ages. Back in November the queue time for reconfiguration was minutes, but now it is hours. Is there an equivalent of the ARCHER serial queue, which is restricted to one-node jobs?

We would appreciate a rapid answer, as Christophe will be working remotely from Paris from Thursday PM.

Simon & Christophe

comment:4 Changed 9 months ago by ros

Hi Christophe, Simon,

The name of the atmosphere start dump is specified in the variable astart in the panel "um → namelist → Model input and output → dumping and meaning". This variable also serves as the output file name for the reconfiguration. You've already run the reconfiguration for this suite, so all you need to do is turn off reconfiguration and just run the model. The suite will pick up the file already produced by the reconfiguration, i.e. /home/d04/chrbe/cylc-run/u-be515/share/data/History_Data/be515a.da19880901_00
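
Before switching reconfiguration off, it may be worth confirming that the start dump really is in place. A minimal Python sketch, assuming the path quoted above is where the file lives; the helper name start_dump_ready is hypothetical, and the test only checks that a non-empty file exists, not that it is a valid UM dump:

```python
import os


def start_dump_ready(path):
    """Return True if path is an existing, non-empty regular file."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


# Path as quoted in this ticket; adjust it for your own suite.
dump = "/home/d04/chrbe/cylc-run/u-be515/share/data/History_Data/be515a.da19880901_00"
print(start_dump_ready(dump))
```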

The last reconfiguration run you did for u-be515 was still submitted to the normal queue rather than the shared one that Grenville gave instructions for. If you're going to run reconfiguration again, please double-check the changes you made. The shared queue is the equivalent of the ARCHER serial queue.
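
One way to double-check is to scan the text of a suite .rc file and see whether the RCF_RESOURCE section actually carries the shared-queue directive. The following is a minimal Python sketch, not a full suite.rc parser: it assumes the section header is written as [[RCF_RESOURCE]] and the directive as -q=shared (spaces around the = are allowed). It ignores Jinja2 logic and include files, and the function name is hypothetical.

```python
import re


def rcf_uses_shared_queue(rc_text, section="RCF_RESOURCE"):
    """Return True if the named [[section]] contains a '-q=shared' line."""
    in_section = False
    for raw in rc_text.splitlines():
        line = raw.strip()
        # A two-bracket header such as [[NAME]] opens a new task section;
        # three-bracket subsections like [[[directives]]] do not match.
        header = re.fullmatch(r"\[\[([^\[\]]+)\]\]", line)
        if header:
            in_section = header.group(1).strip() == section
            continue
        if in_section and re.fullmatch(r"-q\s*=\s*shared", line):
            return True
    return False


example = """
[[RCF_RESOURCE]]
    inherit = HPC
    [[[directives]]]
        -q=shared
"""
print(rcf_uses_shared_queue(example))
```

Running the same check over each .rc file the suite can load would show which of them has been edited.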

Regards,
Ros.

comment:5 Changed 9 months ago by cbellisario

Dear Ros,

Thank you for your answer about the reconfiguration and astart variable.

Regarding the configuration, I applied the changes suggested by Grenville to meto_cray.rc, as I thought this was the file in use. Should I make the changes in every file, or only in MONSooN.rc?

Best regards,

Christophe

comment:6 Changed 9 months ago by ros

Hi Christophe,

Sorry for the delay, you've probably figured this out by now, but yes the changes Grenville suggested should be made to the MONSooN.rc file.

Regards,
Ros.

comment:7 Changed 9 months ago by cbellisario

Dear Ros,

Actually, since last Monday the model has been back to its ~45 min run time, as it was before December, so I do not mind running the reconfiguration again.
As I still get differences in model output between my classic run and my modified runs, I preferred to keep the reconfiguration switched on and make changes in small steps.

Thank you again for your help,

Best regards,
Christophe

comment:8 Changed 9 months ago by grenville

  • Resolution set to answered
  • Status changed from new to closed