Opened 9 months ago

Closed 8 months ago

#2818 closed help (answered)

Best configuration for cycling time, wallclock time, total run length and pptransfer.

Reported by: cbellisario Owned by: ros
Component: UM Model Keywords: cycling time, wallclock time, run length, pptransfer
Cc: Platform: NEXCS
UM Version:

Description

Dear team,
To start with, I have a “philosophical” question about the best configuration for cycling time, wallclock time and total run length.
As I am currently trying to run a whole year (still having trouble, see http://cms.ncas.ac.uk/ticket/2816#ticket), I have set:

  • Total run length = P1Y
  • Cycling frequency = P1M
  • Wallclock time = PT3H

Should the cycling frequency be optimised for the run length? Is a cycling frequency of P1M good for a run length of P1Y? I was also concerned about the wallclock time, as I first thought that it was related to the total run length, so I set it to PT20H, but the run failed to submit. So I assume that the wallclock time relates to the duration of one cycle?
I also intend to run for 10 years. Should I use a cycling frequency of P1Y instead of P1M?
I raise these questions because, due to a problem with the JASMIN servers, the run was stuck in the post-processing step (pptransfer), which stopped the UM runs.
This leads to the next question.
When JASMIN (and therefore the post-processing) was not working, the UM in the following cycle ran, but not the one after. Is there a setting that prevents the UM in a specific cycle from running depending on the results of previous tasks? If so, is there a way to force the UM to run even if the previous post-processing has not been performed?
My last question concerns pptransfer failing on JASMIN, which was related to parallel writing. Does pptransfer use parallel writing?

Thank you for your answers,
Best regards,

Christophe

Change History (2)

comment:1 Changed 9 months ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Christophe,

Firstly, the wallclock time applies to the atmos_main task for a single cycle.

So you need to find out how long 1 model month takes; you can see this at the end of the job.out file. Then calculate how many months fit within the queue's wallclock limit. For example, if you're running in the normal queue on NEXCS (maximum wallclock 4 hours) and 1 model month takes 1 hour of wallclock time, then you'd set the cycling frequency to P3M, leaving some headroom for slight fluctuations in run time.
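The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names and the `headroom` safety factor of 0.8 are assumptions made here, not part of the UM, Rose or cylc; choose a margin that matches the run-time variation you actually see in job.out.

```python
import math


def months_per_cycle(queue_limit_hours, hours_per_model_month, headroom=0.8):
    """Whole model months that fit in one cycle's wallclock budget.

    headroom is an assumed safety factor (not from the ticket): only this
    fraction of the queue limit is treated as usable run time.
    """
    if hours_per_model_month <= 0:
        raise ValueError("hours_per_model_month must be positive")
    usable_hours = queue_limit_hours * headroom
    months = math.floor(usable_hours / hours_per_model_month)
    if months < 1:
        raise ValueError("one model month does not fit in the queue limit")
    return months


def cycling_frequency(queue_limit_hours, hours_per_model_month, headroom=0.8):
    """ISO 8601 duration string for the cycling frequency, e.g. 'P3M'."""
    months = months_per_cycle(queue_limit_hours, hours_per_model_month, headroom)
    return f"P{months}M"


def cycles_needed(run_length_months, months_in_cycle):
    """Number of cycles required to cover the total run length."""
    return math.ceil(run_length_months / months_in_cycle)


# Example from the reply: 4-hour queue limit, 1 hour per model month.
print(cycling_frequency(4, 1))   # P3M
print(cycles_needed(12, 3))      # 4 cycles for a P1Y run
print(cycles_needed(120, 3))     # 40 cycles for a 10-year run
```

With these numbers a 10-year run would still cycle every 3 months; a P1Y cycle would only fit if a whole model year completed within the queue limit.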

By default, cylc allows a suite to have a maximum of 3 cycles active at one time; this prevents tasks from running ahead out of control and piling up data. Thus, if a task in a cycle fails and you don't fix it, the model will only run for another 2 cycles. You can increase this limit if you need to by adding max active cycle points to the [scheduling] section of the suite.rc file:

[scheduling]
    ...
    final cycle point   = +{{RUNLEN}}-PT1S
    max active cycle points = 6

I would NOT recommend going above 6; otherwise data will pile up on the /projects disk, and the model will crash if it runs out of disk space.
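As a rough illustration of the limit described above, the following Python sketch counts how many further cycles can run once a task in one cycle has failed and been left unfixed. It is a deliberate simplification of the behaviour stated in this reply, not a model of the cylc scheduler itself, and the function names are hypothetical.

```python
def cycles_after_failure(max_active_cycle_points=3):
    """Cycles that can still run after the cycle containing the failed task.

    The failed cycle stays active and occupies one slot, so the remaining
    slots are the number of later cycles that can proceed.
    """
    if max_active_cycle_points < 1:
        raise ValueError("max_active_cycle_points must be at least 1")
    return max_active_cycle_points - 1


def last_cycle_to_run(failed_cycle_index, max_active_cycle_points=3):
    """Index of the last cycle that can run, counting cycles from 0."""
    return failed_cycle_index + cycles_after_failure(max_active_cycle_points)


print(cycles_after_failure())      # 2 with the default limit of 3
print(cycles_after_failure(6))     # 5 with the suggested maximum of 6
print(last_cycle_to_run(4, 6))     # a failure in cycle 4 lets cycles up to 9 run
```

Each extra active cycle also means another cycle's worth of output waiting on disk, which is why raising the limit trades stall tolerance against disk usage.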

If you are writing to a "no parallel write" JASMIN GWS and using gridftp then you will need a fix to pptransfer to stop it doing parallel writes.

Please include the branch:

fcm:moci.xm-br/dev/rosalynhatcher/postproc_2.2_pptransfer_gridftp_nopw@3209

in the fcm_make_pp sources.

Regards,
Ros.

comment:2 Changed 8 months ago by ros

  • Platform set to NEXCS
  • Resolution set to answered
  • Status changed from accepted to closed