Opened 3 years ago

Closed 3 years ago

#2551 closed help (fixed)

Suite suddenly running out of time

Reported by: charlie
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: NEXCS
UM Version: 10.7



This question relates to ticket #2546, although I have experienced this error before and never managed to resolve it.

=>> PBS: job killed: walltime 86487 exceeded limit 86400

For unknown reasons, a suite that previously ran fine (u-aw739) is now running out of wall clock time. The new suite is u-az608. I have made a small change to the aerosols going into this suite, but other than this nothing has changed since u-aw739. So why is it taking slightly more (i.e. 87 seconds) than it should? Before, it was completing its cycle of 4 years within ~20 hours so significantly under the 24 hour limit. Why, now, with just a change to the ancillary files, is it so much slower? The same thing happened a couple of weeks ago when I tried to run u-aw739 but with elevated CO2 - again, only one tiny change, but again it ran out of time.



Change History (6)

comment:1 Changed 3 years ago by willie

Hi Charlie,

How long did u-aw739 take? If it's very close to the wall time there could be trouble. If other users are hammering the disk, then your job will have to wait and wait until there is a gap. I normally would add 20% to the actual run time to allow for this.
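As a rough illustration of that rule of thumb, here is a minimal sketch using the figures quoted in this ticket (the PBS walltime numbers and the ~20 hour run time). The function name `padded_walltime` and the constants are made up for the example; they are not part of PBS, Rose or the UM.

```python
# Figures quoted in the ticket.
WALLTIME_LIMIT_S = 86400   # limit reported by PBS (24 hours)
WALLTIME_USED_S = 86487    # walltime at which PBS killed the job


def padded_walltime(run_seconds, margin_percent=20):
    """Return run_seconds plus a safety margin, rounded up to whole seconds.

    Integer arithmetic is used so the result is exact.
    """
    return (run_seconds * (100 + margin_percent) + 99) // 100


overrun_s = WALLTIME_USED_S - WALLTIME_LIMIT_S   # 87 seconds over the limit
previous_run_s = 20 * 3600                       # "~20 hours" for u-aw739
suggested_limit_s = padded_walltime(previous_run_s)

print(f"Overrun: {overrun_s} s")
print(f"Run time plus 20%: {suggested_limit_s} s "
      f"({suggested_limit_s / 3600:.1f} h)")
```

With these numbers, a 20 hour run plus 20% comes to exactly 24 hours, so a four-year cycle leaves no spare headroom in the 24 hour queue once the margin is included.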


comment:2 Changed 3 years ago by charlie

Hi Willie,

Originally, by which I mean about 2 months ago, u-aw739 took just under 20 hours to do its first complete cycle (4 model years of output, plus postprocessing and archiving). So well within the limit of 24 hours. However, I haven't rerun this recently, so it's possible that, due to heavier usage now, if I was to run this again it would also be over the limit. If I check my most recent suite, u-az608, it does actually produce all the output it's meant to (i.e. all 4 years) but then fails literally a few seconds later, before it moves on to the postprocessing and archiving stage.

It strikes me there are 2 possibilities: a) either it's now taking longer because of heavier usage (and therefore u-aw739 would also take longer and would also fail), or b) it's taking longer because of the modifications I made to the aerosol emissions files used in u-az608.

I guess there are 2 ways of finding this out: either run u-aw739 again and see if it still completes a cycle within 20 hours (which would discount a)), or run u-az608 again but with a shorter runtime, e.g. only requesting a cycle of 2 years, and see exactly how long it takes.

Which do you think would be the best way forward?


comment:3 Changed 3 years ago by charlie

Hi again,

Further to this, I have now tried running this suite for just 3 years (with 1 year cycling, still in the 24 hour queue), and the good news is that it seems to have worked correctly. So the aforementioned problem must have been just to do with heavier usage, meaning my suite was just taking slightly too long when running with four-year cycling in the 24 hour queue.

Please can you advise on where I need to look to find out exactly how long a one year cycle takes, not just the atmos_main but the entire cycle (i.e. including postprocessing and archiving)?



comment:4 Changed 3 years ago by ros

Hi Charlie,

The time taken for each part of the cycle is found at the bottom of the relevant task's job.out file.

The time limits set for the compile, atmos_main, post-processing and transfer tasks can be (and usually are) all different. You don't need to set the limit for the entire cycle.
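If it helps, the ends of all the job.out files can be gathered in one go. The following is only a sketch: it assumes the job logs sit under `~/cylc-run/<suite>/log/job` (an assumption, not something stated in this ticket), and the helper names `tail` and `job_out_tails` are made up for the example. It does not parse the timing lines, since their exact format is not shown here; it just prints the last lines of each file.

```python
from pathlib import Path


def tail(path, n=20):
    """Return the last n lines of a text file as a list of strings."""
    with open(path, errors="replace") as handle:
        return handle.readlines()[-n:]


def job_out_tails(log_dir, n=20):
    """Map every job.out file found under log_dir to its last n lines."""
    root = Path(log_dir)
    return {str(p): tail(p, n) for p in sorted(root.rglob("job.out"))}


if __name__ == "__main__":
    # Assumed location of the job logs; adjust to match your own suite.
    log_dir = Path.home() / "cylc-run" / "u-az608" / "log" / "job"
    if log_dir.is_dir():
        for name, lines in job_out_tails(log_dir).items():
            print(f"==> {name} <==")
            print("".join(lines), end="")
    else:
        print(f"No job log directory found at {log_dir}")
```

Adding up the times reported for atmos_main, post-processing and archiving in one cycle then gives the total for that cycle.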


comment:5 Changed 3 years ago by charlie


comment:6 Changed 3 years ago by charlie

  • Resolution set to fixed
  • Status changed from new to closed