Opened 12 years ago

Closed 10 years ago

#314 closed help (fixed)

Queue time for HadGEM2-AL60 on HECTOR

Reported by: abozzo Owned by: lois
Component: UM Model Keywords: HadGEM2-AL60 queue
Cc: Platform:
UM Version: 6.1

Description

Hi,

I'm running HadGEM2-AL60 (package from the Met Office) on HECToR and I have to set up a 20-year run. I would like to run one continuation run per model year, but currently, using 64 processors, the job's wall time for one month is roughly 2h15m, so 12h is not enough for a one-year run.
Is there any queue dedicated to HadGEM2 for long runs?

Thanks,
Regards
Alessio

Change History (5)

comment:1 Changed 12 years ago by lois

  • Owner changed from um_support to lois
  • Status changed from new to accepted

Hello Alessio, the real answer to your question is no: HECToR has a maximum time limit of 12 hours on all the queues. So the question now is whether this performance can be improved or whether you will have to run in 6-month chunks.

In general, the goal of most climate modelling groups is to be able to simulate at 1000 times real time, and from your figures this model setup is far from that (about 320). Because the UM uses a 2D parallel decomposition, even if the standard UM HadGEM-A achieved 1000 times, the performance decreases linearly with increasing vertical extent. You might therefore hope for nearer 660 (your month might take about half the time you report), but you would still be struggling to get 1 year into the 12-hour queue. One way to improve the performance of your setup could be to use more processors (128), but N96 doesn't really scale much beyond this. Another way would be to reduce the I/O, as this seriously inhibits performance on HECToR: you could reduce the frequency of dumps and remove unwanted STASH output.
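The arithmetic behind these figures can be sketched as follows. This is only a rough illustration: the 30-day model month and the helper names are assumptions, not something taken from the job itself.

```python
# Rough throughput arithmetic for the figures quoted above.
# Assumption: one model month is 30 days (720 simulated hours).
HOURS_PER_MODEL_MONTH = 30 * 24
QUEUE_LIMIT_HOURS = 12.0  # maximum wall time on the HECToR queues


def speedup(wall_hours_per_month):
    """Simulated time divided by wall-clock time ("times reality")."""
    return HOURS_PER_MODEL_MONTH / wall_hours_per_month


def months_per_job(wall_hours_per_month, limit_hours=QUEUE_LIMIT_HOURS):
    """Whole model months that fit inside one queue slot."""
    return int(limit_hours // wall_hours_per_month)


reported = 2.25  # 2h15m of wall time per model month on 64 processors
print(speedup(reported))             # 320.0
print(months_per_job(reported))      # 5
print(months_per_job(reported / 2))  # 10
```

Under these assumptions, even halving the wall time per month leaves a full year out of reach of a single 12-hour job, and a 6-month chunk needs a model month to take 2 hours or less.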

Realistically, I think you should be looking at running your 20-year run in 6-month chunks.

Lois

comment:2 follow-up: Changed 12 years ago by abozzo

Hi Lois,

Great, many thanks for your answers (and for the correct formulation of the question I should have asked)! I'll play with the I/O (which is probably excessive at the moment) and see how far I can get.

Thanks again
Alessio

comment:3 in reply to: ↑ 2 Changed 12 years ago by abozzo

Hi Lois,

I've been trying to improve the model performance. I reduced the dump frequency (one dump every 15 days) and the amount of diagnostics. With 96 processors (4x24) the wall time is now 1h45min for one month. I also ran some quick checks on the output, looking for differences due to changes in the number of processors. I wanted to try a model run with 128 processors, and I found that with 8 processors in the E-W direction and 16 in the N-S direction the output differs from that computed with 72 processors (4x18) and 96 processors (4x24). The 96- and 72-processor versions don't show any difference.

Please find below a quick check of the energy balance as reported in the leave file.
The numbers in the runs with 128 and 64 processors are the same (8 processors E-W), and they differ from those found in the other two jobs (4 processors E-W). Is that related to the number of E-W processors? That number should be as low as possible; does that mean that I should try not to exceed 4 processors?

Many Thanks,
Best Wishes
Alessio

128 processors 8x16 ; wall time 90min/month
grep "ENERGY " xefsf000.xefsf.d09239.t084756.leave | tail -5
ERROR IN ENERGY BUDGET = 0.60533E+20 J/
FINAL TOTAL ENERGY = 0.13071E+25 J/
INITIAL TOTAL ENERGY = 0.13072E+25 J/
CHG IN TOTAL ENERGY OVER Per = -0.10445E+21 J/
ERROR IN ENERGY BUDGET = 0.88339E+20 J/

96 processors 4x24 ; wall time 105min/month
grep "ENERGY " xefse000.xefse.d09238.t141455.leave | tail -5
ERROR IN ENERGY BUDGET = 0.91508E+20 J/
FINAL TOTAL ENERGY = 0.13066E+25 J/
INITIAL TOTAL ENERGY = 0.13069E+25 J/
CHG IN TOTAL ENERGY OVER Per = -0.22667E+21 J/
ERROR IN ENERGY BUDGET = 0.64148E+20 J/

72 processors 4x18 ; wall time 135min/month
grep "ENERGY " xefsf000.xefsf.d09238.t170611.leave | tail -5
ERROR IN ENERGY BUDGET = 0.91508E+20 J/
FINAL TOTAL ENERGY = 0.13066E+25 J/
INITIAL TOTAL ENERGY = 0.13069E+25 J/
CHG IN TOTAL ENERGY OVER Per = -0.22667E+21 J/
ERROR IN ENERGY BUDGET = 0.64148E+20 J/


64 processors 8x8 ; wall time 135min/month
grep "ENERGY " xefse000.xefse.d09237.t230248.leave | tail -5
ERROR IN ENERGY BUDGET = 0.60533E+20 J/
FINAL TOTAL ENERGY = 0.13071E+25 J/
INITIAL TOTAL ENERGY = 0.13072E+25 J/
CHG IN TOTAL ENERGY OVER Per = -0.10445E+21 J/
ERROR IN ENERGY BUDGET = 0.88339E+20 J/
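The comparison above can also be done mechanically. The following is a minimal sketch, assuming the energy lines have the `LABEL = value J/` layout shown in the grep output; the function names and the run labels are hypothetical. It groups runs whose printed diagnostics are identical.

```python
import re

# Matches lines such as "ERROR IN ENERGY BUDGET = 0.60533E+20 J/".
ENERGY_LINE = re.compile(
    r"^\s*(?P<label>[A-Za-z ]+?)\s*=\s*(?P<value>[-+]?\d*\.\d+E[-+]?\d+)"
)


def parse_energy(text):
    """Return (label, value-as-printed) pairs from grep-style energy lines."""
    pairs = []
    for line in text.splitlines():
        m = ENERGY_LINE.match(line)
        if m:
            pairs.append((m.group("label"), m.group("value")))
    return pairs


def group_identical(runs):
    """Group run names whose energy diagnostics match exactly as printed."""
    groups = {}
    for name, text in runs.items():
        groups.setdefault(tuple(parse_energy(text)), []).append(name)
    return list(groups.values())


# Values copied from the leave-file extracts above.
EW8 = """\
ERROR IN ENERGY BUDGET = 0.60533E+20 J/
FINAL TOTAL ENERGY = 0.13071E+25 J/
INITIAL TOTAL ENERGY = 0.13072E+25 J/
CHG IN TOTAL ENERGY OVER Per = -0.10445E+21 J/
ERROR IN ENERGY BUDGET = 0.88339E+20 J/
"""
EW4 = """\
ERROR IN ENERGY BUDGET = 0.91508E+20 J/
FINAL TOTAL ENERGY = 0.13066E+25 J/
INITIAL TOTAL ENERGY = 0.13069E+25 J/
CHG IN TOTAL ENERGY OVER Per = -0.22667E+21 J/
ERROR IN ENERGY BUDGET = 0.64148E+20 J/
"""

runs = {"8x16": EW8, "4x24": EW4, "4x18": EW4, "8x8": EW8}
print(group_identical(runs))  # [['8x16', '8x8'], ['4x24', '4x18']]
```

The values are compared as printed, to five significant figures, so agreement here suggests but does not prove that two runs are bit identical.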

comment:4 Changed 12 years ago by lois

Hello Alessio

From your description, it seems that this HadGEM2-AL60 job does not bit compare across different numbers of processors on HECToR. You may want to ask the Met Office whether it did on the NEC, or whether bit compatibility was ever checked for this particular job.

When we installed UM vn6.1 on HECToR, we took a standard HadGEM1 job and tested that it bit compared across restarts and on different numbers of processors, by looking at the dump at the end of a run of a few days. Of course this is not a perfect test, and it does not guarantee that all jobs using vn6.1 will also bit compare.
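As an illustration of what such a check amounts to, here is a minimal sketch of a byte-for-byte comparison of two files. It is a generic whole-file check, not the UM's own comparison tooling, and the function names are hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large dumps do not fill memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def bit_compare(path_a, path_b):
    """True when the two files are byte-for-byte identical."""
    return file_digest(path_a) == file_digest(path_b)
```

A whole-file comparison is stricter than comparing the model fields alone: if the files carry any run-specific metadata, it could report a difference even when the fields themselves agree.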

Your Met Office job has a large number of changes with respect to our test HadGEM1 job, and any of these changes could make your job no longer bit compare. Resolving this bit-compatibility issue, if that is what you decide to do, is quite a challenge. Alternatively, you could choose not to bother, keep to one processor configuration, and always run your control and changed experiments on that configuration. Unfortunately, there is no definitive answer to your question.

Lois

comment:5 Changed 10 years ago by lois

  • Resolution set to fixed
  • Status changed from accepted to closed