Opened 9 years ago

Closed 8 years ago

#525 closed help (fixed)

Variable performance of 4.5.1 on HECToR XT6

Reported by: aschurer Owned by: um_support
Component: HECToR Keywords:
Cc: Platform:
UM Version:

Description

I have just started to run several different HadCM3 jobs on HECToR phase 2b and have noticed that their speed is very variable.
So far I have run 6 very similar jobs and have got average times to calculate a model month varying from about 6 mins to 9 mins 50 secs, all running on 24 processors (4 N-S and 6 E-W):

Details of jobs:

xfhx#g
21353 sec for 59 files average 6:01 mm:ss

first and last files:

-rw-r--r-- 1 aschurer ecdf_geosciences 16209352 Oct 12 22:25 xfhxga@…

-rw-r--r-- 1 aschurer ecdf_geosciences 16209352 Oct 13 04:21 xfhxga@…

xfhx#h
29271 sec for 59 files average 8:16 mm:ss

first and last files:

-rw-r--r-- 1 aschurer ecdf_geosciences 16209352 Oct 13 12:36 xfhxha@…

-rw-r--r-- 1 aschurer ecdf_geosciences 16209352 Oct 13 20:43 xfhxha@…

xfhv#e
9451 sec for 16 files average 9:50 mm:ss

first and last files:

-rw-r--r-- 1 aschurer ecdf_geosciences 16209352 Oct 13 14:50 xfhvea@…

-rw-r--r-- 1 aschurer ecdf_geosciences 16209352 Oct 13 17:27 xfhvea@…

xfhv#f
9133 sec for 18 files average 8:27 mm:ss

first and last files:

-rw-r--r-- 1 aschurer ecdf_geosciences 16209352 Oct 13 15:55 xfhvfa@…

-rw-r--r-- 1 aschurer ecdf_geosciences 16209352 Oct 13 18:27 xfhvfa@…

xfhv#g
9319 sec for 21 files average 7:23 mm:ss

first and last files:

-rw-r--r-- 1 aschurer ecdf_geosciences 16209352 Oct 13 16:51 xfhvga@…

-rw-r--r-- 1 aschurer ecdf_geosciences 16209352 Oct 13 19:26 xfhvga@…

xfhv#h
9417 sec for 23 files average 6:49 mm:ss

first and last files:

-rw-r--r-- 1 aschurer ecdf_geosciences 16209352 Oct 13 18:02 xfhvha@…

-rw-r--r-- 1 aschurer ecdf_geosciences 16209352 Oct 13 20:39 xfhvha@…
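For reference, the per-month averages quoted for each job can be reproduced from the total wall time and file count given in the listings above (a quick sketch; the mm:ss values appear to be truncated rather than rounded):

```python
# Totals copied from the job listings above: (total seconds, number of
# monthly dump files). Job names are as quoted in the ticket.
jobs = {
    "xfhx#g": (21353, 59),
    "xfhx#h": (29271, 59),
    "xfhv#e": (9451, 16),
    "xfhv#f": (9133, 18),
    "xfhv#g": (9319, 21),
    "xfhv#h": (9417, 23),
}

def avg_mm_ss(total_sec, n_files):
    """Average seconds per model month, truncated and shown as mm:ss."""
    avg = total_sec // n_files
    return f"{avg // 60}:{avg % 60:02d}"

for name, (sec, files) in jobs.items():
    print(name, avg_mm_ss(sec, files))
```

Running this reproduces the quoted averages (e.g. xfhx#h gives 8:16 and xfhv#e gives 9:50), confirming the figures in the listings.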

Looking at individual runs, there is a vast spread of actual times, e.g.:

-rw-r--r-- 1 aschurer ecdf_geosciences 16M Oct 13 18:02 xfhvha@…

-rw-r--r-- 1 aschurer ecdf_geosciences 16M Oct 13 18:08 xfhvha@…

-rw-r--r-- 1 aschurer ecdf_geosciences 16M Oct 13 18:17 xfhvha@…

-rw-r--r-- 1 aschurer ecdf_geosciences 16M Oct 13 18:27 xfhvha@…

-rw-r--r-- 1 aschurer ecdf_geosciences 16M Oct 13 18:35 xfhvha@…

i.e. 6 mins, 9 mins, 10 mins, 8 mins…
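The per-file intervals can be computed directly from the modification times in the ls listing above (a sketch; the timestamps are copied verbatim from the listing, and the year is irrelevant for same-day differences):

```python
# Compute the gap, in minutes, between successive monthly dump files
# using the mtimes shown in the ls listing above.
from datetime import datetime

stamps = ["Oct 13 18:02", "Oct 13 18:08", "Oct 13 18:17",
          "Oct 13 18:27", "Oct 13 18:35"]
times = [datetime.strptime(s, "%b %d %H:%M") for s in stamps]

# Minutes elapsed between each pair of consecutive files.
gaps = [(b - a).seconds // 60 for a, b in zip(times, times[1:])]
print(gaps)
```

This prints the minutes per model month between successive files, making the run-to-run spread easy to see at a glance.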

Is this variance in performance expected? And if so, is there anything I can do to improve the speed of my runs?

Thanks, Andrew

Change History (2)

comment:1 Changed 9 years ago by lois

Hello Andrew,

In this period of severe disruption, this level of performance variance is not unexpected. On the XT4 we have measured system jitter (variable performance) at about 10% on average, and I/O jitter at sometimes as high as 40% for very data-intensive runs when many such runs are in the system. At the moment many people are transferring data from the XT4 to the XT6, so we can expect I/O jitter to be high.

We have been exploring performance issues for the very high-resolution runs, and so we have an I/O server system which mitigates I/O jitter, but unfortunately not for UM 4.5. We have also been looking at the default settings of the MPI environment variables, as there is a considerable difference between what we were using on the XT4 and what we should perhaps be using on the XT6. There is still some way to go with this study, and Cray have offered to help, so we don't have generic advice at the moment. We just need a bit more time!

Lois

comment:2 Changed 8 years ago by ros

  • Resolution set to fixed
  • Status changed from new to closed
  • UM Version <select version> deleted