Work undertaken by NCAS-CMS to optimise Pier Luigi Vidale's N512 GA7 runs has resulted in a 15-20% speed-up of the model with no loss of bit comparability. The resulting savings in ARCHER resource are very significant: ~£60k at the ARCHER partner rate, and twice that at the non-partner rate. Another way of looking at this is that we save ~100M AU (for these experiments), which enables a raft of ARCHER projects that might otherwise struggle.
In the standard GA7, stochastic physics is turned on. This holds a field in spectral space, distributed over the PEs. Every timestep this field is gathered by PE0, converted to grid space and then distributed over the domain. This requires a gather and many scatters at every timestep. A month-long, high-resolution N512 job was examined with DrHook, and it was noted that FOR_PATTERN, the routine which does the gather/scatter and the spectral→grid space transformation, takes a significant proportion of the run time: 513s of a 3000s run.
FOR_PATTERN has been rewritten to remove all gathers and scatters of the spectral field. Instead of having the spectral rows distributed over all PEs, each PE holds the spectral rows equivalent to its own rows in grid space. It then does a local Fourier transform to get back to grid space, and extracts its own longitude domain from the resultant field.
This requires extra compute, as every PE in a latitude band does the same spectral→grid space transformation, but the savings in communication time can far outweigh this. Gathers and scatters are best avoided when running on a large number (1000s) of PEs, as they can take many milliseconds to perform.
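The trade-off described above can be illustrated with a toy, single-process sketch. This is not the FOR_PATTERN code: it uses a naive one-dimensional inverse DFT on a single row, and the names `inverse_dft`, `local_pattern`, `gather_scatter_pattern` and the `lon_bounds` decomposition are hypothetical, chosen only for illustration. It shows that if every PE in a latitude band transforms the full row itself and then keeps only its own longitudes, it ends up with the same values it would have received from a scatter.

```python
import cmath

def inverse_dft(coeffs):
    """Naive inverse DFT of one row of spectral coefficients to grid points."""
    n = len(coeffs)
    return [
        sum(c * cmath.exp(2j * cmath.pi * k * x / n) for k, c in enumerate(coeffs)) / n
        for x in range(n)
    ]

def gather_scatter_pattern(spectral_row, lon_bounds):
    """Old scheme (toy): one PE transforms the row, then hands out longitude slices."""
    full_row = inverse_dft(spectral_row)
    return [full_row[start:end] for start, end in lon_bounds]

def local_pattern(spectral_row, lon_start, lon_end):
    """New scheme (toy): a PE transforms the whole row itself and keeps its own slice."""
    full_row = inverse_dft(spectral_row)  # redundant work, repeated on every PE in the band
    return full_row[lon_start:lon_end]

# Hypothetical row of 8 spectral coefficients, split over 4 PEs in longitude.
spectral_row = [1 + 0j, 0.5 - 0.25j, 0j, 0.1j, 0j, 0j, 0j, 0j]
lon_bounds = [(0, 2), (2, 4), (4, 6), (6, 8)]

scattered = gather_scatter_pattern(spectral_row, lon_bounds)
local = [local_pattern(spectral_row, start, end) for start, end in lon_bounds]

# Every PE performs the identical computation, so the slices match exactly.
assert local == scattered
```

In the sketch the transform is repeated once per PE, which is the extra compute mentioned above; what is removed is the communication step, which the toy cannot time.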
With the new code, FOR_PATTERN takes 23s, compared with 513s.
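As a rough check of what these timings imply, the short calculation below works out the share of run time involved. It assumes, which the text does not state explicitly, that the 23s figure was measured on the same 3000s job as the 513s figure.

```python
# Timings quoted in the text (seconds).
run_time_s = 3000.0     # total run time of the profiled job
old_pattern_s = 513.0   # FOR_PATTERN before the rewrite
new_pattern_s = 23.0    # FOR_PATTERN after the rewrite (assumed: same job)

old_fraction = old_pattern_s / run_time_s      # share of the run spent in FOR_PATTERN
saved_s = old_pattern_s - new_pattern_s        # time removed from the run
saved_fraction = saved_s / run_time_s          # share of the original run time saved
routine_speedup = old_pattern_s / new_pattern_s

print(f"FOR_PATTERN was {old_fraction:.1%} of the run")
print(f"Saving {saved_s:.0f}s, i.e. {saved_fraction:.1%} of the original run time")
print(f"Routine is about {routine_speedup:.0f}x faster")
```

Under that assumption the routine went from about 17% of the run to under 1%, a saving of roughly 16% of the original run time.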
Two test jobs have been run: a high-resolution N512 configuration and an AMIP N96 configuration. For both there is full bit comparability with the previous version.
The N512 GA7 is now being run on ARCHER with this branch. The speed increase, as reported by Pier Luigi Vidale:
So, for two domain decompositions and for a 2-month dump at N512:
48x48: ~7 hours, but a few times as short as 6hrs40mins (down from ~8 hours)
48x72: 4hrs50m to 5hrs02m (down from ~6 hours)
So a ~15-20% speed-up.
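The quoted wallclock times can be turned into percentages as follows. This is only a sketch: the "before" times are the approximate values given above (~8 and ~6 hours), and "speed-up" is taken here to mean the fractional reduction in wallclock time, which is one of several possible definitions. The helper name `reduction` is hypothetical.

```python
def reduction(old_minutes, new_minutes):
    """Fractional reduction in wallclock time going from old to new."""
    return 1.0 - new_minutes / old_minutes

# Wallclock times in minutes, taken from the figures quoted above.
runs = {
    "48x48 typical (~7h, from ~8h)": (8 * 60, 7 * 60),
    "48x48 best (6h40, from ~8h)": (8 * 60, 6 * 60 + 40),
    "48x72 fastest (4h50, from ~6h)": (6 * 60, 4 * 60 + 50),
    "48x72 slowest (5h02, from ~6h)": (6 * 60, 5 * 60 + 2),
}

results = {label: reduction(old, new) for label, (old, new) in runs.items()}
for label, value in results.items():
    print(f"{label}: {value:.1%} less wallclock time")
```

With this definition the four cases come out between about 12% and 19%, broadly in line with the quoted ~15-20%.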
For the AMIP GA7 runs there was no discernible change in speed, but this is not surprising, as the gather/scatter is much faster when running on a small number of PEs.