As we move towards the exascale era, HPC systems are becoming increasingly complex, with, for example, co-processors (e.g. GPUs, Intel Xeon Phis), non-uniform memory access (NUMA) nodes, hyper-threading and shared floating-point units, as well as the trend towards larger numbers of slower cores with less memory per core. This makes it challenging to predict the performance of real applications on new and forthcoming machines.
Similarly, scientific applications are evolving to solve larger and more intricate problems, which involves exploring more scalable model formulations, numerical methods and algorithms. An example is Gung-Ho!, the joint Met Office and NERC project to develop the next generation Unified Model weather and climate prediction system: http://collab.metoffice.gov.uk/twiki/bin/view/Project/NGWCP
Therefore, we want to know how current and future models will perform on current and emerging systems, in terms of time-to-solution, resource efficiency and problem scalability. Performance modelling can be used as a tool to explore answers to these questions.
Here, a performance model encapsulates information about the software and hardware in order to predict the elapsed wallclock run time for a given scenario.
Shallow water model
Using a simple shallow water code to simulate some of the execution patterns of a sophisticated climate model, we have been exploring a benchmark-driven performance modelling approach to evaluate runtime choices. Benchmarking allows a model to be developed rapidly without explicitly representing complex features of the hardware, although it does require the application to exist and be runnable on the target machine. It is hoped that this approach will provide an alternative to expensive and time-consuming trial-and-error testing to find the configuration of architectural parameters that gives the best performance.
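As a rough illustration of the idea, the following Python sketch combines benchmark measurements into a runtime prediction for a 2D domain decomposition. It is not the model from the paper: the `Benchmarks` fields, the `predict_runtime` function, the four-neighbour halo exchange and all numeric values are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Benchmarks:
    """Hypothetical benchmark results measured on the target machine."""
    seconds_per_point: float      # compute cost per grid point per timestep
    latency_s: float              # per-message latency
    bandwidth_bytes_per_s: float  # point-to-point bandwidth


def predict_runtime(nx, ny, px, py, steps, bench,
                    halo_fields=3, bytes_per_value=8):
    """Predict wallclock seconds for an nx*ny grid on a px*py process grid.

    Assumes every rank exchanges one halo message with each of four
    neighbours per timestep, with no overlap of compute and communication.
    """
    # Ceiling division: the largest subdomain sets the pace.
    local_nx = -(-nx // px)
    local_ny = -(-ny // py)

    compute = local_nx * local_ny * bench.seconds_per_point

    # Two messages along x-edges and two along y-edges.
    edge_lengths = (local_nx, local_nx, local_ny, local_ny)
    comm = sum(
        bench.latency_s
        + halo_fields * n * bytes_per_value / bench.bandwidth_bytes_per_s
        for n in edge_lengths
    )
    return steps * (compute + comm)


# Strong-scaling sweep with placeholder benchmark values (not measured data).
bench = Benchmarks(seconds_per_point=1e-7, latency_s=1e-6,
                   bandwidth_bytes_per_s=1e9)
for p in (1, 2, 4, 8):
    t = predict_runtime(1024, 1024, p, p, steps=100, bench=bench)
    print(f"{p * p:3d} ranks: {t:.4f} s")
```

Keeping the machine description as a handful of measured numbers is what makes this style of model quick to build. The cost is that the benchmarks must be rerun on each target system.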
The model has been successfully used to replicate weak and strong scaling experiments up to 16,000 cores and has been used to explore MPI rank to physical core mapping strategies on a Cray XE6 machine with Gemini interconnect and AMD Interlagos processors. In the figure below, the model correctly predicts that a custom rank mapping that minimises off-node transfers will give a lower communication time than the default SMP-style mapping or a round-robin mapping.
Work is currently underway to evaluate the modelling technique by repeating this work on an IBM Power 7 and IBM BlueGene/Q.
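To see why the mapping matters, the toy Python sketch below counts nearest-neighbour messages that cross a node boundary under three placements of ranks on nodes. It is only a proxy for communication time, not the performance model itself. The non-periodic process grid, the function names and the block-shaped "custom" mapping are all assumptions made for this example.

```python
def neighbours(rank, px, py):
    """Yield the ranks adjacent to `rank` on a non-periodic px*py grid."""
    i, j = rank % px, rank // px
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < px and 0 <= nj < py:
            yield nj * px + ni


def smp_mapping(rank, px, py, cores_per_node):
    """SMP-style: consecutive ranks fill one node before the next."""
    return rank // cores_per_node


def round_robin_mapping(rank, px, py, cores_per_node):
    """Round-robin: consecutive ranks are dealt out across the nodes."""
    n_nodes = (px * py) // cores_per_node
    return rank % n_nodes


def make_block_mapping(bx, by):
    """Hypothetical custom mapping: each node holds a bx*by tile of ranks."""
    def mapping(rank, px, py, cores_per_node):
        i, j = rank % px, rank // px
        blocks_per_row = px // bx
        return (j // by) * blocks_per_row + (i // bx)
    return mapping


def off_node_messages(mapping, px, py, cores_per_node):
    """Count neighbour messages whose sender and receiver are on different nodes."""
    count = 0
    for rank in range(px * py):
        node = mapping(rank, px, py, cores_per_node)
        for nb in neighbours(rank, px, py):
            if mapping(nb, px, py, cores_per_node) != node:
                count += 1
    return count


# 8x8 process grid on nodes of 16 cores (illustrative sizes only).
for name, mapping in (("SMP-style", smp_mapping),
                      ("round-robin", round_robin_mapping),
                      ("4x4 blocks", make_block_mapping(4, 4))):
    print(name, off_node_messages(mapping, 8, 8, 16))
```

For this toy case the counts are 48 for SMP-style, 112 for round-robin and 32 for the block mapping. That ordering matches the ranking described above, although real communication time also depends on message sizes and the network topology.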
Paper: A. Osprey, G. D. Riley, M. Manjunathaiah, and B. N. Lawrence. "A Benchmark-Driven Modelling Approach for Evaluating Deployment Choices on a Multicore Architecture". Proceedings of the International Conference on Parallel & Distributed Processing Techniques & Applications. 2013. pdf
Poster: A. Osprey, G. D. Riley, B. N. Lawrence, and M. Manjunathaiah. "Benchmark-driven performance modelling for multi-core architectures". NCAS staff meeting, June 2012. pdf
Poster: A. Osprey, G. D. Riley, B. N. Lawrence, and M. Manjunathaiah. "Modelling the Performance of a Shallow Water Code". NCAS staff meeting, July 2010. pdf
We looked at the performance of a HadGEM3 configuration at UM 7.3 for N96 and N216 global problem sizes. At the time this work was carried out, there was concern that the UM new dynamics would be limited in its ability to scale to very high resolution global grids (see, for example: http://research.metoffice.gov.uk/research/nwp/publications/mosac/doc-2009-10.pdf). We examined:
- How the computational and communication performance of different parts of the model scaled with core count and resolution.
- How the time to complete a timestep varied over a 3 day run.
- How the number of solver iterations required to converge changed with resolution.
This information was also used to outline an analytical application performance model for the UM configuration. The figure below shows a task graph representation of the application model:
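One common way to evaluate a task graph model is to assign each task a predicted cost and take the longest path through the graph as the predicted time. The Python sketch below shows that calculation under stated assumptions: the task names and costs are hypothetical placeholders, not the tasks or timings of the UM configuration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path_time(costs, deps):
    """Return the longest-path time through a task graph.

    costs: task name -> predicted time in seconds
    deps:  task name -> set of tasks that must finish before it starts
    """
    finish = {}
    # static_order() yields every task after all of its predecessors.
    for task in TopologicalSorter(deps).static_order():
        start = max((finish[d] for d in deps.get(task, ())), default=0.0)
        finish[task] = start + costs[task]
    return max(finish.values())


# Hypothetical timestep: two independent tasks follow "a", then both feed "d".
costs = {"a": 1.0, "b": 2.0, "c": 0.5, "d": 1.0}
deps = {"b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(critical_path_time(costs, deps))
```

In a fuller model each cost would itself be a function of core count and resolution. For example, a solver task could be costed as the number of iterations multiplied by a per-iteration time.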
Poster: A. Osprey, L. Steenman-Clark, M. Manjunathaiah, and G. D. Riley. "Performance Modelling for Climate Models". NCAS staff meeting, July 2013. pdf