Opened 13 months ago

Last modified 7 months ago

#1857 reopened help

Performance of Rose suite on Monsoon

Reported by: cthomas Owned by: annette
Priority: normal Component: NEMO/CICE
Keywords: NEMO, NEMOVAR, Rose, perftools Cc:
Platform: MONSooN UM Version: <select version>

Description

Hi all,

I am running a Rose assimilation suite on Monsoon (and ARCHER) which uses NEMO/CICE and NEMOVAR. There are three main stages which repeat each cycle: the observation operator (obsoper), NEMOVAR, and finally the incremental analysis update (IAU). The suite was primarily developed at the Met Office and is divided into separate Rose tasks, which works well on Monsoon but runs into problems on ARCHER because of the significantly longer queue times there. In order to run the suite more efficiently on ARCHER I have converted the tasks into one monolithic job which uses the same set of processors for all three stages mentioned above, instead of claiming and releasing resources each time.

I have two questions about this. Firstly, I am interested in whether it's possible to check the performance of the job on Monsoon using perftools or similar, in order to identify any places where the job could be made more efficient. I found a similar question for the UM (http://cms.ncas.ac.uk/ticket/1711), so I hope there is an equivalent way to do it for Rose.

Secondly, I would be interested to know if any performance gain could be achieved between the IAU and obsoper stages. The IAU produces the analysis from one cycle, which is then used as the background in the obsoper stage of the next cycle. At the moment the analysis is written to disk (as a netCDF file) during the IAU before being read in again by the obsoper. We were wondering whether keeping it in memory is possible at all, and if so, whether doing that would actually bring any benefit; in other words, is reading from memory faster or slower than reading from disk? I think at the very least there might be issues with the namelist (NEMO complains if you have both ln_asmiau and ln_bkgwri set to true, for example), but if that can be overcome then maybe it's possible… I realise this might be a tricky question to answer but if you had any advice I'd be very interested to hear it.
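For reference, the two flags I mean are logicals in the NEMO assimilation increments namelist (&nam_asminc in our version, I believe); the combination that NEMO rejects is:

    &nam_asminc
       ln_bkgwri = .true.   ! write the background state out to file
       ln_asmiau = .true.   ! apply the increments via IAU (not allowed together with ln_bkgwri)
    /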

Thanks,
Chris

Change History (14)

comment:1 Changed 11 months ago by annette

  • Owner changed from um_support to annette
  • Status changed from new to assigned

Hi Chris,

Well done on sorting out the tasks spawned by the NEMOVAR suite. I'm not sure if you are aware of "rose bunch", or have maybe used it already? I only heard about it last week, but it's a feature of Rose that allows you to group tasks from the same family into a single job:

http://metomi.github.io/rose/doc/rose-rug-advanced-tutorials-bunch.html
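To give a flavour of it, the rose-app.conf for a bunch app looks roughly like the tutorial example below (the command and argument names here are placeholders rather than anything from your suite):

    mode=rose_bunch

    [bunch]
    # run the command once per argument value, with up to 2 instances at a time
    command-format=run_obs_task %(member)s
    pool-size=2

    [bunch-args]
    member=01 02 03

Each instance then runs within a single Rose task rather than as a separate job in the queue.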

I think you should be able to use CrayPAT with Rose; in fact I'm pretty sure they do this at the Met Office. I haven't tried it myself, but I can look into it for you.

Reading from memory will be much quicker than reading from disk. Without looking at the code I'm not sure how much effort it would take to combine these NEMO steps, but it would probably be worth confirming that the write/read round trip is actually a bottleneck first.

Hope this helps (and sorry for the delay in replying),

Annette

comment:2 Changed 9 months ago by cthomas

Hi Annette,

It sounds like starting with craypat is the best idea. If you have a few tips that would be great!
Thanks,
Chris

comment:3 Changed 9 months ago by annette

Hi Chris,

To get CrayPAT working I would try running something like this example from the ARCHER website (it should work in much the same way on MONSooN):
http://www.archer.ac.uk/documentation/best-practice-guide/performance.php#sec-5.2

Based on my old NEMOVAR suite, the steps I think you need to follow are:

  1. Edit the relevant config file so that FCM does not delete the intermediate build files, as these are needed by CrayPAT. I think this should go in app/fcm_make_nemo/file/xc40-cce-opt.cfg:
    build-ocean.prop{keep-lib-o} = true
    
  2. Then in your suite.rc file add module load perftools to the pre-command scripting for your fcm_make2_nemo task (see the sketch further down).
  3. Set the suite to build but not run. I think you may need to do a full re-build here.
  4. Once the build has run, go to the directory holding the executable, ~/cylc-run/<suite-id>/share/fcm_make_nemo/build-ocean/bin, and run:
    pat_build -o nemo.exe+samp nemo.exe
    
    This creates a new executable called nemo.exe+samp.
  5. Now edit the run script in the suite to use the new executable. In my suite I think this is app/nemo_cice/bin/run_nemo_cice.sh.
  6. Switch off the build and run the model.
  7. You should find some CrayPAT output in your work directory. You can run pat_report on the output as in the ARCHER example, e.g.:
    pat_report -o nemo_samp.pat nemo+samp+55790-22s.xf
    

I haven't actually tried this, but Grenville has run CrayPAT with a UM Rose suite on ARCHER in a similar fashion.
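For step 2, a minimal sketch of what the suite.rc change might look like (the task name and surrounding structure will depend on your suite, so treat this as illustrative rather than exact):

    [runtime]
        [[fcm_make2_nemo]]
            # load the CrayPAT module before the fcm_make2 step runs
            pre-command scripting = module load perftools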

If you have any issues let me know.

Annette

comment:4 Changed 9 months ago by cthomas

Hi Annette,

Thanks! I'll try this and let you know how it goes.
Chris

comment:5 Changed 8 months ago by annette

  • Resolution set to answered
  • Status changed from assigned to closed

Chris,

We are having a tidy up of the helpdesk, so I'm going to close this ticket as we have answered it. If you find this doesn't work you can reopen it or create a new one.

Annette

comment:6 Changed 8 months ago by cthomas

Hi Annette,

I'm reopening this because I have been testing it on Monsoon and have run into a problem at step 4. (By the way, for the record, it's necessary to load cray-netcdf as well as perftools in step 2.)

I am doing this logged into xcml00. After moving to the bin directory, I do:

module load perftools
pat_build -o nemo_orca025.exe+samp nemo_orca025.exe

The output is:

INFO: A maximum of 130 functions from group 'caf' will be traced.
INFO: A maximum of 133 functions from group 'mpi' will be traced.
INFO: A maximum of 25 functions from group 'realtime' will be traced.
INFO: A maximum of 62 functions from group 'syscall' will be traced.
INFO: A maximum of 100 functions from group 'upc' will be traced.
/opt/cray/cce/8.3.4/cray-binutils/x86_64-unknown-linux-gnu/bin/ld: cannot find -lalpslli
/opt/cray/cce/8.3.4/cray-binutils/x86_64-unknown-linux-gnu/bin/ld: cannot find -lalpsutil
FATAL: The executable '/opt/cray/cce/8.3.4/cray-binutils/x86_64-unknown-linux-gnu/bin/ld' returned error status 0x100.

It looks like some required libraries haven't been loaded, but the (guessed) module load alps doesn't help.

Thanks,
Chris

comment:7 Changed 8 months ago by cthomas

  • Priority changed from low to normal
  • Resolution answered deleted
  • Status changed from closed to reopened

comment:8 Changed 8 months ago by annette

Hi Chris,

I have tested this with a small test program on MONSooN and it seems OK. What is your suite-id?

Annette

comment:9 Changed 7 months ago by cthomas

Hi Annette,

It is puma-aa328. To run it you need to set BUILD to true in rose-suite.conf. The CrayPAT lines in xc40-cce-opt.cfg and suite.rc are currently commented out, but I have marked them with "CrayPAT" comments to make them easier to find.

Chris

comment:10 Changed 7 months ago by annette

Hi Chris,

The error you get looks very much like this one:
http://cms.ncas.ac.uk/ticket/1711#comment:3

Could you try the Cray suggestion:

pat_build -D link-instr=-L/opt/cray/alps/default/lib64 ...
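
With the executable names from comment:6, I would guess the full command is something like this (untested):

    pat_build -D link-instr=-L/opt/cray/alps/default/lib64 -o nemo_orca025.exe+samp nemo_orca025.exe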

Annette

comment:11 Changed 7 months ago by cthomas

Hi Annette,

That compiled, thanks! I have not yet tried to run it, but if I have any problems when doing so I'll let you know.

Cheers,
Chris

comment:12 Changed 7 months ago by cthomas

To follow up on this, I have run a complete assimilation cycle (obsoper → nemovar → IAU) with the profiling enabled. Running pat_report on the output from each stage produced .pat and .apa files. The .pat files show the breakdown of usage by function.

For obsoper:

  • 62.3% USER (further broken down into different function calls in NEMO)
  • 21.6% MPI
  • 10.5% IO
  • 5.5% ETC

For nemovar:

  • 62.8% MPI
  • 19.4% USER
  • 10.0% IO
  • 6.3% ETC
  • 1.5% BLAS

Although the USER and MPI fractions are very different between the two stages, in both cases they are larger than IO. Does this imply that file reading/writing is not a significant bottleneck?

comment:13 Changed 7 months ago by annette

Hi Chris,

There may be other costs associated with the file reads and writes (beyond the system IO calls), depending on how the code is written. For example, if the reads were done by a single process, with the data then sent on to the other processes, there would be MPI costs and load imbalance involved. There may also be some processing of the data before it is written.

So it might be worth looking at the times for the function calls where the data is read or written, or adding in your own timers.

Annette

comment:14 Changed 7 months ago by cthomas

Hi Annette,

Here's the full obsoper table:

  Samp% |    Samp |  Imb. |  Imb. |Group
        |         |  Samp | Samp% | Function
        |         |       |       |  PE=HIDE

 100.0% | 24316.1 |    -- |    -- |Total
|-----------------------------------------------------------------------
|  62.3% | 15140.1 |    -- |    -- |USER
||----------------------------------------------------------------------
||   5.7% |  1381.5 |  32.5 |  2.3% |theta2t$insitu_tem_
||   5.2% |  1256.9 |  35.1 |  2.7% |obs_grd_bruteforce$obs_grid_
||   3.7% |   902.8 |  99.2 | 10.0% |init_field_bufferize$field_bufferize_
||   3.5% |   852.4 |  69.6 |  7.6% |iom__write_field3d$iomanager_
||   3.0% |   732.2 |  30.8 |  4.1% |tra_ldf_iso$traldf_iso_
||   2.8% |   677.2 |  44.8 |  6.2% |tra_adv_tvd$traadv_tvd_
||   2.7% |   646.6 |  48.4 |  7.0% |ldf_slp$ldfslp_
||   2.4% |   571.6 |  25.4 |  4.3% |nonosc$traadv_tvd_
||   2.3% |   561.0 |  15.0 |  2.6% |tra_zdf_imp$trazdf_imp_
||   2.1% |   511.9 |  15.1 |  2.9% |tke_tke$zdftke_
||   2.0% |   476.8 |  36.2 |  7.1% |glob_sum_2d$lib_fortran_
||   1.9% |   460.6 |  11.4 |  2.4% |dyn_zdf_imp$dynzdf_imp_
||   1.9% |   455.8 |  62.2 | 12.1% |histw_rnd$histcom_
||   1.6% |   398.9 |  27.1 |  6.4% |dyn_ldf_bilap$dynldf_bilap_
||   1.6% |   384.7 |  21.3 |  5.3% |tke_avn$zdftke_
||   1.2% |   281.9 |  11.1 |  3.8% |tmx_itf$zdftmx_
||   1.1% |   260.8 |  61.2 | 19.1% |moycum$mathelp_
||   1.0% |   240.4 |  39.6 | 14.2% |histwrite_real$histcom_
||======================================================================
|  21.6% |  5261.5 |    -- |    -- |MPI
||----------------------------------------------------------------------
||  10.1% |  2463.0 | 445.0 | 15.4% |mpi_waitall
||   5.0% |  1222.2 | 647.8 | 34.8% |MPI_RECV
||   3.6% |   883.8 | 343.2 | 28.1% |MPI_ALLREDUCE
||   1.8% |   430.0 | 768.0 | 64.4% |MPI_WAIT
||======================================================================
|  10.5% |  2555.3 |    -- |    -- |IO
||----------------------------------------------------------------------
|   9.8% |  2371.5 |  66.5 |  2.7% | read
||======================================================================
|   5.5% |  1347.6 |    -- |    -- |ETC
||----------------------------------------------------------------------
||   3.3% |   801.3 | 464.7 | 36.9% |__cray_dcopy_HSW
||   1.5% |   354.2 | 327.8 | 48.3% |__cray_dset_HSW
|=======================================================================

Of the more expensive routines, init_field_bufferize and iom__write_field3d are related to IO.

(NB obs_grd_bruteforce is called because I set ln_grid_global to true in NEMO in order to avoid a crash.)

I'm not sure if the MPI-related calls can be disentangled any further.

The equivalent nemovar table is:

  Samp% |    Samp |    Imb. |  Imb. |Group
        |         |    Samp | Samp% | Function
        |         |         |       |  PE=HIDE

 100.0% | 71879.3 |      -- |    -- |Total
|-----------------------------------------------------------------------
|  62.8% | 45154.3 |      -- |    -- |MPI
||----------------------------------------------------------------------
||  31.3% | 22494.1 | 11843.9 | 34.7% |MPI_RECV
||  28.0% | 20128.9 |  8135.1 | 28.9% |MPI_WAIT
||   2.6% |  1881.9 |  4741.1 | 72.0% |MPI_ALLREDUCE
||======================================================================
|  19.4% | 13909.8 |      -- |    -- |USER
||----------------------------------------------------------------------
||   5.2% |  3742.8 |  3745.2 | 50.3% |dif_imp1d_chol_3d$app_dif_
||   3.7% |  2692.6 |  2366.4 | 47.0% |difadj_imp1d_chol_3d$app_dif_
||   2.4% |  1729.8 |    67.2 |  3.8% |comp_sum$mppsum_
||   1.5% |  1094.5 |    24.5 |  2.2% |dot_product_ctlvec$control_vectors_
||======================================================================
|  10.0% |  7165.4 |      -- |    -- |IO
||----------------------------------------------------------------------
|   9.9% |  7100.8 |   280.2 |  3.8% | read
||======================================================================
|   6.3% |  4558.0 |      -- |    -- |ETC
||----------------------------------------------------------------------
||   4.4% |  3137.8 |  1430.2 | 31.5% |__cray_dcopy_HSW
||   1.6% |  1136.3 |   433.7 | 27.8% |__cray_dset_HSW
||======================================================================
|   1.5% |  1090.6 |      -- |    -- |BLAS
||----------------------------------------------------------------------
|   1.2% |   873.7 |    32.3 |  3.6% | daxpy_
|=======================================================================

This is a bit less obvious, but perhaps more of the cost is hidden inside the MPI routines, as you suggested.

Thanks,
Chris
