wiki:Archer/DDT

DDT is a powerful debugging tool that lets you step interactively through code, set breakpoints and tracepoints, examine variables, debug memory, and more.

Documentation on how to use the tool is available (try http://content.allinea.com/downloads/userguide-forge.pdf), but how to get it running in the UM infrastructure is less obvious (though not difficult). The instructions given here refer to a UM 8.2 setup; other versions of the model may need minor changes, but these should be small (contact the CMS if you need further assistance).

You will need some familiarity with where the UM puts its various files and scripts. In particular, you will need a umui_runs directory for the job you wish to debug on ARCHER and you will need to modify the qsatmos script, usually found in $DATAW/bin.

Do this:

1. Build the UM with the -g flag set

2. Submit a run for the failing model but kill the job before it runs (qdel the job) - this will ensure that you have a umui_runs directory available, which is needed later

3. Edit the qsatmos script, change

    if [[ "$OASIS" = true ]]; then
      aprun `cat OASIScoupled.conf` >> $OUTPUT
    else
      echo aprun -n $UM_IOS_NPES -N $NTASKS_PER_NODE -d $NTHREADS_PER_TASK \
          -S $NTASKS_PER_NUMANODE -ss $LOADMODULE >>$OUTPUT
      aprun -n $UM_IOS_NPES -N $NTASKS_PER_NODE -d $NTHREADS_PER_TASK \
          -S $NTASKS_PER_NUMANODE -ss $LOADMODULE >>$OUTPUT
    fi

to

    if [[ "$OASIS" = true ]]; then
      aprun `cat OASIScoupled.conf` >> $OUTPUT
    else
      echo ddt -start -noqueue -n $UM_IOS_NPES -mpiargs "-N $NTASKS_PER_NODE -d $NTHREADS_PER_TASK -S $NTASKS_PER_NUMANODE -ss" $LOADMODULE >>$OUTPUT
      ddt -start -noqueue -n $UM_IOS_NPES -mpiargs "-N $NTASKS_PER_NODE -d $NTHREADS_PER_TASK -S $NTASKS_PER_NUMANODE -ss" $LOADMODULE >>$OUTPUT
    fi
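
If you switch between normal and debugging runs often, the launch line could be made switchable rather than edited by hand each time. The following is only a sketch under stated assumptions: UM_USE_DDT and launch_cmd are hypothetical names that do not exist in the UM scripts, and the function only prints the command so that it can be inspected before anything is wired into qsatmos.

```shell
# Sketch only: UM_USE_DDT and launch_cmd are hypothetical names, not part of the UM scripts.
# Prints the launch command for the non-OASIS branch of qsatmos.
# The body runs in a subshell, so aprun_opts does not leak into the caller.
launch_cmd() (
  aprun_opts="-N $NTASKS_PER_NODE -d $NTHREADS_PER_TASK -S $NTASKS_PER_NUMANODE -ss"
  if [ "${UM_USE_DDT:-false}" = true ]; then
    # Debug run: the placement options are handed to DDT inside -mpiargs
    echo "ddt -start -noqueue -n $UM_IOS_NPES -mpiargs \"$aprun_opts\" $LOADMODULE"
  else
    # Normal run: the unchanged aprun launch
    echo "aprun -n $UM_IOS_NPES $aprun_opts $LOADMODULE"
  fi
)
```

Printing the command first mirrors the echo line that qsatmos already writes to $OUTPUT. Running the printed command would need something like eval "$(launch_cmd)" >>$OUTPUT, which has not been tested against a real qsatmos.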

4. Get an interactive ARCHER session (see http://www.archer.ac.uk/documentation/user-guide/batch.php#sec-5.4.8). In this example, I requested an interactive session in the short queue on 4 nodes for 20 minutes; the -X flag is needed for X window forwarding. You will be subject to the normal queue wait times, just as when submitting any other job:

grenvill@eslogin005 qsub -q short -X -IVl select=4,walltime=0:20:0 -A n02-cms
qsub: waiting for job 2768460.sdb to start
qsub: job 2768460.sdb ready

--------------------------------------------------------------------------------
*** grenvill   Job: 2768460.sdb   started: 01/04/15 10:33:44   host: mom3 ***
*** grenvill   Job: 2768460.sdb   started: 01/04/15 10:33:44   host: mom3 ***
*** grenvill   Job: 2768460.sdb   started: 01/04/15 10:33:44   host: mom3 ***
*** grenvill   Job: 2768460.sdb   started: 01/04/15 10:33:44   host: mom3 ***

--------------------------------------------------------------------------------
grenvill@mom3:~> 

At this stage you will be on a job-launcher node (mom3 in this case) and can run aprun directly, i.e. launch a parallel job yourself rather than through the scheduler.

5. cd to the umui_runs directory for the failing job (xlehy in this example), i.e. the directory created in step 2, load the allinea module, and run the submit script interactively:

grenvill@mom3 cd ~/umui_runs/xlehy-091105123
grenvill@mom3 module load allinea
grenvill@mom3 ./umuisubmit_run

DDT should run - you'll see the DDT logo and a few seconds later the debugging window will appear like this:

[Image: DDT startup window]

It is probably best to ensure that the resources needed for the job you wish to debug match those requested in the interactive session. In this example I requested 4 interactive nodes, and the job was configured to run on 4x12 MPI tasks, each with 2 OpenMP threads, for a total of 4 nodes.
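
The node count to request can be checked with a little arithmetic before asking for the interactive session. This is a minimal sketch: nodes_needed is a hypothetical helper, and the default of 24 cores per node is an assumption inferred from the example above (48 tasks x 2 threads on 4 nodes), so pass a different third argument if your machine differs.

```shell
# Sketch only: nodes_needed is a hypothetical helper.
# Assumption: 24 cores per node, inferred from 48 tasks x 2 threads filling 4 nodes.
nodes_needed() (
  tasks=$1
  threads=$2
  cores_per_node=${3:-24}
  cores=$((tasks * threads))
  # Round up so that a partly used node still counts as a whole node
  echo $(( (cores + cores_per_node - 1) / cores_per_node ))
)

nodes_needed 48 2   # the example job: 4x12 MPI tasks with 2 threads each; prints 4
```

The result is the value to use for select= in the qsub request.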

Memory Debugging

There are several extra steps needed for memory debugging. The model must be linked with the appropriate library; this is most easily done by modifying the bld.cfg file. For the UM 8.2 example discussed here, I added

-L $ALLINEA_TOOLS_DIR/lib/64 -Wl,--whole-archive -ldmallocthcxx -Wl,--no-whole-archive

at the beginning of the link line; the same should hold for other UM versions. Then relink the code: at the ARCHER command line type

module load allinea
fcm build

Make sure the new executable is copied to the appropriate place so that it is picked up when the model runs.
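
One way to confirm that the run will pick up the relinked executable is to compare the freshly built file with the copy the run uses. This is a sketch only: same_executable is a hypothetical helper and the paths in the usage comment are placeholders, since where each file lives depends on your job setup.

```shell
# Sketch only: same_executable is a hypothetical helper; the paths shown in the
# usage comment are placeholders for your own build output and run copy.
same_executable() {
  # cmp -s prints nothing and succeeds only if the two files are byte-identical
  cmp -s "$1" "$2"
}

# Example use (placeholder path for the build output):
# same_executable /path/to/build/bin/um.exe "$LOADMODULE" || echo "run copy is stale"
```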

Change the ddt line in qsatmos as follows, i.e. remove -start and add -memory:

ddt  -noqueue -memory -n $UM_IOS_NPES -mpiargs "-N $NTASKS_PER_NODE -d $NTHREADS_PER_TASK -S $NTASKS_PER_NUMANODE -ss" $LOADMODULE >>$OUTPUT
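
The same change can be applied mechanically rather than by hand. The filter below is a sketch under stated assumptions: to_memory_mode is a hypothetical helper, and it assumes that every line containing "ddt " in the input is a DDT launch line of the form shown in step 3. It reads script text on standard input and writes the modified text to standard output, so nothing is changed in place.

```shell
# Sketch only: to_memory_mode is a hypothetical helper, not part of the UM scripts.
# On lines containing "ddt ", drop -start and add -memory after -noqueue.
# Lines that already carry -memory are left alone, so a second pass changes nothing.
to_memory_mode() {
  sed -e '/ddt /s/ -start / /' \
      -e '/ddt /{/-memory/!s/-noqueue/-noqueue -memory/;}'
}

# Example use: write a modified copy and review it before replacing the original.
# to_memory_mode < qsatmos > qsatmos.memory
```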

Get an interactive session as before and follow the same submission procedure. You should now see

[Image: DDT debug start window]

You can change the memory debugging options and/or press Run to start the job. The main debugging window will then appear. You can pause the job and check memory usage.

Several memory graphics will now be available; here are a few examples:

[Images: memory usage, memory stats-1, memory stats-2]

Last modified on 21/04/15 15:32:48