Posts by author William McGinty

MONC supersedes LEM

The Met Office/NERC Cloud Model (MONC) is a large eddy simulation model. Its predecessor was the Met Office Large Eddy Model (LEM). MONC includes all the LEM science but is written in modern Fortran using modern programming techniques. It is much faster and more scalable than the LEM, and can handle larger domains.

MONC can be run on a variety of computers, for example Linux boxes, MONSooN and ARCHER.

MONC has a BSD license and can be accessed at the Met Office Science Repository Service.

Lustre pruning in the SWAMMA models

The SWAMMA models can generate very large amounts of data while they are running. For example, the 12km model generates 4.6 TB and the 4km models generate 52 TB over 153-day runs. The amount of work space on the Lustre (/work) file system is generally limited, so there is a danger that the model will crash when it runs out of space. Only 24 TB was available, and when two 4km models were running in parallel the risk was considerable.

Model   Data Vol (TB)   CRUN (days)   Rate (model days/day)
12km    4.6             30            28
4km     52              10            9.2

This problem was averted by archiving to the ARCHER Research Data Facility (RDF) while the model was running, with a separate process transferring data from the RDF to the JASMIN archive. Once data was archived it was deleted from the Lustre disc. However, it is essential to leave a number of complete CRUNs on the disc so that the model can continue to run and, in the event of a disaster, can be restarted from the end of a previous CRUN without any recovery data transfers.

Three scripts were involved,

  • manual_prune,
  • prune_workdir,
  • the 4km bottom script, lbc_update_4km_v2.scr.

The first two were manual processes that operated asynchronously from the model. The last was executed at the end of each CRUN in the 4km model only, and was essentially the first two hooked together.

The manual prune script took the RUNID and examined the data directory $DATADIR/um/$RUNID to determine the first and last netCDF files currently present, using the time stamps on the files. This led to files with names like $DATADIR/um/xkztb/ . These were decoded to extract the date portion. The manual prune script then submitted the prune_workdir job to the serial queue, which did the pruning.

The pruning script first backs up the Lustre files to the RDF. If this fails, for example by running out of time, no pruning is done.

The pruning script calculates the number of days of data available and compares it with the number of days to keep. For the 4km runs, two lots of CRUNs (each 10 days) were kept.

If there is enough to prune, the end date of the prune must be calculated. This makes use of the date manipulation capabilities of the date command:

end_prune=$(date -d "$last -$daystokeep days" "+%Y%m%d")

The data is then deleted a day at a time from first day to the end_prune day.
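Putting these steps together, the pruning logic might look like the following sketch. The variable values, the archive_to_rdf helper and the per-day file pattern are assumptions for illustration, not the actual script.

```shell
#!/bin/bash
# Sketch of the pruning step: back up, work out the prune end date,
# then delete one model day at a time.
first=20150301        # earliest model day present on the Lustre disc
last=20150410         # latest model day present
daystokeep=20         # e.g. two 10-day CRUNs for the 4km model

archive_to_rdf() {    # placeholder for the real Lustre-to-RDF transfer
    true
}

# Back up first; if the transfer fails, do no pruning at all.
archive_to_rdf || exit 1

# End date of the prune: keep the final $daystokeep days on disc.
end_prune=$(date -d "$last -$daystokeep days" "+%Y%m%d")

# Delete a day at a time, from the first day up to end_prune.
day=$first
while [ "$day" -le "$end_prune" ]; do
    echo "pruning files for $day"     # e.g. rm $DATADIR/um/$RUNID/*${day}*
    day=$(date -d "$day +1 day" "+%Y%m%d")
done
```

Because the dates are in YYYYMMDD form, plain integer comparison with `-le` is enough to drive the loop.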

Discussion on Scaling

Two processes are competing: the model creates data at one rate, and the archiving process works at another. The production rate can be estimated as follows.

For the 12km model, one CRUN is completed in 30/28 days ≈ 26 hrs. This represents (30/153) × 4.6 = 0.9 TB, so the production rate is 26/0.9 ≈ 29 hrs/TB.

For the 4km model, one CRUN is completed in 10/9.2 days ≈ 26 hrs. This represents (10/153) × 52 = 3.4 TB, so the production rate is 26/3.4 ≈ 7.6 hrs/TB.

Since the archival rate from Lustre to the RDF is much faster, at 3 hrs/TB, it can easily keep up with the production of one CRUN. With two 4km models running in parallel it was only just feasible to keep up.
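The arithmetic above can be reproduced with a short shell check, with awk doing the floating-point work. Using the exact table values gives figures marginally different from the rounded intermediates quoted in the text.

```shell
#!/bin/bash
# Production rate in hours per TB, from the table values:
#   hours per CRUN = CRUN(days) / rate(model days/day) * 24
#   TB per CRUN    = CRUN(days) / 153 * total volume (TB)
rate() {
    awk -v crun="$1" -v speed="$2" -v vol="$3" \
        'BEGIN { printf "%.1f", (crun/speed*24) / (crun/153*vol) }'
}

echo "12km: $(rate 30 28 4.6) hrs/TB"   # 28.5 (~29 as quoted above)
echo " 4km: $(rate 10 9.2 52) hrs/TB"   # 7.7 (~7.6 as quoted above)
```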

Large Eddy Model

The Met Office Large Eddy Model code is now available to NERC users with a PUMA account.

Two versions are available. One runs on single processor machines, the other on ARCHER. For details see The Met Office Large Eddy Model.

Mirroring Standard Ancillary Files

The UM requires ancillary files to run. These include the land-sea mask, the orography, vegetation and ozone ancillary files, among others.

The Met Office produces and maintains a standard set of ancillary files on their supercomputers in a comprehensive collection of directories known as the ancillary tree. There are sets of ancillary files for global and limited area domains. Currently, the global domains are: n2004, n216, n216e, n320, n48e, n512, n512e, n768e, n96, n96e. The limited area domains are: e4_11001000_euro, m4_288360_uk, my_600360_nae, ukv.

CMS mirror these monthly onto ARCHER. On ARCHER, the standard ancillary tree is stored in $UMDIR/ancil, a mirror of the corresponding directories at the Met Office.

The ancillary tree grew from 3.7 TB in December 2015 to 6.5 TB by July 2019.

Details of the method used to manage mirroring of the ancillary files can be found here.

CAP9.0 installation on ARCHER

Details of the CAP9.0 installation can be found here.

CAP9.0 Installed on ARCHER

The Central Ancillary Program (CAP) version ANCIL9.0 has now been installed on ARCHER. You can find the executables and scripts at $UMDIR/CAP9.0/build/bin. The scientific and technical guides can be found on the Met Office Science Repository.

As always, report any issues with this software to the CMS help desk.