"Metasplit" or "stashsplit" format


"Metasplit" or "stashsplit" (the two names are equivalent) is a scheme implemented for PV-WAVE/IDL which stores PP data in small PP files with systematic names indicating their contents. The files are organised into metasplit (stashsplit) directories, typically with separate directories for different UM experiments, perhaps distinguishing by meaning period as well. The main advantages of storing the fields in metasplit directories are:

The files in metasplit directories are typically numerous and have long names, so they are inconvenient to handle by hand. Tools are provided for accessing and manipulating the contents of the directories. You are advised not to move or rename the files manually. Doing so is possible, but it requires an understanding of how metasplit works, which is not otherwise necessary.

How to access data in metasplit directories

In PV-WAVE/IDL, use ss_assoc to "associate" the directory, just as you would use pp_assoc for an ordinary PP file. This is analogous to "opening" the file on a logical unit in PV-WAVE/IDL or Fortran. E.g.
WAVE> handle=ss_assoc('aatza')
% SS_ASSOC: Associated 3120 fields from /data/hcmim1/hadsa/aatza
Subsequently you can get fields from the directory using ppa, e.g.
WAVE> field=ppa(handle,ss(atmos=24))

How to put data into metasplit directories

The following can be used to put fields into metasplit directories:

It is often convenient to separate data from a given UM job into separate directories according to meaning period. For instance, monthly data from job aatza may be stored in directory aatza.000001, seasonal data in aatza.000003, annual in aatza.000100 and decadal in aatza.001000 (the suffix is of the form YYYYMM). To have data delivered in this way by query_masscam or query_camelot, specify the option -partition. With parmah or pariah, use -H ss_partition instead of -H makepph.
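The directory suffix can be decoded mechanically. The Python sketch below is illustrative only and is not one of the metasplit tools: the function name meaning_period_from_dirname is hypothetical, and reading the six digits as four digits of years followed by two of months is an assumption inferred from the examples above.

```python
def meaning_period_from_dirname(dirname):
    """Split 'job.YYYYMM' into (job, years, months).

    Assumes a six-digit suffix: four digits of years, then two of months,
    as inferred from the examples aatza.000001 ... aatza.001000.
    """
    job, sep, suffix = dirname.rpartition(".")
    if not sep or len(suffix) != 6 or not suffix.isdigit():
        raise ValueError("no six-digit suffix in %r" % dirname)
    return job, int(suffix[:4]), int(suffix[4:])


for name in ("aatza.000001", "aatza.000003", "aatza.000100", "aatza.001000"):
    print(name, meaning_period_from_dirname(name))
```

Under this reading the four example directories hold data meaned over 1 month, 3 months, 1 year and 10 years respectively.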

It is sometimes important to be aware that each three-dimensional UM field (i.e. a field on several vertical levels) is stored in a single file by default; it is not split up into its separate levels. This can lead to problems with fetching data from archive at the Met Office. If you originally fetch just one level, say, and later decide you need all the others too, a new file will be created with these new levels, but it will overwrite the original file, so you will have lost the original data. This is particularly a problem with query_camelot, which by default fetches only the data which is not already stored on HP; you may end up repeatedly fetching alternate selections of levels. To get round this problem, you could try any of these:

The problem is also avoided by using a non-default metafunction which splits up fields by level, for instance pp_code2fn3.
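The overwriting problem, and why a level-splitting metafunction avoids it, can be seen in a toy model. This Python sketch is not the real implementation: the two naming functions and the store helper are hypothetical, the file names do not follow the actual formats of pp_ss_basename or pp_code2fn3, and a directory is modelled as a dictionary from file name to the list of levels the file holds.

```python
def name_by_field(header):
    # Hypothetical default-style name: the level is not part of it.
    return "{time}_{stash}".format(**header)


def name_by_level(header):
    # Hypothetical level-splitting name (one file per level).
    return "{time}_{stash}_lev{level}".format(**header)


def store(directory, headers, metafunction):
    """Model one fetch: every file it writes replaces any file of that name."""
    new_files = {}
    for h in headers:
        new_files.setdefault(metafunction(h), []).append(h["level"])
    directory.update(new_files)
    return directory


first = [{"time": "1990jan", "stash": 16203, "level": 850}]
later = [{"time": "1990jan", "stash": 16203, "level": lev} for lev in (500, 200)]

by_field = store(store({}, first, name_by_field), later, name_by_field)
by_level = store(store({}, first, name_by_level), later, name_by_level)
print(by_field)  # one file, holding only levels 500 and 200
print(by_level)  # three files, one per level
```

With the level left out of the name, the second fetch replaces the file written by the first and the 850 level is lost. With the level in the name, each fetch writes to its own files.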

How to manage fields in metasplit directories

The following tools are available. Except for ss_partition, they are PV-WAVE/IDL utilities:

How metasplit works

Metasplit directories contain a file named pph, which is a concatenation of the headers only of all the PP fields in the directory. The pph file is made automatically by most utilities which change the contents of the directory (but you have to remember to run makepph if you use metasplit). ss_assoc reads the pph file to find out what's in the directory, and it holds this information in memory. When you ask for a field through ppa, it first refers to the header information in memory to decide whether the field exists in the directory.

If the field exists, ppa then has to fetch it from the appropriate PP file. Metasplit works by defining a mapping between a PP field header and the name of the file in which that field is stored. The routine which performs this translation is called the metafunction of the directory. For directories created with metasplit defaults, the metafunction is pp_ss_basename, which generates filenames that depend on time, meaning period, submodel, stash code and processing code. ppa uses the metafunction to work out where to find the requested field. To save further time, the pph file also records the location within the file at which the field will be found, so ppa can go straight there and does not have to scan the file looking for the field. When fields are written to metasplit directories, the metafunction is used similarly to derive the names of the files they should be stored in.
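The lookup described above (consult the header index, derive the file name with the metafunction, then seek to the recorded position) can be sketched in Python. This is a simplified model under stated assumptions, not the real code: toy_metafunction, put_fields and get_field are hypothetical names, headers are plain dictionaries, and the index is a list in memory standing in for the pph file.

```python
import os
import tempfile


def toy_metafunction(header):
    # Hypothetical naming rule; the real default depends on time, meaning
    # period, submodel, stash code and processing code.
    return "{time}_{stash}.pp".format(**header)


def put_fields(directory, index, fields):
    """Append each (header, payload) to its file and record where it went."""
    for header, payload in fields:
        path = os.path.join(directory, toy_metafunction(header))
        with open(path, "ab") as f:
            f.seek(0, os.SEEK_END)  # make the append position explicit
            offset = f.tell()
            f.write(payload)
        index.append({"header": header, "offset": offset, "length": len(payload)})


def get_field(directory, index, wanted):
    """Return the payload for a header, or None if the index lacks it."""
    for entry in index:
        if entry["header"] == wanted:
            path = os.path.join(directory, toy_metafunction(wanted))
            with open(path, "rb") as f:
                f.seek(entry["offset"])  # go straight to the field
                return f.read(entry["length"])
    return None  # not in the index, so no file is opened


level1 = {"time": "1990jan", "stash": 24, "level": 1}
level2 = {"time": "1990jan", "stash": 24, "level": 2}
absent = {"time": "1990jan", "stash": 99, "level": 1}

with tempfile.TemporaryDirectory() as d:
    index = []
    put_fields(d, index, [(level1, b"AAAA"), (level2, b"BBBB")])
    found = get_field(d, index, level2)
    missing = get_field(d, index, absent)

print(found, missing)
```

Both levels land in the same file because the toy name ignores the level; the recorded offset is what lets the second one be read without scanning past the first.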

The metafunction for the directory must be a PV-WAVE/IDL function in your WAVE_PATH, and its name is stored in the file called metafunction in the metasplit directory. If there is no metafunction file, the directory uses the original "stashsplit" naming convention, defined by short_pp_ss_fn. This scheme is no longer used by default because it does not refer to the submodel number, introduced at version 4.1 of the UM.

The new default scheme should be adequate for UM data. It may not be adequate for data you have created yourself, however, for instance if the fields do not have stashcodes. The important thing is that the metafunction must give different names for fields which may be stored by separate operations. A number of alternatives are already available, of which the most flexible general-purpose choice is pp_code2fn3, which distinguishes files on the basis of some additional information including the PP field code. If this is not suitable, you can define any convention you like by writing a PV-WAVE/IDL function which returns a vector of file basenames given a vector of PP fields as argument, and putting the function in a directory in your WAVE_PATH.
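The requirement on a metafunction can be checked before committing to a naming convention. The Python sketch below only illustrates the idea; a real metafunction must be a PV-WAVE/IDL function. All names here are hypothetical, fields are modelled as dictionaries, and the "date" and "field_code" keys are assumptions standing in for whatever header information your fields carry. Each function takes a list of fields and returns a list of basenames, mirroring the vector-in, vector-out interface described above.

```python
def date_only(fields):
    # A convention that is too coarse: only the date enters the name.
    return [f["date"] for f in fields]


def date_and_code(fields):
    # A finer convention that also uses the field code.
    return ["{0}_fc{1:04d}".format(f["date"], f["field_code"]) for f in fields]


def names_shared_between_batches(batches, metafunction):
    """Return file names produced by more than one batch.

    Each batch stands for fields stored by one separate operation, so any
    shared name marks a file that a later operation would write over.
    """
    seen, shared = set(), set()
    for batch in batches:
        names = set(metafunction(batch))
        shared |= names & seen
        seen |= names
    return sorted(shared)


temperature = [{"date": "1990jan", "field_code": 16}]
humidity = [{"date": "1990jan", "field_code": 95}]

print(names_shared_between_batches([temperature, humidity], date_only))
print(names_shared_between_batches([temperature, humidity], date_and_code))
```

An empty result for the batches you expect to store separately suggests the convention is fine-grained enough for that workflow; it does not prove it for batches you have not tried.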

A non-default metafunction must be specified at the time the metasplit directory is created. Usually the directory is created automatically when data is first put in it (this is what metasplit does, for instance). To get round this, you can create the directory explicitly in advance, using ss_mkdir. metasplit and other routines also have keywords allowing you to specify the metafunction for a newly created directory.

To change the metafunction of a directory with data in it, you cannot just alter the metafunction file (unless you are absolutely certain that the arrangement of existing fields in files will be the same for the two metafunctions). Use ss_mv to reprocess the directory.