Working with Very Large PP Files

Enhancements to the UKMO library to support access to PP-format files over 2Gb in size. Released in the Met Office at TIDL v1.28.

Note: stashsplit / metasplit / PPH have problems with large PP files; see "What about PPH and stashsplit / metasplit ?" below.

Table of Contents / Questions

1. Do I need to know about this?

This section explains how you can use PP files that may exceed 2Gb (2^31 bytes) in size.

It can also affect anyone using PP files with non-standard extra information in the header, as these could become awkward to read/write alongside large files.

If you do not use any such files, these changes will not affect you.

2. What is the UKMO "largefile extension" ?

Previously, the UKMO library could only access files up to a given maximum size. This is because a PP header word is used to store the file position, which must then fit into 31 bits. This means that the filesize must be less than (2^31) bytes, which is 2147483648 (2Gb).

The largefile extension provides an enhanced operation mode, which allows the UKMO library to handle files in excess of 2Gb in size. The new limit is 2^62 bytes, which is over 10^18 bytes.

The behaviour of the routines for reading and writing PP data (such as PPA and PPW) has been extended to support this.
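The limits quoted above can be checked with a few lines of arithmetic. The sketch below is written in Python rather than IDL and is not part of the UKMO library. The helper names `split_position` and `join_position` are hypothetical, and the sketch assumes (the text does not state this) that the extension word carries the high-order 31 bits of the file position.

```python
# Illustrative arithmetic only; not code from the UKMO library.
WORD_BITS = 31                      # usable bits in one header word
OLD_LIMIT = 2 ** WORD_BITS          # 2147483648 bytes (2Gb)
NEW_LIMIT = 2 ** (2 * WORD_BITS)    # 2**62 bytes, using a second word as extension


def split_position(position):
    """Split a file position into (low word, extension word).

    Assumption: the extension word holds the high-order 31 bits.
    """
    if not 0 <= position < NEW_LIMIT:
        raise ValueError("position does not fit in 62 bits")
    return position % OLD_LIMIT, position // OLD_LIMIT


def join_position(low, extension):
    """Rebuild the full file position from the two words."""
    return extension * OLD_LIMIT + low


if __name__ == "__main__":
    print(OLD_LIMIT)                     # 2147483648
    print(NEW_LIMIT > 10 ** 18)          # True
    print(split_position(5 * 2 ** 30))   # (1073741824, 2)
```

Under this assumed layout, any position below 2Gb has an extension word of zero, which is consistent with "ordinary" files remaining readable when the chosen header word is unused.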

3. How is the extension controlled ?

The altered behaviour is controlled by a new system variable, !PP_LARGEFILE_EXTEND.

4. How do I read data from a "large" PP file?

PP_ASSOC and PPA are used in the usual way to open files and read fields, but !PP_LARGEFILE_EXTEND must be set beforehand.

5. How do I write data to a large PP file?

You can use the usual UKMO routines to write PP data to a file, after setting !PP_LARGEFILE_EXTEND.

NOTE: When the extension is enabled, the chosen extension word in each field header is zeroed whenever a PP field is written to file (in WRITEPP). This is done because the word is always checked for zero when a field is read in (the check avoids overwriting extra data stored in header words).

The actual extension behaviour is implemented in the routines PPA, SEARCH, READFDR and WRITEPP, so all I/O must be done via these routines.
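The zero-on-write and zero-check-on-read rule described in the note can be pictured with a toy model. This is a Python sketch, not the library's IDL code: the header is reduced to a plain list of integers, and `EXT_INDEX`, `prepare_for_write` and `check_on_read` are hypothetical names introduced only for illustration.

```python
# Toy model of the zero-on-write / zero-check-on-read rule; not library code.
EXT_INDEX = 0   # hypothetical position of the chosen extension word


def prepare_for_write(header_words):
    """Return a copy of the header with the extension word zeroed."""
    out = list(header_words)
    out[EXT_INDEX] = 0
    return out


def check_on_read(header_words):
    """Raise if the extension word read from file is not zero."""
    if header_words[EXT_INDEX] != 0:
        raise ValueError("extension word is non-zero: it may hold user data")
    return list(header_words)


if __name__ == "__main__":
    in_memory = [2, 7, 9]                  # extension word currently in use
    on_disk = prepare_for_write(in_memory)
    print(on_disk)                         # [0, 7, 9]
    print(check_on_read(on_disk))          # [0, 7, 9]
```

In this model a header written with the extension enabled always passes the read check, which is the reason the note gives for zeroing the word on write.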

6. How large can PP files now be ?

There is no longer any real practical limit on the size of files that can be accessed, now that the address range has been expanded from 2^31 to 2^62 bytes (which is more than 10^18).

However, the total data area size available to IDL programs is effectively limited to about 3Gbytes (at least for 32-bit versions).

7. Can I open "ordinary" PP files with the largefile extension enabled ?

Yes. This will work fine as long as the appropriate header word is unused.

Note: PPA will issue a warning when you open a "small" file (<2Gb) with largefile operation enabled. This can be disabled with /Quiet.

8. What versions of TIDL support this ?

The new extension is currently tested for the 32-bit Linux release. In future, this will be extended to 64-bit versions as well.

9. Can I change the !PP_LARGEFILE_EXTEND value to access different files ?

You should not change the value while any files are open, because the setting is used whenever reading or writing file headers.

10. How do I decide which !PP_LARGEFILE_EXTEND value to use for largefile operations ?

For "typical" PP files, !PP_LARGEFILE_EXTEND=0 is the usual choice.

This setting specifies that the LBRSVD[0] header word is used to store the pointer extension. However, you might need to access a file which uses this word to store additional user information. In this case, the extension allows various other header words to be selected instead.

In theory, the extension can use any of the following header words: LBRSVD[0:3], LBSRCE, LBUSER[0:6].

value  header-word
0      LBRSVD[0]  (most commonly used)
1      LBRSVD[1]
2      LBRSVD[2]
3      LBRSVD[3]
7      LBUSER[2]
11     LBUSER[6]

Note: 4 (LBSRCE) is not a viable choice.
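The numbering in the table follows the order in which the header words are listed (LBRSVD[0:3], then LBSRCE, then LBUSER[0:6]). The Python sketch below spells that out. It is illustrative only and not part of the UKMO library: `header_word_for` is a hypothetical helper, and the names returned for values the table does not list (5, 6, 8, 9, 10) are inferred from the ordering rather than stated in the text.

```python
# Header words in the order given in the text: LBRSVD[0:3], LBSRCE, LBUSER[0:6].
HEADER_WORDS = (
    [f"LBRSVD[{i}]" for i in range(4)]
    + ["LBSRCE"]
    + [f"LBUSER[{i}]" for i in range(7)]
)

# Values the table lists as choices (4, LBSRCE, is explicitly not viable).
LISTED_VALUES = (0, 1, 2, 3, 7, 11)


def header_word_for(value):
    """Name of the header word selected by a given setting value."""
    if not 0 <= value < len(HEADER_WORDS):
        raise ValueError(f"value {value} is out of range")
    return HEADER_WORDS[value]


if __name__ == "__main__":
    for value in LISTED_VALUES:
        print(value, header_word_for(value))
```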

Header Check

You can also check which header words are zero in a file header by calling PPA(my_filename, /Check_pp_largefile_extend). This prints out a table showing which header words are zero. For example:
WAVEON-TIDL> pp=ppa("~idl/testing/smoketest/data/one96x72.pp",/check)
file ~idl/testing/smoketest/data/one96x72.pp :
  values useable in !PP_LARGEFILE_EXTEND (1==ok) ...
  00 : 1
  01 : 1
  02 : 1
  03 : 1
  04 : 0
  05 : 0
  06 : 0
  07 : 1
  08 : 0
  09 : 0
  10 : 0
  11 : 1
(The same information is also returned in a BYTE vector. The printout can be suppressed with /Quiet)
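To turn the flags from such a check into a list of candidate settings, it is enough to pick out the positions holding 1. The sketch below does this in Python, with the flags transcribed by hand from the example printout above. It does not call the UKMO library, and `usable_values` is a hypothetical helper name.

```python
def usable_values(flags):
    """Return the setting values whose flag is 1 (header word is zero)."""
    return [value for value, ok in enumerate(flags) if ok == 1]


# Flags transcribed from the example printout above (values 00 to 11).
example_flags = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1]

if __name__ == "__main__":
    print(usable_values(example_flags))  # [0, 1, 2, 3, 7, 11]
```

For this file the usable values are exactly those listed in the table of header words, so the default choice of 0 would work.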

11. Should I enable the Largefile extensions all the time ?

You can only adopt such a "blanket" policy if you can specify a spare header word which will never be used in any PP file you want to access.

12. What are the possible new errors and warnings ?

  1. PP_ASSOC (called by PPA) issues a warning if a file <2Gb is opened with the largefile extension enabled. This is silenced by /Quiet.
  2. PP_ASSOC (called by PPA) produces an error if a large file is opened with the extension disabled.
  3. READFDR (called by SEARCH, PPA etc) produces an error on any attempt to read a PP field with a non-zero extension word.
  4. READFDR (called by SEARCH, PPA etc) will produce an error if !PP_LARGEFILE_EXTEND has an out-of-range value.
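The first two items depend only on the file size and on whether the extension is enabled, so they can be summarised as a small decision function. This is a Python sketch of the rule as described, not the PP_ASSOC implementation. `open_outcome` is a hypothetical name, and items 3 and 4 are not modelled.

```python
TWO_GB = 2 ** 31  # file sizes must be below this without the extension


def open_outcome(file_size, extension_enabled):
    """Classify an open attempt as 'ok', 'warning' or 'error'."""
    is_large = file_size >= TWO_GB
    if is_large and not extension_enabled:
        return "error"      # item 2: large file, extension disabled
    if not is_large and extension_enabled:
        return "warning"    # item 1: small file, extension enabled
    return "ok"


if __name__ == "__main__":
    print(open_outcome(10 ** 6, extension_enabled=False))   # ok
    print(open_outcome(10 ** 6, extension_enabled=True))    # warning
    print(open_outcome(3 * 2 ** 30, extension_enabled=False))  # error
```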

13. What about PPH and stashsplit / metasplit ?

These methods are not affected by the changes, but there are some related problems:
  1. stashsplit / metasplit - not yet tested with large files, and may not work.
  2. PPH files - cannot reference large files (not part of standard PP data specification).

Recommended usage: Do not use MAKEPPH if large PP files are present. It will run without reporting an error, but the PPH file it creates will be incorrect. This is because of a fundamental limitation in the PP format definition.