Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#1622 closed help (fixed)

bi_linear_h issues with GA4.0 vn8.4 N216L85 port from MONSooN ibm02 to ARCHER

Reported by: luke Owned by: um_support
Component: UM Model Keywords: GA4.0, N216
Cc: Platform: ARCHER
UM Version: 8.4

Description

I am attempting to port my xjgvb job to ARCHER as job xjgve. This is a vn8.4 GA4.0 N216L85 job which I made up on the MONSooN ibm02 and which has run stably for 10 years.

The majority of the required ancillary files were available in the /work/n02/n02/hum directory (if they were in the corresponding /projects/um1 directory). I have had to manually copy across files that were in /projects/user_ancil/N216 or /projects/ukca/inputs/ancil/N216L85. I have checked that all files seem to be correct and have not been corrupted by the copy (by checking the md5sum of each file).

The code is based on standard GA4.0 branches which already work on ARCHER in, e.g., xkolh and the UKCA release job xlavb (with the exception that UKCA has additional changes over and above the GA4.0 vn8.4 branches). There were no problems with compilation.

The job fails with an error in bi_linear_h on timestep 2:

mkdir:: File exists
Application 16685684 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 128 starting:
  bi_linear_h_@bi_linear_h.f90:516
ATP Stack walkback for Rank 128 done
Process died with signal 11: 'Segmentation fault'
Forcing core dumps of ranks 128, 0, 1
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

_pmiu_daemon(SIGCHLD): [NID 01175] [c6-0c0s5n3] [Wed Aug  5 12:37:09 2015] PE RANK 110 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01181] [c6-0c0s7n1] [Wed Aug  5 12:37:09 2015] PE RANK 168 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01179] [c6-0c0s6n3] [Wed Aug  5 12:37:09 2015] PE RANK 156 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01178] [c6-0c0s6n2] [Wed Aug  5 12:37:09 2015] PE RANK 149 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01177] [c6-0c0s6n1] [Wed Aug  5 12:37:09 2015] PE RANK 133 exit signal Killed
[NID 01175] 2015-08-05 12:37:09 Apid 16685684: initiated application termination
_pmiu_daemon(SIGCHLD): [NID 01171] [c6-0c0s4n3] [Wed Aug  5 12:37:09 2015] PE RANK 96 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01170] [c6-0c0s4n2] [Wed Aug  5 12:37:09 2015] PE RANK 84 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01168] [c6-0c0s4n0] [Wed Aug  5 12:37:09 2015] PE RANK 78 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01167] [c6-0c0s3n3] [Wed Aug  5 12:37:09 2015] PE RANK 68 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01160] [c6-0c0s2n0] [Wed Aug  5 12:37:09 2015] PE RANK 49 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01159] [c6-0c0s1n3] [Wed Aug  5 12:37:09 2015] PE RANK 36 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01158] [c6-0c0s1n2] [Wed Aug  5 12:37:09 2015] PE RANK 32 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01157] [c6-0c0s1n1] [Wed Aug  5 12:37:09 2015] PE RANK 17 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01156] [c6-0c0s1n0] [Wed Aug  5 12:37:09 2015] PE RANK 4 exit signal Killed
xjgve: Run failed
*****************************************************************
   Ending script   :   qsatmos
   Completion code :   137
   Completion time :   Wed Aug  5 12:37:12 BST 2015
*****************************************************************

I have tried running with the debug compile options, and in this case the error is:

mkdir:: File exists
Application 16679961 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 40 starting:
  _fini@0x8e1c55f
  _new_slave_entry@0x274486b
  interpolation__cray$mt$p0008@interpolation.f90:1005
ATP Stack walkback for Rank 40 done
Process died with signal 11: 'Segmentation fault'
Forcing core dumps of ranks 40, 0, 32, 112
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

_pmiu_daemon(SIGCHLD): [NID 00054] [c0-0c0s13n2] [Tue Aug  4 16:21:59 2015] PE RANK 65 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00064] [c0-0c1s0n0] [Tue Aug  4 16:21:59 2015] PE RANK 168 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00061] [c0-0c0s15n1] [Tue Aug  4 16:21:59 2015] PE RANK 144 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00058] [c0-0c0s14n2] [Tue Aug  4 16:21:59 2015] PE RANK 113 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00056] [c0-0c0s14n0] [Tue Aug  4 16:21:59 2015] PE RANK 84 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00055] [c0-0c0s13n3] [Tue Aug  4 16:21:59 2015] PE RANK 72 exit signal Killed
[NID 00054] 2015-08-04 16:22:03 Apid 16679961: initiated application termination
_pmiu_daemon(SIGCHLD): [NID 00053] [c0-0c0s13n1] [Tue Aug  4 16:21:59 2015] PE RANK 48 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00052] [c0-0c0s13n0] [Tue Aug  4 16:21:59 2015] PE RANK 37 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00038] [c0-0c0s9n2] [Tue Aug  4 16:21:59 2015] PE RANK 32 exit signal Quit
_pmiu_daemon(SIGCHLD): [NID 00036] [c0-0c0s9n0] [Tue Aug  4 16:21:59 2015] PE RANK 12 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00035] [c0-0c0s8n3] [Tue Aug  4 16:21:59 2015] PE RANK 1 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00057] [c0-0c0s14n1] [Tue Aug  4 16:21:59 2015] PE RANK 96 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00059] [c0-0c0s14n3] [Tue Aug  4 16:21:59 2015] PE RANK 120 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00062] [c0-0c0s15n2] [Tue Aug  4 16:21:59 2015] PE RANK 156 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00060] [c0-0c0s15n0] [Tue Aug  4 16:21:59 2015] PE RANK 132 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00126] [c0-0c1s15n2] [Tue Aug  4 16:21:59 2015] PE RANK 188 exit signal Killed
xjgve: Run failed
*****************************************************************
   Ending script   :   qsatmos
   Completion code :   137
   Completion time :   Tue Aug  4 16:22:06 BST 2015
*****************************************************************

although the code seems to get to timestep 12 rather than timestep 2 as previously.

As I've seen from other tickets, bi_linear_h seems to be a catch-all for problems, often associated with reconfiguration or with ancillary files. As far as I can tell, all the ancillary files are fine, and this error occurs in the main UM model and not reconfiguration (indeed, it still occurs in the same place if I don't reconfigure, and instead just continue, as an NRUN, from a dump produced on MONSooN).

Any advice as to how to proceed would be greatly appreciated.

Many thanks,
Luke

Change History (6)

comment:1 Changed 4 years ago by willie

Hi Luke,

You are getting NaNs in timestep 2. This suggests errors in the initial data. You can check for NaNs in the dumps and ancillaries by cumf'ing them with themselves.

Regards

Willie

comment:2 Changed 4 years ago by luke

Hi Willie,

Thanks for this. Despite the files having the same md5sums as on the ibm02, the start-dumps fail this cumf test:

SURGEOU1 /work/n02/n02/luke/xjgve/xjgve.surgeou1 does not exist
SURGEOUT /work/n02/n02/luke/xjgve/xjgve.surgeout does not exist
PPSMC /work/n02/n02/luke/xjgve/xjgve.ppsmc does not exist
WFOUT /work/n02/n02/luke/xjgve/xjgve.wfout does not exist
UARSOUT1 /work/n02/n02/luke/xjgve/xjgve.uarsout1 does not exist
UARSOUT2 /work/n02/n02/luke/xjgve/xjgve.uarsout2 does not exist
ICEFOUT /work/n02/n02/luke/xjgve/xjgve.icefout does not exist
MOSOUT /work/n02/n02/luke/xjgve/xjgve.mosout does not exist
PPSCREEN /work/n02/n02/luke/xjgve/xjgve.ppscreen does not exist
SSTOUT /work/n02/n02/luke/xjgve/xjgve.sstout does not exist
SICEOUT /work/n02/n02/luke/xjgve/xjgve.siceout does not exist
CURNTOUT /work/n02/n02/luke/xjgve/xjgve.curntout does not exist
FLXCROUT /work/n02/n02/luke/xjgve/xjgve.flxcrout does not exist
ATMANL /work/n02/n02/luke/xjgve/xjgve.atmanl does not exist
OCNANL /work/n02/n02/luke/xjgve/xjgve.ocnanl does not exist
ALABCOU1 /work/n02/n02/luke/xjgve/xjgve.alabcou1 does not exist
ALABCOU2 /work/n02/n02/luke/xjgve/xjgve.alabcou2 does not exist
ALABCOU3 /work/n02/n02/luke/xjgve/xjgve.alabcou3 does not exist
ALABCOU4 /work/n02/n02/luke/xjgve/xjgve.alabcou4 does not exist
ALABCOU5 /work/n02/n02/luke/xjgve/xjgve.alabcou5 does not exist
ALABCOU6 /work/n02/n02/luke/xjgve/xjgve.alabcou6 does not exist
ALABCOU7 /work/n02/n02/luke/xjgve/xjgve.alabcou7 does not exist
ALABCOU8 /work/n02/n02/luke/xjgve/xjgve.alabcou8 does not exist
FOAMOUT1 /work/n02/n02/luke/xjgve/xjgve.foamout1 does not exist
FOAMOUT2 /work/n02/n02/luke/xjgve/xjgve.foamout2 does not exist
CXBKGERR /work/n02/n02/luke/xjgve/xjgve.cxbkgerr does not exist
RFMOUT /work/n02/n02/luke/xjgve/xjgve.rfm does not exist
PPVAR /work/n02/n02/luke/xjgve/xjgve.ppvar does not exist
PP0 /work/n02/n02/luke/xjgve/xjgve.pp0 does not exist
PP1 /work/n02/n02/luke/xjgve/xjgve.pp1 does not exist
PP2 /work/n02/n02/luke/xjgve/xjgve.pp2 does not exist
PP3 /work/n02/n02/luke/xjgve/xjgve.pp3 does not exist
PP4 /work/n02/n02/luke/xjgve/xjgve.pp4 does not exist
PP5 /work/n02/n02/luke/xjgve/xjgve.pp5 does not exist
PP6 /work/n02/n02/luke/xjgve/xjgve.pp6 does not exist
PP7 /work/n02/n02/luke/xjgve/xjgve.pp7 does not exist
PP8 /work/n02/n02/luke/xjgve/xjgve.pp8 does not exist
PP9 /work/n02/n02/luke/xjgve/xjgve.pp9 does not exist
PP10 /work/n02/n02/luke/xjgve/xjgve.pp10 does not exist
WLABCOU1 /work/n02/n02/luke/xjgve/xjgve.wlabcou1 does not exist
WLABCOU2 /work/n02/n02/luke/xjgve/xjgve.wlabcou2 does not exist
WLABCOU3 /work/n02/n02/luke/xjgve/xjgve.wlabcou3 does not exist
WLABCOU4 /work/n02/n02/luke/xjgve/xjgve.wlabcou4 does not exist
PPMBC /work/n02/n02/luke/xjgve/xjgve.ppmbc does not exist
ASTART a147ae3a2cc96969e811f694a1a65741 /work/n02/n02/luke/xjgve/xjgve.astart files DO NOT compare
AINITIAL 7ca620a1cdb97d71fd0d784c70a3faf6 /work/n02/n02/ukca/ANCILS/ASTART/xjgvaa.da19991201_00 files DO NOT compare
tail: cannot open `/work/n02/n02/luke/tmp/tmp.eslogin008.32186/cumf_summ.luke.d15218.t140959.8424' for reading: No such file or directory
VERT_LEV 5a7abca89b99f5474e58cf891907258d /work/n02/n02/hum/vn8.4/ctldata/vert/L85_20m_85km_15_6km_qs_o1
ALABCIN1 /work/n02/n02/luke/xjgve/alabcin does not exist
ALABCIN2 is unset
TRANSP is unset
PERTURB is unset
tail: cannot open `/work/n02/n02/luke/tmp/tmp.eslogin008.32186/cumf_summ.luke.d15218.t141001.8481' for reading: No such file or directory
SWSPECTD 835f8daa16569e6672b2ee224a9461c0 /work/n02/n02/hum/vn8.4/ctldata/spectral/ga3_0/spec_sw_ga3_0
tail: cannot open `/work/n02/n02/luke/tmp/tmp.eslogin008.32186/cumf_summ.luke.d15218.t141001.8503' for reading: No such file or directory
LWSPECTD 7be0e6e468551fb4c7bb114ccafe4ca5 /work/n02/n02/hum/vn8.4/ctldata/spectral/ga3_0/spec_lw_ga3_0
UKCAPREC is unset
UKCAACSW is unset
UKCAACLW is unset
UKCACRSW is unset
UKCACRLW is unset
RPSEED is unset
OZONE 681eb182b862e13aed170dbf54a6de64 /work/n02/n02/hum/ancil/atmos/n216/ozone/sparc/1994-2005_360/qrclim.ozone_L85_O85 files compare, ignoring Fixed Length Header
SMCSNOWD fc8c53969592fe03880ddf661126e1f0 /work/n02/n02/ukca/ANCILS/N216/orca025/smc_snow/v1//ajthma.smc_scaled files compare, ignoring Fixed Length Header
DSOILTMP ddf2faff2dea59bd7af62ac4147aaca3 /work/n02/n02/hum/ancil/atmos/n216/orca025/soil_temp/amip/v1/qrclim.slt files compare, ignoring Fixed Length Header
SOILTYPE 25946c40934da7909847bf11e8f3e77c /work/n02/n02/hum/ancil/atmos/n216/orca025/soil_parameters/hwsd_vg/v1/qrparm.soil files compare, ignoring Fixed Length Header
GENLAND is unset
SSTIN 6e1e27798dac383edbbc54cf90f33376 /work/n02/n02/ukca/ANCILS/N216L85/qrclim.sst files compare, ignoring Fixed Length Header
SICEIN dcc6d7409f93b9259fc60ad6dc448e79 /work/n02/n02/ukca/ANCILS/N216L85/qrclim.seaice files compare, ignoring Fixed Length Header
CURNTIN is unset
MASK b0b9d254464dbcfe8bbdf93a47a22116 /work/n02/n02/hum/ancil/atmos/n216/orca025/land_sea_mask/etop01/v0/qrparm.mask files compare, ignoring Fixed Length Header
OROG e4b0288d9f0bd71a8ead5032da50bb53 /work/n02/n02/hum/ancil/atmos/n216/orca025/orography/globe30/v0/qrparm.orog files compare, ignoring Fixed Length Header
SULPEMIS c212fd992f6193bd7e5c9f4a4ccee85f /work/n02/n02/ukca/ANCILS/N216L85/qrclim.sulpsurf files compare, ignoring Fixed Length Header
MURKFILE is unset
USRANCIL is unset
USRMULTI is unset
SO2NATEM 30c9514ef425962f8e294adfaccef4a0 /work/n02/n02/hum/ancil/atmos/n216/classic_aerosol/andres_kasgnoc/v0/qrclim.sulpvolc85 files compare, ignoring Fixed Length Header
CHEMOXID 5846838e21a04b558ee8e8a707513811 /work/n02/n02/hum/ancil/atmos/n216/classic_aerosol/stochem/v0/qrclim.sulpoxid85 files compare, ignoring Fixed Length Header
FRACINIT 1cee59b02a37d0a325d84c19b7d5b50b /work/n02/n02/hum/ancil/atmos/n216/orca025/vegetation/fractions_igbp/v1/qrparm.veg.frac files compare, ignoring Fixed Length Header
VEGINIT 0aaace28a98eb8d36844878c9bf10f27 /work/n02/n02/hum/ancil/atmos/n216/orca025/vegetation/func_type_modis/v1/qrparm.veg.func files compare, ignoring Fixed Length Header
DISTURB is unset
SOOTEMIS acea7f4187a05d2b163ffc98e94a52a3 /work/n02/n02/ukca/ANCILS/N216L85/qrclim.soot files compare, ignoring Fixed Length Header
CO2EMITS is unset
LANDFRAC 11808b3486eeef1914a78eedacd59ec7 /work/n02/n02/hum/ancil/atmos/n216/orca025/land_sea_mask/etop01/v0/qrparm.landfrac files compare, ignoring Fixed Length Header
DUSTSOIL 273dc9099e6de4fb7582f646acd89af3 /work/n02/n02/hum/ancil/atmos/n216/orca025/soil_dust/hwsd/v1/qrparm.soil.dust files compare, ignoring Fixed Length Header
BIOMASS 0568a504a4e0dedb04995ef543c030a5 /work/n02/n02/ukca/ANCILS/N216L85/qrclim.biom files compare, ignoring Fixed Length Header
DMSCONC dbf8d7583e5a5c41ab52f92dae06ad55 /work/n02/n02/hum/ancil/atmos/n216/orca025/classic_aerosol/kettle/v0/qrclim.sulpdms files compare, ignoring Fixed Length Header
RIVSTOR ba5bbe0fba13d4817d056b8c23909044 /work/n02/n02/hum/ancil/atmos/n216/orca025/rivers_trip/storage/airxr_sep/v0/qrclim.rivstor files compare, ignoring Fixed Length Header
RIVCHAN 0864cde750a84679e10c7085becb3620 /work/n02/n02/hum/ancil/atmos/n216/orca025/rivers_trip/sequence/etopo5/v0/qrparm.rivseq files compare, ignoring Fixed Length Header
RIVER2A is unset
SURFEMIS is unset
AIRCREMS is unset
STRATEMS is unset
EXTRAEMS is unset
ARCLBIOG 5b62c813722502755e0caa5dc6896270 /work/n02/n02/hum/ancil/atmos/n216/aerosol_clims/stochem/biogenic/v0/qrclim.biog85 files compare, ignoring Fixed Length Header
ARCLBIOM is unset
ARCLBLCK is unset
ARCLSSLT is unset
ARCLSULP is unset
ARCLDUST /qrclim.dust85 does not exist
ARCLOCFF is unset
ARCLDLTA is unset
CARIOLO3 is unset
OCFFEMIS e7d39090dbd077fe9f7326b60e149788 /work/n02/n02/ukca/ANCILS/N216L85/qrclim.ocff files compare, ignoring Fixed Length Header
TOPMEAN ae2d0c1b047387731aa5a6c180700f2e /work/n02/n02/ukca/ANCILS/N216/orca025/hydrol_lsh/hydro1k/v1/qrparm.hydtopmn files compare, ignoring Fixed Length Header
TOPSTDEV c13b053f569a028e0ad1e1216e1cd4a4 /work/n02/n02/ukca/ANCILS/N216/orca025/hydrol_lsh/hydro1k/v1/qrparm.hydtopsd files compare, ignoring Fixed Length Header
IDEALISE is unset
ICFILE is unset

(script to produce the above is at the bottom of this comment)

All the ancillary files seem to compare, apart from the fixed length header - would this be causing the problem?

More worrying is the start dump - this has two errors:

$ /work/n02/n02/hum/vn8.4/cce/utils/cumf /work/n02/n02/luke/xjgve/xjgve.astart /work/n02/n02/luke/xjgve/xjgve.astart
CUMF successful
Summary in:                        /work/n02/n02/luke/tmp/tmp.eslogin008.32186/cumf_summ.luke.d15218.t142051.19299
Full output in:                    /work/n02/n02/luke/tmp/tmp.eslogin008.32186/cumf_full.luke.d15218.t142051.19299
Difference maps (if available) in: /work/n02/n02/luke/tmp/tmp.eslogin008.32186/cumf_diff.luke.d15218.t142051.19299


  COMPARE - SUMMARY MODE
 -----------------------
  
Number of fields in file 1 =  4315
Number of fields in file 2 =  4315
Number of fields compared  =  4315
  
FIXED LENGTH HEADER:        Number of differences =       0
INTEGER HEADER:             Number of differences =       0
REAL HEADER:                Number of differences =       0
LEVEL DEPENDENT CONSTANTS:  Number of differences =       0
LOOKUP:                     Number of differences =       0
DATA FIELDS:                Number of fields with differences =       2

Field  3675 : Stash Code   278 : MEAN WATER TABLE DEPTH            M  : Number of differences =       31

Field  3678 : Stash Code   281 : SATURATION FRAC IN DEEP LAYER        : Number of differences =       32
 files DO NOT compare
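
When many files are checked this way, it can help to reduce each cumf summary to just the STASH codes that differ. The following is a minimal sketch, assuming the summary lines have the same layout as the two "Field" lines shown above; the function name `list_bad_fields` and the variable names are made up for illustration and are not part of cumf.

```shell
#!/bin/sh
# Sample lines copied from the cumf summary above.
summary='Field  3675 : Stash Code   278 : MEAN WATER TABLE DEPTH            M  : Number of differences =       31
Field  3678 : Stash Code   281 : SATURATION FRAC IN DEEP LAYER        : Number of differences =       32'

list_bad_fields() {
    # Read summary text on stdin; print "stash_code number_of_differences"
    # for every per-field line.
    awk -F':' '/^ *Field/ && /Number of differences/ {
        split($2, code, " ")     # " Stash Code   278 " -> code[3] is 278
        split($NF, ndiff, "=")   # " Number of differences =   31" -> ndiff[2]
        print code[3], ndiff[2] + 0
    }'
}

bad=$(printf '%s\n' "$summary" | list_bad_fields)
echo "$bad"
```

For a real run, the summary file named on the "Summary in:" line would be piped into the function instead of the sample text.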

Looking at the difference maps I can't see what is wrong (none of the #,X,O,o,: symbols appear), and there is lots of missing data (~). Tracing back to my original job, this was initialised from /projects/user_ancil/N216/start_dumps/v1/ajthma.dai1910 on MONSooN, which shows similar problems:

Field  3684 : Stash Code   278 : MEAN WATER TABLE DEPTH            M  : Number of differences =       32

Field  3687 : Stash Code   281 : SATURATION FRAC IN DEEP LAYER        : Number of differences =       32

(although I note that the number of differences for the mean water table depth has gone from 32 to 31 by the time it gets to my job)

So while these values don't cause a problem on the IBM with the xlf compiler, they do seem to cause one with the Cray cce compiler.

Do you have a suggestion as to what I can do - could I extract these and make up an ancillary file of the fields, removing the offending points? However, I'm not sure what points these are, unless they are somewhere in the missing data fields.

Any further suggestions would be most welcome.

Many thanks,
Luke

#!/bin/bash

. /work/n02/n02/ukca/ANCILS/N216/ancil_versions/ga4_amip/v0/hg4_ga4.0_ancilvns_std
. /work/n02/n02/hum/ancil/data/ancil_versions/filenames/v4/ancils
## Copy SCRIPT and INITFILEENV from the $HOME/umui_jobs/RUNID-XXXXXXXX directory
## UM_DATAW needs to be added manually, so don't grep every time, just when
## things have changed. Need to set this to
## export UM_DATAW=$DATADIR/$RUNID
## after RUNID has been set
#grep export ./SCRIPT > ./SCRIPT.export
. ./SCRIPT.export
. ./INITFILEENV

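# Loop over every file variable named in INITFILEENV (field 2 of each line is NAME=value)
for i in $(awk '{print $2}' INITFILEENV | grep '=' | awk -F= '{print $1}'); do
    j=$(printenv "$i")
    if [[ -z $j ]]; then
        echo "$i is unset"
    elif [[ -e $j ]]; then
        k=$(md5sum "$j" 2>&1)
        # cumf the file against itself and take the summary path from the "Summary in:" line
        l=$(/work/n02/n02/hum/vn8.4/cce/utils/cumf "$j" "$j" | awk '/cumf_summ/ {print $NF}')
        # The last line of the summary is the verdict; tail reports an error if no summary was written
        m=$(tail -1 "$l")
        echo "$i" $k $m
    else
        echo "$i $j does not exist"
    fi
done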

comment:3 Changed 4 years ago by luke

Extracting these fields to netCDF and using ncdump to look at the output does give NaNs:

    _, _, _, _, _, 3.316906, NaNf, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, NaNf, 2.542373, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, NaNf, 5.110023, _, _, 5.446589, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, 3.205886, _, _, 3.205188, NaNf, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, NaNf, _, 4.209393, _, _, 3.626202, 
    3.133062, NaNf, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, 4.577909, NaNf, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, 6, _, _, 6, 5.98941, NaNf, NaNf, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, NaNf, 3.566015, 4.9749, 3.572725, 
    1.383652, 1.059796, _, _, NaNf, 0.5906482, 0.7306296, 2.145681, 1.651425, 
    5.087489, _, 5.244181, 3.88115, _, _, 3.837777, 5.193683, NaNf, 3.694045, 
    _, NaNf, NaNf, NaNf, NaNf, NaNf, NaNf, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 
    0.0008982564, 0.07291323, 0.0004118087, 0.0004097285, NaNf, _, _, _, _, 
    NaNf, 0.4305551, 0.4187672, 0.3631847, 0.260983, _, _, _, _, _, _, _, _, 
    _, _, _, 0.7088229, 0.6449063, 0.6602724, NaNf, _, _, _, _, 0.8476765, 
    0.2600253, _, _, _, _, _, NaNf, NaNf, NaNf, NaNf, NaNf, 0.2055971, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, NaNf, 0.8236123, 0.8114492, 
    _, _, _, _, _, _, _, _, _, _, NaNf, NaNf, NaNf, 6, 6, 6, 6, 6, NaNf, 6, 
    0.9989108, NaNf, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, NaNf, 1, _, _, _, _, _, _, _, 
    0.918403, 0.9177406, _, _, _, _, _, _, _, _, NaNf, 0.9188071, _, _, 
    0.9897643, NaNf, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, NaNf, _, 0.9288087, _, _, 0.9818445, 
    0.9965467, NaNf, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, 0.9223726, NaNf, _, _, _, _, _, _, 
    0.9473266, _, _, 0.9495715, 0.9496225, NaNf, NaNf, _, _, _, _, _, _, _, 
    _, _, _, _, _, _, _, NaNf, 0.9205665, 0.6712224, 0.8837127, 0.830146, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, _, _, NaNf, 1, 1, 1, 0.9999999, 1, 
    0.7795507, _, 0.8321834, 0.8566769, _, _, 0.7887475, 0.668846, NaNf, 
    _, _, _, _, _, _, _, NaNf, NaNf, NaNf, NaNf, NaNf, NaNf, 1, 1, 1, 1, 1, 
    1, 1, NaNf, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, 
    1, 1, 1, 1, 1, 1, NaNf, 1, 1, 1, 1, _, _, _, _, _, _, _, _, _, _, _, _, 
    1, 1, 1, 1, 1, _, _, _, 1, 1, 1, NaNf, _, _, _, _, 1, 1, 1, _, _, _, _, 
    1, 1, 1, 1, _, _, _, _, _, NaNf, NaNf, NaNf, NaNf, NaNf, 1, 1, 1, 1, 1, 
    _, _, _, _, _, _, NaNf, 1, 1, 1, 1, 1, 1, _, _, _, _, 1, 1, 1, 1, 1, 1, 
    _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, _, NaNf, NaNf, NaNf, NaNf, 
    1, 1, 1, 1, NaNf, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, _, 

I have corrected these by using cdo to set the valid range and the missing data value, ncdump to write the netCDF file out as text, and then sed to replace the NaNs with the missing data value:

cdo setvrange,0.0,6.0 N216_problem_fields.nc N216_problem_fields_range.nc
cdo setmissval,-1.0e+10 N216_problem_fields_range.nc N216_problem_fields_range_newmiss.nc
/opt/cray/netcdf/default/bin/ncdump N216_problem_fields_range_newmiss.nc > N216_problem_fields_range_newmiss.cdl
sed -e 's/NaNf/-1.0e+10f/g' -e 's/NaN/-1.0e+10/g' N216_problem_fields_range_newmiss.cdl | /opt/cray/netcdf/default/bin/ncgen -o N216_problem_fields_range_newmiss_fixed.nc
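
To check that the substitution catches every NaN, the token count can be compared before and after. This is only a sketch on a made-up CDL fragment in the style of the ncdump listing above; the variable names and the helper `count_nan` are illustrative, and the sed expressions are the same as in the command above.

```shell
#!/bin/sh
# Hypothetical CDL fragment, in the style of the ncdump output above.
cdl='    _, _, 3.316906, NaNf, _,
    NaNf, 2.542373, _, NaNf ;'

# Same substitutions as in the pipeline above.
fixed=$(printf '%s\n' "$cdl" | sed -e 's/NaNf/-1.0e+10f/g' -e 's/NaN/-1.0e+10/g')

count_nan() {
    # gsub returns the number of matches on each line; replacing with "&"
    # leaves the text unchanged, so this only counts.
    awk '{ n += gsub(/NaN/, "&") } END { print n + 0 }'
}

before=$(printf '%s\n' "$cdl" | count_nan)
after=$(printf '%s\n' "$fixed" | count_nan)
echo "NaN tokens before: $before, after: $after"
```

Note that a plain `NaN` pattern would also match the string inside a variable or attribute name, so it is worth checking the header of the .cdl file before running the substitution on real data.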

I then used Xancil to make up these two fields into an ancillary file (.job file is /work/n02/n02/luke/temp/N216_problem_fields.job). This file now passes the cumf test:

  COMPARE - SUMMARY MODE
 -----------------------
  
Number of fields in file 1 =     2
Number of fields in file 2 =     2
Number of fields compared  =     2
  
FIXED LENGTH HEADER:        Number of differences =       0
INTEGER HEADER:             Number of differences =       0
REAL HEADER:                Number of differences =       0
LOOKUP:                     Number of differences =       0
DATA FIELDS:                Number of fields with differences =       0
 files compare, ignoring Fixed Length Header

I have made up a user pre-STASHmaster file to add these on the reconfiguration step:

H1| SUBMODEL_NUMBER=1
H2| SUBMODEL_NAME=ATMOS
H3| UM_VERSION=8.4
#
#|Model |Sectn | Item |Name                                |
#|Space |Point | Time | Grid |LevelT|LevelF|LevelL|PseudT|PseudF|PseudL|LevCom|
#| Option Codes                   | Version Mask         | Halo |
#|DataT |DumpP | PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8  PC9  PCA |
#|Rotate| PPF  | USER | LBVC | BLEV | TLEV |RBLEVV| CFLL | CFFF |
#
1|    1 |    0 |  278 |MEAN WATER TABLE DEPTH            M |
2|    3 |    0 |    1 |   21 |    5 |   -1 |   -1 |    0 |    0 |    0 |    0 |
3| 000000000000000000000000000500 | 00000000000000000001 |    3 |
4|    1 |  122 | -99  -99  -99  -99  -99  -99  -99  -99  -99  -99 |
5|    0 |  900 |    0 |  129 |    0 |    0 |    0 | 9999 |   18 |
#
1|    1 |    0 |  281 |SATURATION FRAC IN DEEP LAYER       |
2|    3 |    0 |    1 |   21 |    5 |   -1 |   -1 |    0 |    0 |    0 |    0 |
3| 000000000000000000000000000500 | 00000000000000000001 |    3 |
4|    1 |  122 | -99  -99  -99  -99  -99  -99  -99  -99  -99  -99 |
5|    0 |  900 |    0 |  129 |    0 |    0 |    0 | 9999 |   18 |
#
1|   -1 |   -1 |   -1 |END OF FILE MARK                    |
2|    0 |    0 |    0 |    0 |    0 |    0 |    0 |    0 |    0 |    0 |    0 |
3| 000000000000000000000000000000 | 00000000000000000000 |    0 |
4|    0 |    0 | -99  -99  -99  -99  -30  -99  -99  -99  -99  -99 |
5|    0 |    0 |    0 |    0 |    0 |    0 |    0 |    0 |    0 |
#

and then I edited the UMUI panel to use this new ancillary file for reconfiguration. The reconfiguration step is currently queuing - I'll report back on whether the new .astart file still has a problem.

Thanks,
Luke

Last edited 4 years ago by luke

comment:4 Changed 4 years ago by luke

This new .astart file now passes the cumf test:

/work/n02/n02/hum/vn8.4/cce/utils/cumf xjgve.astart xjgve.astart


  COMPARE - SUMMARY MODE
 -----------------------
  
Number of fields in file 1 =  4315
Number of fields in file 2 =  4315
Number of fields compared  =  4315
  
FIXED LENGTH HEADER:        Number of differences =       0
INTEGER HEADER:             Number of differences =       0
REAL HEADER:                Number of differences =       0
LEVEL DEPENDENT CONSTANTS:  Number of differences =       0
LOOKUP:                     Number of differences =       0
DATA FIELDS:                Number of fields with differences =       0
 files compare, ignoring Fixed Length Header

I've set the job off to run - I'll report back on whether this seems to have solved the problem.

Thanks,
Luke

comment:5 Changed 4 years ago by luke

  • Resolution set to fixed
  • Status changed from new to closed

That seems to have got it! Job is now at timestep 108 and counting.

Many thanks for your help - I will close this ticket now.

Thanks,
Luke

comment:6 Changed 4 years ago by luke

To add, setting these values to NaN seems to be incorrect. It is better to set them to missing data and then use Xconv to extrapolate over the missing data. Separately, I set the NaNs to 1.0 and used this temporary field to make a mask (of 1.0 and missing data), by which I then multiplied the extrapolated field. This reintroduces the correct missing data points. It is this extrapolated and masked field which needs to be made into an ancillary file.
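
As a rough illustration of the masking step described above, here is the same logic applied to a single row of numbers. It is a sketch only: the values are made up, `_` stands for missing data as in the ncdump listing, and `apply_mask` is a hypothetical helper that mimics multiplying by a mask of 1.0 and missing data. It is not a replacement for the cdo commands.

```shell
#!/bin/sh
# Made-up example row; "_" marks missing data.
extrap="3.3 2.9 2.5 4.1 5.0"   # field after extrapolation, no gaps left
mask="_ 1 1 _ 1"               # 1 where the original had data or NaN, "_" elsewhere

apply_mask() {
    # $1: space-separated field values, $2: space-separated mask (1 or _).
    # A missing mask point gives a missing result; otherwise multiply.
    awk -v vals="$1" -v mask="$2" 'BEGIN {
        n = split(vals, v, " "); split(mask, m, " ")
        for (i = 1; i <= n; i++)
            printf "%s%s", (i > 1 ? " " : ""), (m[i] == "_" ? "_" : v[i] * m[i])
        print ""
    }'
}

masked=$(apply_mask "$extrap" "$mask")
echo "$masked"
```

The points that were NaN in the original keep their extrapolated values, while points that were missing in the original become missing again.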

/opt/cray/netcdf/default/bin/ncdump N216_problem_fields.nc | sed -e 's/NaNf/1.0f/g' -e 's/NaN/1.0/g' | /opt/cray/netcdf/default/bin/ncgen -o N216_problem_fields_setTo1.nc

cdo ltc,10.0 N216_problem_fields_setTo1.nc N216_problem_fields_setTo1_mask.nc

<use xconv to extrapolate over missing data to make N216_problem_fields_range_newmiss_fixed_extrap.nc>

cdo mul N216_problem_fields_range_newmiss_fixed_extrap.nc N216_problem_fields_setTo1_mask.nc N216_problem_fields_range_newmiss_fixed_extrap_masked.nc

<use Xancil to make up ancillary file>