UKESM1-AMIP Nudging suite failed with NaNs in BICGstab

I am running a nudged UKESM1-AMIP suite (u-bk347). The suite starts from 2014 and the total run length is 5 years. It ran fine for the first year, but failed at the beginning of 2015 with the following error:

???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 1
?  Error from routine: EG_BICGSTAB
?  Error message: NaNs in error term in BiCGstab after      1 iterations
?        This is a common point for the model to fail if it
?        has ingested or developed NaNs or infinities
?        elsewhere in the code.
?        See the following URL for more information:
?        https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
?  Error from processor: 287
?  Error number: 88

[287] exceptions: An non-exception application exit occured.
[287] exceptions: whilst in a serial region
[287] exceptions: Task had pid=31128 on host nid04824
[287] exceptions: Program is "/home/d03/yaweiqu/cylc-run/u-bk347/share/fcm_make_um/build-atmos/bin/um-atmos.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
[NID 04675] 2019-07-09 23:45:14 Apid 70099863: initiated application termination
[FAIL] um-atmos # return-code=137
2019-07-09T23:45:25Z CRITICAL - failed/EXIT

I found that the time series emission files used in nudging (with the Gregorian calendar) may not include data after 2015. The VolcanicAod data ends on 9-Jun-2014 and the CH4_biomass data ends on 1-Jan-2016.



Could the BICGstab error be linked to the files above? If so, are there any other emission files that can be used for a nudged simulation from 2015 onwards, or are there other causes of this error?

Thank you in advance.


comment:1 Changed 12 months ago by willie

Hi Yawei,

The job fails in cycle 20150101 and the start dump there is free from NaNs. But the atmos_main job has only done two time steps, so it is still within the range of the volcanic aerosol data, which, at 1982 months, runs from 1850 to two months into 2015. I am not sure what you can do about this.
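The coverage arithmetic above can be checked with a short sketch. It assumes the 1982 records are monthly and that the first one is January 1850 (the start month is not stated in the header shown in this ticket); the function names are illustrative only.

```python
from datetime import date


def last_month_covered(start_year, start_month, n_months):
    """Return (year, month) of the final record in a monthly series."""
    # Zero-based month index of the last record, counted from January of start_year.
    index = (start_month - 1) + (n_months - 1)
    return start_year + index // 12, index % 12 + 1


def is_covered(start_year, start_month, n_months, when):
    """True if the month containing `when` lies inside the series."""
    first = (start_year, start_month)
    last = last_month_covered(start_year, start_month, n_months)
    return first <= (when.year, when.month) <= last


# Assumption: 1982 monthly records beginning in January 1850.
print(last_month_covered(1850, 1, 1982))
print(is_covered(1850, 1, 1982, date(2015, 1, 1)))
```

Under those assumptions the last record falls in February 2015, so the 20150101 cycle is inside the series, while a run continuing much further into 2015 would go beyond the last record.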

If you run ncdump -h on your volcanic aerosol files, you will see:

source_data_filepath = "/group_workspaces/jasmin2/tids/CMIP6_ANCIL/users/bjohnson/CMIP6_volc_forcing_source_data/CMIP_UMUKCA_radiation_v3.nc" ;
tropopause_climatology = "/group_workspaces/jasmin2/tids/CMIP6_ANCIL/users/bjohnson/CMIP6_volc_forcing_source_data/trop_ht.pp" ;
Conventions = "CF-1.5" ;
source = "CMIP6 volcanic aerosol optical properties climatology supplied by Luo Beiping, and downloaded from ETHZ (ftp://iacftp.ethz.ch/pub_read/luo/CMIP6/) by Ben Johnson (ben.johnson@metoffice.gov.uk)" ;

On ARCHER there are CMIP6 aerosol data under /work/y07/y07/umshared/CMIP6_ANCIL/data/ancils/n96e that might be worth looking at.


comment:2 Changed 12 months ago by yaweiqu

Hi Willie,

Sorry for the delay. I have no access to ARCHER, but I can find the aerosol data under /projects/ancils/cmip6/ancils/n96e on Monsoon. I've looked at both the data under /projects/ancils/cmip6/ancils/n96e/timeseries_1850-2014/VolcanicAod/v3/ and the data I used. As you said, the atmos_main job is still in the range of the aerosol data, so the NaNs may not be related to the aerosol data.

I found #2392 reporting the same error. I've tried running the model with output diagnostics set to high (https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints), and tried setting the time step to 144. Neither of these worked, and the error message is still "NaNs in error term in BiCGstab after 1 iterations", with no more information. I can't find out why the job failed in 20150101 or how to correct it. Do you have any idea?

Thanks for your help.


comment:3 Changed 12 months ago by grenville


Could you try switching off the ancillary file updating? The error is strongly suggestive of an input data problem (as indicated by Willie). Switching off ancillary file updating, and then switching it back on incrementally, may point to the problem.
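The incremental approach could be organised along these lines. This is only a sketch of the bookkeeping: run_ok is a hypothetical stand-in for re-running the failing cycle by hand with updating enabled for the listed ancillaries and noting whether it completes. It is not a UM or Rose interface, and the ancillary names are placeholders.

```python
def first_failing_ancillary(names, run_ok):
    """Re-enable updating one ancillary at a time; return the first that breaks the run."""
    enabled = []
    for name in names:
        enabled.append(name)
        # run_ok is hypothetical: it reports whether a test run with updating
        # switched on for exactly these ancillaries completed without error.
        if not run_ok(list(enabled)):
            return name
    return None  # every ancillary could be re-enabled without a failure


# Made-up example: pretend the run fails whenever "ancil_b" is being updated.
def fake_run_ok(enabled):
    return "ancil_b" not in enabled


print(first_failing_ancillary(["ancil_a", "ancil_b", "ancil_c"], fake_run_ok))
```

Each call to run_ok corresponds to one manual test run, so the number of runs grows with the number of ancillaries; if the run still fails with all updating switched off, the problem lies elsewhere.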


comment:4 Changed 12 months ago by yaweiqu

Hi Grenville and Willie,

Many thanks for the help. I will switch on and off the ancillary file updating and see where the problem is.


comment:5 Changed 10 months ago by grenville

Closed for inactivity.

