#3002 closed help (fixed)

'Convergence failure in BiCGstab, omg is NaN' error - again, but not in usual place

Reported by: charlie Owned by: um_support
Component: UM Model Keywords:
Cc: Platform: NEXCS
UM Version: 10.7

Description

Hi,

Sorry to bother you once again, but my current suite (u-bk944) has again failed, giving me the error below:

???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 1
?  Error from routine: EG_BICGSTAB
?  Error message: Convergence failure in BiCGstab, omg is NaN
?        This is a common point for the model to fail if it
?        has ingested or developed NaNs or infinities
?        elsewhere in the code.
?        See the following URL for more information:
?        https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
?  Error from processor: 144
?  Error number: 12
????????????????????????????????????????????????????????????????????????????????

I have seen this error many times before, but usually only at the very beginning of the simulation, i.e. at the very first timestep. It is usually a catch-all error for missing data where the model expects actual data, or vice versa, and usually results from an error in one of my ancillary files.

However, on this occasion, the error has occurred ~32 years into the simulation. So it can't be anything to do with my ancillary files, or any change that I have made. It has just occurred within the simulation itself. Please would you be able to advise on what I need to do to track this error down?

Many thanks,

Charlie

Change History (15)

comment:1 Changed 10 months ago by dcase

Charlie,

the URL in the error message gives some pointers; importantly it suggests changing the print status (you should be able to do this in your suite through the GUI, under runtime controls).

Looking at this information may be the first port of call, if you haven't already done it.

Dave

comment:2 Changed 10 months ago by charlie

Okay, I have just turned the print status to 'Extra diagnostics', and have resubmitted the suite. If this does provide any pointers, where will they be listed?

Thanks,

Charlie

comment:3 Changed 10 months ago by charlie

Hi again,

Okay, the suite has now failed at the same point, giving me the same error (below). The extra diagnostics is already turned on, so where should I look in the output?

Also, I note that the processor number in the error is different from before: 72 this time, as opposed to 144 the first time. I haven't yet worked out where this is in the world, but doesn't this imply that the blowup is at different locations?

Charlie

???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 1
Rank 100 [Thu Aug 29 17:42:16 2019] [c3-0c2s9n3] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 100
?  Error from routine: EG_BICGSTAB
?  Error message: Convergence failure in BiCGstab, omg is NaN
?        This is a common point for the model to fail if it
?        has ingested or developed NaNs or infinities
?        elsewhere in the code.
?        See the following URL for more information:
?        https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
?  Error from processor: 72
?  Error number: 1853
??????????????????????????????????????????????????????????????????????????
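
On the question of where a processor number sits in the world, here is a rough sketch rather than anything taken from the UM itself. It assumes that ranks are numbered row by row, west to east, starting from the southernmost row, and that the global domain is split evenly. The decomposition of 24 x 16 processors is purely hypothetical; the real values would have to come from the suite's own decomposition settings. Note also that the processor reporting the NaN in the solver may not be where the NaN first developed.

```python
def rank_to_subdomain(rank, nproc_ew, nproc_ns):
    """Map an MPI rank to an approximate longitude/latitude box.

    ASSUMPTIONS (not confirmed for the UM): ranks run west to east along
    each row, rows are numbered from south to north, and every processor
    owns an equal share of a global 360 x 180 degree domain.
    """
    if not 0 <= rank < nproc_ew * nproc_ns:
        raise ValueError("rank is outside the processor grid")
    row, col = divmod(rank, nproc_ew)
    lon_width = 360.0 / nproc_ew
    lat_width = 180.0 / nproc_ns
    lon_range = (col * lon_width, (col + 1) * lon_width)
    lat_range = (-90.0 + row * lat_width, -90.0 + (row + 1) * lat_width)
    return col, row, lon_range, lat_range


# Hypothetical 24 (east-west) x 16 (north-south) decomposition.
for rank in (144, 72):
    print(rank, rank_to_subdomain(rank, nproc_ew=24, nproc_ns=16))
```

If the two boxes turn out to be far apart, that would be consistent with the reporting processor not being a reliable guide to where the problem started.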

comment:4 Changed 10 months ago by dcase

A hacky approach to this would be to try to perturb the system and hope that it evolved along a numerically stable path. You can use the perturb_theta script, use of which is detailed here:

https://code.metoffice.gov.uk/trac/moci/wiki/tips_CRgeneral#Restartingifthemodelblowsup

If I were you, I would try to restart from the current month at first, but with this perturbation in the physics.
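
To illustrate the idea only: the sketch below adds very small random noise to a list of values, which is the general principle behind nudging a field so that the model follows a slightly different trajectory. It is not the perturb_theta script, it does not read or write UM dumps, and the function name, amplitude and sample values are all made up for the example.

```python
import random


def perturb_field(values, amplitude=1e-4, seed=0):
    """Return a copy of values with uniform noise in [-amplitude, amplitude].

    Illustrative only: the amplitude and the use of uniform noise are
    assumptions, not what perturb_theta.py is known to do.
    """
    rng = random.Random(seed)  # fixed seed so the perturbation is repeatable
    return [v + rng.uniform(-amplitude, amplitude) for v in values]


theta = [288.15, 290.02, 285.77]  # made-up potential temperatures (K)
perturbed = perturb_field(theta)
print(max(abs(p - t) for p, t in zip(perturbed, theta)))
```

Recording the seed and amplitude used is what makes such a restart reproducible later.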

comment:5 Changed 10 months ago by charlie

Okay, I have just tried this, but the first problem is that I can't find the perturb_theta script:

cwilliams@xcslc1:~/cylc-run/u-bk944/share/data/History_Data> ~moci/bin/perturb_theta.py bk944a.da18830821_00_orig --output ./bk944a.da18830821_00
-bash: /home/d00/moci/bin/perturb_theta.py: No such file or directory

Where is it on NEXCS?

Also, I don't entirely understand step 5 in these instructions, would you be able to clarify how I restart it once I have perturbed my latest start dump?

Charlie

comment:6 Changed 10 months ago by dcase

If you can't see a central location for these, you can get the moci scripts and put them in your own local directory with:

fcm co fcm:moci.xm_tr/Utilities/lib/ local_moci_lib

and this has the perturb_theta.py.

All of the restart files should be as before (except that you will have perturbed the dump), so you should be able to restart as you did previously.

comment:7 Changed 10 months ago by charlie

Okay, I now have the script, but still can't run it on NEXCS:

cwilliams@xcslc1:~/cylc-run/u-bk944/share/data/History_Data> module load um_tools
cwilliams@xcslc1:~/cylc-run/u-bk944/share/data/History_Data> ~/local_moci_lib/perturb_theta.py bk944a.da18830821_00_orig --output ./bk944a.da18830821_00
/usr/bin/env: python2.7: No such file or directory

Charlie

comment:8 Changed 10 months ago by dcase

I think that maybe you've unloaded python. Load it with 'module load python' and check that 'python --version' reports 2.7.

If you can't easily do this, then look through the files which are sourced when you make a new shell and remove python conflicts.
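
The '/usr/bin/env: python2.7' message means that no executable called python2.7 was found on the PATH at that moment. As a small sketch, assuming some Python 3.3 or later interpreter is available to run it, the check below reports where a named interpreter resolves, or None if it is missing. The function name is made up for the example.

```python
import shutil


def interpreter_on_path(name="python2.7"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


print(interpreter_on_path("python2.7"))
```

Running it before and after loading the module shows whether the module actually puts python2.7 on the PATH.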

comment:9 Changed 10 months ago by charlie

Okay, that now works, and I have restarted my suite after perturbing the restart dump. I will let you know what happens, and whether it gets past the previous blowup.

comment:10 Changed 10 months ago by charlie

Hi Dave,

Nice idea, but sadly no cigar. The suite has failed again at exactly the same point (21 August 1883, so roughly halfway through that yearly cycle), even after perturbing the 21 August 1883 restart dump. If you look in my job.err file, we can now see multiple instances of the same error e.g.

???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 1
?  Error from routine: EG_BICGSTAB
?  Error message: Convergence failure in BiCGstab, omg is NaN
?        This is a common point for the model to fail if it
?        has ingested or developed NaNs or infinities
?        elsewhere in the code.

?        See the following URL for more information:
?????????????????????????????????????????????????????????????????????????????????        https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints

???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error from processor: 356
?  Error code: 1?  Error number: 616

each with a different processor number. So does this imply it is blowing up in several geographic locations?

Please can you advise further? Instead of perturbing the August restart dump, would it be worth going back to the beginning of that year (i.e. January 1883), perturbing that dump instead, and trying again?

Thanks,

Charlie
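
To see at a glance which processors reported the failure, the text of job.err can be scanned for the 'Error from processor' lines. This is a minimal sketch: the function name is made up, and the sample text is a shortened stand-in for the real file, which would be read from the suite's log directory instead. Because it searches for the phrase anywhere in the text, it still works when output from several processors is interleaved.

```python
import re
from collections import Counter

# Matches the processor number in lines such as "?  Error from processor: 356"
PROCESSOR_RE = re.compile(r"Error from processor:\s*(\d+)")


def failing_processors(log_text):
    """Count how many times each processor number appears in the log text."""
    return Counter(int(n) for n in PROCESSOR_RE.findall(log_text))


# Shortened stand-in for job.err; the real text would come from the file.
sample = """\
?  Error from routine: EG_BICGSTAB
?  Error from processor: 356
?  Error code: 1?  Error number: 616
?  Error from processor: 72
"""

print(sorted(failing_processors(sample)))
```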

comment:11 Changed 10 months ago by dcase

Charlie,

your suggestion of going back and perturbing sounds very sensible. I suggested trying the current month first as I knew that a restart was possible, but it doesn't matter where you restart from, as long as it gets you over this problem and you record the details.

Dave

comment:12 Changed 10 months ago by charlie

Hi Dave,

Okay, I haven't done that yet, as I don't think it is actually necessary. This is simply because I have branched off the above suite and created a new one, including improved ancillary files (which I needed to do anyway), and started from the endpoint of the above, i.e. January 1883. The old suite, therefore, is now redundant.

My new suite, u-bm327, has now completed 1883 and has moved onto the following year i.e. it has got past the previous blowup point.

It might blow up again, of course. After all, my original suite took 32 years to blow up, but for now at least it is running. Shall I close this ticket, or leave it open in case my new suite blows up?

Charlie

comment:13 Changed 10 months ago by dcase

That sounds excellent. If you've made scientific changes, and these obviate the numerical instability, then proceed as you think best based upon science (rather than dodging computational hindrances).

Feel free to close the ticket, and reopen it if you run into issues later on.

comment:14 Changed 10 months ago by charlie

Many thanks.

comment:15 Changed 10 months ago by charlie

  • Resolution set to fixed
  • Status changed from new to closed