Opened 4 years ago

Closed 3 years ago

#2082 closed help (fixed)

Nested suite fails with error "Convergence failure in BiCGstab, omg is too small"

Reported by: shakka
Owned by: grenville
Component: UM Model
Keywords: convergence, regional model
Cc:
Platform: MONSooN
UM Version: 10.4



I am running Stuart Webster's nested suite over the Antarctic Peninsula, and the model is repeatedly failing at 0.5 km and 1.5 km resolution halfway through the run: it fails at 0.5 km resolution on 25 May at 12:00 UTC, and then at 1.5 km resolution on the 26th at 00:00 and 12:00 UTC.

The job.err output shows that the model fails to converge, usually in a late timestep (3918 in the case of 20160525T1200Z_Peninsula_km0p5_ctrl_um_fcst001) and this is also reflected in the amount of time the job successfully runs for before it fails (around 73 mins).

I have looked at ticket #1884, but the problem there seemed to be in the driving global model rather than in the regional model.

The full error code is here:

???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 11
? Error from routine: EG_BICGSTAB_MIXED_PREC
? Error message: Convergence failure in BiCGstab, omg is too small
? Error from processor: 0
? Error number: 21

The outputs are located at, e.g., /home/elgil/cylc-run/u-ah710/log/job/20160525T1200Z_Peninsula_km0p5_ctrl_um_fcst001/03/job.err

I am re-running the model with a shorter time-step to test whether this will work, and evidently reducing the resolution avoids this problem, but I need 500 m resolution for a paper that is currently waiting for submission, so the 4 km runs are not really suitable.

Is there anything that I can do to work around this? I hope you'll understand that this is quite urgent.



Change History (12)

comment:1 Changed 4 years ago by grenville


Reducing the time step may help.

Ideally, you need to diagnose the cause of the problem. That would entail (initially) writing out dumps at several time steps prior to the failure and examining the fields to see where the instability may arise, then potentially delving further to look at various increments. Either way, it's not likely to be a quick fix.

comment:2 Changed 4 years ago by grenville


Another option would be to not use the mixed precision solver — do you know how to do that?


comment:3 Changed 4 years ago by shakka

Hi Grenville,

Thanks for your response. I will look at the time stepping and then at the dumps before that. Could you please tell me how to change the mixed precision solver?



comment:4 Changed 4 years ago by grenville


You'll need to rebuild the model with the C_DP_HLM key added to keys_atmos_app (this is under fcm_make→env→Preprocessing). This tells the model to use eg_bicgstab rather than eg_bicgstab_mixed_prec in eg_sl_helmholtz.F90. There is no guarantee this will solve your problem, but the "omg" in the error message is a single precision variable, so changing to the double precision solver might help.
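For readers unfamiliar with the solver, here is a minimal NumPy sketch of textbook, unpreconditioned BiCGstab that shows where an "omega too small" check can sit and how the working precision enters. It is not the UM code: it assumes "omg" corresponds to the omega step length of the standard algorithm, it runs entirely in one precision rather than mixing precisions, and the function name, the `omega_min` threshold and the test matrix are all hypothetical.

```python
import numpy as np

def bicgstab(A, b, dtype=np.float64, tol=1e-6, max_iter=200, omega_min=None):
    """Textbook BiCGstab in a single working precision (illustrative only)."""
    A = np.asarray(A, dtype=dtype)
    b = np.asarray(b, dtype=dtype)
    if omega_min is None:
        # Hypothetical threshold: smallest normal number of the working precision.
        omega_min = np.finfo(dtype).tiny
    x = np.zeros_like(b)
    r = b - A @ x              # initial residual
    r_hat = r.copy()           # fixed shadow residual
    rho = alpha = omega = dtype(1.0)
    v = np.zeros_like(b)
    p = np.zeros_like(b)
    b_norm = np.linalg.norm(b)
    for it in range(1, max_iter + 1):
        rho_new = r_hat @ r
        beta = (rho_new / rho) * (alpha / omega)
        p = r + beta * (p - omega * v)
        v = A @ p
        alpha = rho_new / (r_hat @ v)
        s = r - alpha * v
        if np.linalg.norm(s) <= tol * b_norm:
            return x + alpha * p, it
        t = A @ s
        omega = (t @ s) / (t @ t)   # stabilising step length
        if abs(omega) < omega_min:
            # Later iterations divide by omega, so the solve cannot continue.
            raise RuntimeError("Convergence failure in BiCGstab, omega is too small")
        x = x + alpha * p + omega * s
        r = s - omega * t
        if np.linalg.norm(r) <= tol * b_norm:
            return x, it
        rho = rho_new
    raise RuntimeError("BiCGstab did not converge within max_iter iterations")

# Small, well-conditioned example system (made up for illustration).
A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x64, _ = bicgstab(A, b, dtype=np.float64)
x32, _ = bicgstab(A, b, dtype=np.float32, tol=1e-4)
```

Because the next iteration divides by omega, a value that underflows in single precision may still be representable in double precision, which is the reasoning behind trying the double precision solver.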


comment:5 Changed 4 years ago by shakka

Hi Grenville,

Thanks for the tip, I'll give that a go.


comment:6 Changed 4 years ago by willie

  • Resolution set to answered
  • Status changed from new to closed

Closed due to lack of activity.

comment:7 Changed 3 years ago by shakka


I am getting the same problem again with a very similar suite, u-ai781 (also a copy of Andrew Orr's u-ag339). I have turned off rg01_rs01_lai_anc under Nested region 1 setup > Resolution 1 setup as advised, because it was throwing a convergence failure error from EG_BICGSTAB_MIXED_PREC at timestep 0. However, even after doing this, I am now getting an error at timestep 1. Switching to the double precision solver, as Grenville recommended above, didn't help either.

I am somewhat baffled as the suite has successfully run previously. The only differences (as far as I can tell) are in the STASH variables that I have requested to be output. Could this be causing the error?



comment:8 Changed 3 years ago by shakka

  • Resolution answered deleted
  • Status changed from closed to reopened

comment:9 Changed 3 years ago by grenville


Changing diagnostics shouldn't affect how the model runs - can you back this up to where it ran successfully and nail down the change which causes the failure?
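One way to narrow down the change is to compare the configuration text of a working copy of the suite against the failing one. The sketch below uses only Python's standard difflib; the function name, section name and setting names are hypothetical placeholders, not real UM namelist items.

```python
import difflib

def changed_lines(working_text, failing_text):
    """Return only the lines that differ between two configuration texts."""
    diff = difflib.ndiff(working_text.splitlines(), failing_text.splitlines())
    # ndiff prefixes removed lines with "- " and added lines with "+ ";
    # unchanged lines ("  ") and hint lines ("? ") are dropped.
    return [line for line in diff if line.startswith(("- ", "+ "))]

# Hypothetical configuration snippets, for illustration only.
working = "[namelist:example]\nexample_switch=.false.\nexample_count=10"
failing = "[namelist:example]\nexample_switch=.true.\nexample_count=10"

for line in changed_lines(working, failing):
    print(line)
```

Reverting the reported differences one at a time and re-running then isolates the setting that triggers the failure.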


comment:10 Changed 3 years ago by shakka

Hi Grenville, that's exactly what I thought. I've been trying to figure out which change caused it to fail, but I ran a suite-clean and lost the log files. Doh!

I'll keep trying.



comment:11 Changed 3 years ago by willie

  • Owner changed from um_support to grenville
  • Status changed from reopened to assigned

comment:12 Changed 3 years ago by willie

  • Resolution set to fixed
  • Status changed from assigned to closed