Opened 10 months ago

Last modified 3 weeks ago

#2082 reopened help

Nested suite fails with error "Convergence failure in BiCGstab, omg is too small"

Reported by: shakka
Owned by: um_support
Priority: high
Component: UM Model
Keywords: convergence, regional model
Cc:
Platform: MONSooN
UM Version: 10.4

Description

Hi,

I am running Stuart Webster's nested suite over the Antarctic Peninsula, and the model is repeatedly failing at 0.5 km and 1.5 km resolution halfway through the run: it fails at 0.5 km resolution on 25 May at 12:00 UTC, and then at 1.5 km resolution on 26 May at 00:00 and 12:00 UTC.

The job.err output shows that the model fails to converge, usually in a late timestep (3918 in the case of 20160525T1200Z_Peninsula_km0p5_ctrl_um_fcst001) and this is also reflected in the amount of time the job successfully runs for before it fails (around 73 mins).

I have looked at ticket #1884 (http://cms.ncas.ac.uk/ticket/1884) but the problem there seemed to be in the driving global model, rather than in the regional model.

The full error message is:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 11
? Error from routine: EG_BICGSTAB_MIXED_PREC
? Error message: Convergence failure in BiCGstab, omg is too small
? Error from processor: 0
? Error number: 21
????????????????????????????????????????????????????????????????????????????????

The outputs are located at, e.g., /home/elgil/cylc-run/u-ah710/log/job/20160525T1200Z_Peninsula_km0p5_ctrl_um_fcst001/03/job.err
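When comparing failures across several cycles, it can help to pull the banner fields out of each job.err automatically. The following is a minimal sketch, assuming the banner always uses the "? Error ...: value" layout shown above; the function name `parse_um_error` is made up for illustration and is not part of the UM or Rose tooling.

```python
import re

# Matches banner lines such as "? Error code: 11" and captures the
# field name ("Error code") and its value ("11").
FIELD_RE = re.compile(r"^\?\s*(Error [^:]+):\s*(.*\S)\s*$")


def parse_um_error(text):
    """Return a dict of the 'Error ...' fields found in a job.err-style banner."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


# Banner text copied from the error report in this ticket.
sample = """\
? Error code: 11
? Error from routine: EG_BICGSTAB_MIXED_PREC
? Error message: Convergence failure in BiCGstab, omg is too small
? Error from processor: 0
? Error number: 21
"""

report = parse_um_error(sample)
print(report["Error from routine"], "-", report["Error message"])
```

In practice the text would be read from the job.err file of each failed task rather than from an inline string.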

I am re-running the model with a shorter time-step to test whether this will work, and evidently reducing the resolution avoids this problem, but I need 500 m resolution for a paper that is currently waiting for submission, so the 4 km runs are not really suitable.

Is there anything that I can do to work around this? I hope you'll understand that this is quite urgent.

Thanks

Ella

Change History (10)

comment:1 Changed 10 months ago by grenville

Ella

Reducing the time step may help.

Ideally, you need to diagnose the cause of the problem. That would entail, initially, writing out dumps at several time steps before the failure and examining the fields to see where the instability arises, then potentially delving further into the various increments. Either way, it's not likely to be a quick fix.

comment:2 Changed 10 months ago by grenville

Ella

Another option would be to not use the mixed precision solver — do you know how to do that?

Grenville

comment:3 Changed 10 months ago by shakka

Hi Grenville,

Thanks for your response. I will look at the time stepping and then at the dumps before that. Could you please tell me how to change the mixed precision solver?

Thanks

Ella

comment:4 Changed 10 months ago by grenville

Ella

You'll need to rebuild the model with the C_DP_HLM key added to keys_atmos_app (this is in fcm_make→env→Preprocessing). This tells the model to use eg_bicgstab rather than eg_bicgstab_mixed_prec (see eg_sl_helmholtz.F90). There is no guarantee this will solve your problem, but the "omg" in the error message is a single precision variable, so changing to the double precision solver might help.
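For context on what "omg" is: in the textbook BiCGstab algorithm, omega is the stabilisation step length computed once per iteration, and the method breaks down if it becomes too small. The sketch below is a plain, unpreconditioned BiCGstab in pure Python, not the UM's eg_bicgstab code; the `omega_min` threshold and the error text are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def matvec(A, x):
    """Dense matrix-vector product, with A given as a list of rows."""
    return [dot(row, x) for row in A]


def bicgstab(A, b, tol=1e-10, max_iter=100, omega_min=1e-30):
    """Solve A x = b with textbook BiCGstab; return (x, iterations used)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                      # residual b - A x for the zero initial guess
    r_hat = list(r)                  # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    b_norm = math.sqrt(dot(b, b)) or 1.0
    for iteration in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        v = matvec(A, p)
        alpha = rho_new / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        if math.sqrt(dot(s, s)) <= tol * b_norm:
            # Converged after the first half-step.
            return [xi + alpha * pi for xi, pi in zip(x, p)], iteration
        t = matvec(A, s)
        omega = dot(t, s) / dot(t, t)
        if abs(omega) < omega_min:
            # Hypothetical breakdown check, analogous in spirit to the UM message.
            raise RuntimeError("BiCGstab breakdown: omega is too small")
        x = [xi + alpha * pi + omega * si for xi, pi, si in zip(x, p, s)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        if math.sqrt(dot(r, r)) <= tol * b_norm:
            return x, iteration
        rho = rho_new
    raise RuntimeError("BiCGstab did not converge within max_iter iterations")


# Small well-conditioned test system.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, iterations = bicgstab(A, b)
print(x, iterations)
```

Because omega is a ratio of two inner products, it is sensitive to rounding when the intermediate residual is tiny or badly scaled, which is why holding it in double rather than single precision can matter.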

Grenville

comment:5 Changed 10 months ago by shakka

Hi Grenville,

Thanks for the tip, I'll give that a go.

Ella

comment:6 Changed 4 months ago by willie

  • Resolution set to answered
  • Status changed from new to closed

Closed due to lack of activity.

comment:7 Changed 4 weeks ago by shakka

Hi CMS,

I am getting the same problem again with a very similar suite, u-ai781 (also a copy of Andrew Orr's u-ag339). As advised, I turned off rg01_rs01_lai_anc under Nested region 1 setup > Resolution 1 setup, because it was causing a convergence failure error from EG_BICGSTAB_MIXED_PREC at timestep 0. However, even with that change, I now get an error at timestep 1. Switching to the double precision solver, as Grenville recommended above, didn't help either.

I am somewhat baffled as the suite has successfully run previously. The only differences (as far as I can tell) are in the STASH variables that I have requested to be output. Could this be causing the error?

Thanks,

Ella

comment:8 Changed 4 weeks ago by shakka

  • Resolution answered deleted
  • Status changed from closed to reopened

comment:9 Changed 3 weeks ago by grenville

Ella

Changing diagnostics shouldn't affect how the model runs - can you back this up to where it ran successfully and nail down the change which causes the failure?
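One way to narrow this down is to compare the settings of the last working copy against the failing one and list only what differs. The sketch below assumes simple "[section]" headers and single-line "key=value" entries; real suite configuration files can contain multi-line values that it does not handle, and the setting names in the example are invented for illustration.

```python
def read_settings(text):
    """Map 'section/key' to value for simple 'key=value' lines under '[section]' headers."""
    settings = {}
    section = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif "=" in line:
            key, value = line.split("=", 1)
            settings[f"{section}/{key.strip()}"] = value.strip()
    return settings


def changed_settings(good_text, bad_text):
    """Return {name: (good_value, bad_value)} for every setting that differs."""
    good, bad = read_settings(good_text), read_settings(bad_text)
    return {
        name: (good.get(name), bad.get(name))
        for name in sorted(good.keys() | bad.keys())
        if good.get(name) != bad.get(name)
    }


# Hypothetical settings, for illustration only.
good_conf = """\
[env]
setting_a=1
setting_b=on
"""
bad_conf = """\
[env]
setting_a=1
setting_b=off
setting_c=new
"""

for name, (before, after) in changed_settings(good_conf, bad_conf).items():
    print(f"{name}: {before!r} -> {after!r}")
```

Each reported difference can then be reverted one at a time until the run succeeds again.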

Grenville

comment:10 Changed 3 weeks ago by shakka

Hi Grenville, that's exactly what I thought. I've been trying to figure out which change caused it to fail, but I ran a suite-clean and lost the log files. Doh!

I'll keep trying.

Thanks

Ella
