Opened 5 months ago

Last modified 5 months ago

#2082 new help

Nested suite fails with error "Convergence failure in BiCGstab, omg is too small"

Reported by: shakka Owned by: um_support
Priority: high Component: UM Model
Keywords: convergence, regional model Cc:
Platform: MONSooN UM Version: 10.4

Description

Hi,

I am running Stuart Webster's nested suite over the Antarctic Peninsula, and the model repeatedly fails halfway through the run at 0.5 km and 1.5 km resolution: it fails at 0.5 km on 25 May at 12:00 UTC, and then at 1.5 km on 26 May at 00:00 and 12:00 UTC.

The job.err output shows that the model fails to converge, usually at a late timestep (3918 in the case of 20160525T1200Z_Peninsula_km0p5_ctrl_um_fcst001). This is consistent with how long the job runs successfully before it fails (around 73 minutes).

I have looked at ticket #1884 (http://cms.ncas.ac.uk/ticket/1884) but the problem there seemed to be in the driving global model, rather than in the regional model.

The full error message is:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 11
? Error from routine: EG_BICGSTAB_MIXED_PREC
? Error message: Convergence failure in BiCGstab, omg is too small
? Error from processor: 0
? Error number: 21
????????????????????????????????????????????????????????????????????????????????

The outputs are located at, e.g., /home/elgil/cylc-run/u-ah710/log/job/20160525T1200Z_Peninsula_km0p5_ctrl_um_fcst001/03/job.err

I am re-running the model with a shorter timestep to test whether that helps. Running at coarser resolution evidently avoids the problem, but I need 500 m resolution for a paper that is waiting to be submitted, so the 4 km runs are not really suitable.

Is there anything that I can do to work around this? I hope you'll understand that this is quite urgent.

Thanks

Ella

Change History (5)

comment:1 Changed 5 months ago by grenville

Ella

Reducing the time step may help.

Ideally, you need to diagnose the cause of the problem. That would entail (initially) writing out dumps at several timesteps before the failure and examining the fields to see where the instability arises, then potentially delving further to look at the various increments. Either way, it's not likely to be a quick fix.
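
For illustration only, the kind of check described above can be sketched in plain Python. This assumes the fields have already been read out of the dumps into nested lists (one 2D field per timestep); the function names, the growth threshold and the numbers are all hypothetical and are not taken from the suite.

```python
def locate_extreme(field):
    """Return (max_abs_value, (row, col)) for a 2D field given as a list of rows.

    A NaN anywhere is reported as an infinite value at its location.
    """
    best_val, best_pos = 0.0, (0, 0)
    for j, row in enumerate(field):
        for i, val in enumerate(row):
            if val != val:  # NaN is the only value not equal to itself
                return float("inf"), (j, i)
            if abs(val) > best_val:
                best_val, best_pos = abs(val), (j, i)
    return best_val, best_pos


def first_suspect_step(fields_by_step, growth_factor=5.0):
    """Find the first dump whose extreme exceeds growth_factor times the
    extreme of the earliest dump.

    fields_by_step: list of (timestep, field) pairs in time order.
    Returns (timestep, (row, col), value), or None if nothing stands out.
    """
    baseline, _ = locate_extreme(fields_by_step[0][1])
    for step, field in fields_by_step[1:]:
        value, pos = locate_extreme(field)
        if value > growth_factor * baseline:
            return step, pos, value
    return None


# Synthetic example: one grid point grows rapidly from one dump to the next.
dumps = [
    (1, [[0.1, -0.2, 0.1], [0.0, 0.3, -0.1]]),
    (2, [[0.1, -0.2, 0.2], [0.0, 0.3, -0.4]]),
    (3, [[0.1, -0.3, 0.2], [0.0, 0.4, -2.5]]),
    (4, [[0.2, -0.3, 0.9], [0.1, 0.5, -40.0]]),
]
print(first_suspect_step(dumps))
```

The grid location it reports is only a starting point for looking at the fields and increments around that point in the real dumps.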

comment:2 Changed 5 months ago by grenville

Ella

Another option would be to not use the mixed precision solver — do you know how to do that?

Grenville

comment:3 Changed 5 months ago by shakka

Hi Grenville,

Thanks for your response. I will look at the time stepping and then at the dumps before that. Could you please tell me how to change the mixed precision solver?

Thanks

Ella

comment:4 Changed 5 months ago by grenville

Ella

You'll need to rebuild the model with the C_DP_HLM key added to keys_atmos_app (this is in fcm_make→env→Preprocessing). This tells the model to use eg_bicgstab rather than eg_bicgstab_mixed_prec in eg_sl_helmholtz.F90. There's no guarantee this will solve your problem, but the "omg" in the error message is a single precision variable, so changing to the double precision solver might help.

Grenville
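
As background to the "omg is too small" message, here is a minimal textbook BiCGstab sketch in plain Python, showing where the stabilisation parameter omega is computed and why the iteration cannot continue if it collapses to zero. This is not the UM's EG_BICGSTAB_MIXED_PREC code: it has no preconditioner, it runs entirely in double precision, and the function names and the `omega_min` threshold are assumptions made for the example.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product for A given as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def bicgstab(A, b, shadow=None, tol=1e-10, max_iter=50, omega_min=1e-30):
    """Unpreconditioned BiCGstab for A x = b, starting from x = 0.

    shadow is the fixed shadow residual (defaults to the initial residual).
    Raises ArithmeticError if omega becomes too small to continue.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)  # residual b - A x for x = 0
    r_hat = list(shadow) if shadow is not None else list(r)
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    b_norm = norm(b) or 1.0
    for it in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)  # omega is a divisor here
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        v = matvec(A, p)
        alpha = rho_new / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        if norm(s) / b_norm < tol:
            return [xi + alpha * pi for xi, pi in zip(x, p)], it
        t = matvec(A, s)
        omega = dot(t, s) / dot(t, t)  # stabilisation step length
        if abs(omega) < omega_min:
            raise ArithmeticError("BiCGstab breakdown: omega is too small")
        x = [xi + alpha * pi + omega * si for xi, pi, si in zip(x, p, s)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        if norm(r) / b_norm < tol:
            return x, it
        rho = rho_new
    raise ArithmeticError("BiCGstab did not converge")


# A well-behaved system converges normally.
x, iterations = bicgstab([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])

# A skew-symmetric matrix gives t.s = 0 exactly, so omega = 0 and the
# solver stops with a breakdown error instead of dividing by zero later.
try:
    bicgstab([[0.0, 1.0], [-1.0, 0.0]], [1.0, 0.0], shadow=[1.0, -1.0])
except ArithmeticError as err:
    print(err)
```

Because omega is used as a divisor on the next iteration, a value that underflows or loses all significance stops the solve, and a single precision omega reaches that point sooner than a double precision one.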

comment:5 Changed 5 months ago by shakka

Hi Grenville,

Thanks for the tip, I'll give that a go.

Ella
