Opened 4 months ago

Last modified 4 months ago

#3561 reopened help

BiCGstab: omg is too small

Reported by: ggxmy Owned by: um_support
Component: Nesting Suite Keywords:
Cc: Platform: Monsoon2
UM Version: 11.8

Description

Regn1_natl_RA2M_um_fcst_000 fails in my vn11.9 nesting suite u-cf348. This process once ran at 8km resolution but has been failing since I changed to 5km resolution. It leaves this error message:

?  Error code: 11
?  Error from routine: EG_BICGSTAB
?  Error message: Convergence failure in BiCGstab after      1 iterations: omg is too small
?        See the following URL for more information:
?        https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
?  Error from processor: 0
?  Error number: 51

Following the linked page I set PRINT_STATUS=PrStatus_High in app/um/rose-app.conf, but is this the right setting? Now I'm getting longer job.err and job.out files, but I'm not sure where to find the useful information. Please could I have some help? Thank you.
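
One way to sift through long job.out and job.err files is to filter them for lines that look relevant. The following is only a sketch: the function name and the list of patterns are hypothetical, and nothing here is taken from the UM documentation, so the patterns should be adjusted to whatever the model actually prints at the chosen PRINT_STATUS level.

```python
import re

# Hypothetical search patterns; none are guaranteed to match real UM output.
DEFAULT_PATTERNS = (r"\bNaN\b", r"Error", r"BiCGstab")


def find_suspect_lines(lines, patterns=DEFAULT_PATTERNS):
    """Return (line_number, text) pairs for lines matching any pattern.

    `lines` can be any iterable of strings, for example an open file.
    Matching is case-insensitive and line numbers start at 1.
    """
    regex = re.compile("|".join(patterns), re.IGNORECASE)
    hits = []
    for number, text in enumerate(lines, start=1):
        if regex.search(text):
            hits.append((number, text.rstrip("\n")))
    return hits


# Example use (the file name is whatever log you want to inspect):
#     with open("job.out", errors="replace") as handle:
#         for number, text in find_suspect_lines(handle):
#             print(number, text)
```

The first few hits are usually the most informative, because later messages are often consequences of the first failure.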

Masaru

Attachments (5)

lsm.png (33.0 KB) - added by grenville 4 months ago.
orog.png (113.1 KB) - added by grenville 4 months ago.
nestcorrupt.png (63.2 KB) - added by ggxmy 4 months ago.
LSM_orog.png (53.2 KB) - added by ggxmy 4 months ago.
lsm_offPortugal.png (124.9 KB) - added by ggxmy 4 months ago.

Change History (24)

comment:1 Changed 4 months ago by grenville

Masaru

The nested suite (offweur) that works has

rg01_rs02_m01_ic_lbc_src =(0,1)

but

rg01_rs01_m01_ic_lbc_src is not enabled for natl, so it appears not to have boundary conditions. Click on the cog next to rg01_rs01_m01_ic_lbc_src and select Enable.

Grenville

comment:2 Changed 4 months ago by ggxmy

Hi Grenville,

Thank you for your help. I couldn't find rg01_rs01_m01_ic_lbc_src in rose edit (and therefore no cog next to it), so I removed the '!!' from !!rg01_rs01_m01_ic_lbc_src=0,1 in rose-suite.conf. I think this gave me an error, but I can't remember what kind.

However, I have been working on a few other things at the same time and have now changed the resolution back to 8 km. I commented out rg01_rs01_m01_ic_lbc_src=0,1 again, and the suite now runs without this error.
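
In Rose configuration files a leading '!' marks a setting as user-ignored and '!!' marks it as trigger-ignored, meaning it is switched off by the value of some other setting. The sketch below shows the hand edit described above as a small text transformation. The function name is hypothetical, and because a trigger-ignored setting is controlled by another setting, enabling it by hand may conflict with the suite metadata, so treat this as an illustration of the syntax rather than a recommended fix.

```python
def enable_setting(conf_text, name):
    """Strip Rose ignore markers ('!' or '!!') from the line that sets `name`.

    All other lines, including section headers and settings that are
    ignored under a different name, are returned unchanged.
    """
    out = []
    for line in conf_text.splitlines():
        stripped = line.lstrip("!")
        key = stripped.split("=", 1)[0].strip()
        if "=" in stripped and key == name:
            out.append(stripped)  # marker removed, setting is now active
        else:
            out.append(line)
    return "\n".join(out)
```

Applied to a line reading !!rg01_rs01_m01_ic_lbc_src=0,1 with that setting name, this returns rg01_rs01_m01_ic_lbc_src=0,1.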

This problem occurred when I set the resolution to 10 km and 5 km, so it may come back. I am closing the ticket for now but may reopen it.

Masaru

Last edited 4 months ago by ggxmy

comment:3 Changed 4 months ago by ggxmy

  • Resolution set to worksforme
  • Status changed from new to closed

comment:4 Changed 4 months ago by ggxmy

  • Resolution worksforme deleted
  • Status changed from closed to reopened

comment:5 Changed 4 months ago by ggxmy

I changed the regional domain a little, keeping the same 8km resolution, and this error came back. I ran the suite again with rg01_rs01_m01_ic_lbc_src=0,1 uncommented, but Regn1_natl_RA2M_um_fcst_000 failed with exactly the same error as the one shown at the top.

Last edited 4 months ago by ggxmy

comment:6 Changed 4 months ago by grenville

Hi Masaru

I tried the usual trick of reducing the time step, but that was not successful. The model has become unstable and developed NaNs.

Could you try shifting and/or changing the size of the natl domain?

Grenville

comment:7 Changed 4 months ago by grenville

Masaru

The natl domain looks odd - see the attached land sea mask and orography (note that it's probably not a good idea to have steep orography on the boundary).

I got these fields from /scratch/d03/myosh/cylc-run/u-cf348/share/cycle/20190710T0000Z/Regn1/natl/RA2M/ics/RA2M_astart

Grenville

comment:8 Changed 4 months ago by ggxmy

Hi Grenville,

I've been aware of the land-sea mask problem but tried not to bother you or CMS with it because I had this and other outstanding tickets as well. This BiCGstab problem sometimes occurred and other times did not, but I had always had this corrupted LSM. So I was guessing these two issues were not related.

I asked many people about the LSM issue and one of my colleagues found that it works fine if we make the nest much larger. We are guessing that the problem happens when the nest goes beyond the prime meridian by only a little bit.

By trial and error I found a domain that covers my region of interest well without making the nest too big, and the BiCGstab problem doesn't happen with it. I got my second nest working fine as well, but I am still struggling to set up the third nest correctly, and that may be related to this LSM problem. If I find something I will report it here.

Thank you for your help.
Masaru

comment:9 Changed 4 months ago by ggxmy

As I said above, I think I got 2 levels of nests working fine. However, the 3rd nest gets corrupted. The image below shows what I got when I tried to plot the land sea mask for that nest, from this file:
/projects/ukca-leeds/myosh/cylc-run/u-cf597/share/data/ancils/Regn1/swWales/qrparm.mask
Notice that x and y have the same values everywhere.

The log file
/home/d03/myosh/cylc-run/u-cf597/log/job/20190710T0000Z/Regn1_swWales_ancil_mask/NN/job.out
shows that the process completed successfully, and in it I can roughly see the shape of land and sea as 1s and 0s.

What does this mean? Is the file being corrupted after it has been created correctly?

Masaru

comment:10 Changed 4 months ago by grenville

Masaru

This is no longer the case. It looks OK now.

comment:11 Changed 4 months ago by ggxmy

Hi Grenville,

OK. I didn't explain the situation very clearly.

I aimed to place this nest over the sea off SW Wales and west of Cornwall. If that had been successful, the nest should have been mostly over the sea. But the data showed land (LSM=1) everywhere. So I made the nest a little larger to see where it actually covers. Then I got the result above.

After that I tested many different sizes and centre locations, but I had no luck. All I got was either LSM=1 everywhere or a completely corrupted LSM like the one above. Actually, in one of my attempts I saw the coastline of the English Channel, so I thought that was a good sign. But since then I have never got a meaningful LSM again, even when I went back to the same settings. So I still have this problem.
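
When many domain sizes and offsets are being tested, a quick numerical summary of each mask can help separate the "LSM=1 everywhere" cases from the ones that contain a coastline. The sketch below assumes the mask has already been read into a nested list of 0s and 1s; reading qrparm.mask itself needs a UM-aware library and is not shown. The function name and the labels are hypothetical.

```python
def summarise_mask(mask):
    """Return (land_fraction, label) for a 2-D land-sea mask of 0s and 1s.

    `mask` is a list of rows. The label is one of "all land", "all sea",
    "mixed" or "invalid values" (when something other than 0 or 1 appears).
    """
    values = [v for row in mask for v in row]
    if not values:
        raise ValueError("mask is empty")
    if any(v not in (0, 1) for v in values):
        return None, "invalid values"
    land_fraction = sum(values) / len(values)
    if land_fraction == 1.0:
        label = "all land"
    elif land_fraction == 0.0:
        label = "all sea"
    else:
        label = "mixed"
    return land_fraction, label
```

A result of "all land" for a nest that should lie mostly over the sea is a sign that the mask is wrong, even before plotting it.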

Some people suggest that I need to run ANTS (set USE_ANTS = true) where there are small islands that the coarse resolution data do not resolve, but I don't think there are any important islands in this region. Can that be the problem? ANTS cannot be run at the moment because the Python environment it requires has apparently been removed.

Masaru

comment:12 Changed 4 months ago by ggxmy

Now I made the resolution 2 times coarser and I can see a coastline.

But again, this happened only once. Very strange.

The original, intended resolution is 250m, or 0.00225 x 0.00225 degrees. I also tried 0.0045 and 0.009, among others. I tried offsetting the centre location (rg01_rs03_offset) by small amounts, from 3 to 3.5 in y and from 4 to 4.5 in x. I saw a coastline only twice in many attempts. In all other attempts I got LSM=1 everywhere or a corrupted LSM (x and y have the same values everywhere, as in the image above).
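
The correspondence between the grid spacings in degrees and the quoted resolutions can be checked with a short calculation. This is a sketch that assumes a spherical Earth with a mean radius of 6371 km (an assumed round value, not taken from the suite) and spacing measured along a great circle; the function name is hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in metres


def grid_spacing_m(delta_deg, radius_m=EARTH_RADIUS_M):
    """Approximate length in metres of `delta_deg` degrees of great-circle arc."""
    return math.radians(delta_deg) * radius_m


# Spacings mentioned in this comment:
for delta in (0.00225, 0.0045, 0.009):
    print(f"{delta} deg is about {grid_spacing_m(delta):.0f} m")
```

With these assumptions 0.00225 degrees comes out at roughly 250 m, 0.0045 degrees at roughly 500 m and 0.009 degrees at roughly 1 km.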

Last edited 4 months ago by ggxmy

comment:13 Changed 4 months ago by ggxmy

I have now plotted the LSM and orography. The nest does contain both land and sea areas (it looks like the Bristol Channel).

This might be related to the LSM issue for larger nests, although this nest is still entirely in the Western Hemisphere. Obviously this is an area where the UM nesting suite should be used a lot. If other people don't experience this, why do I?

Masaru

comment:14 Changed 4 months ago by ggxmy

Off Portugal is another location where I want to put the 3rd nest. I think I did exactly the same thing as before (u-cf597) except the location and size of the 3rd nest. It simply seems to have worked fine with no problem at all.

Below, 1km (0.009 degree) resolution on the left and 250m (0.00225 degree) on the right. This is from u-cf897.

The image on the right appears to be the same as the one above, but the values are 0 everywhere instead of 1. This is what is expected, because it shows a close-up view of the centre of the region on the left. Unlike the case near the British Isles (above), this agrees with the orography and land fraction as well.

I just double-checked whether u-cf597 runs OK now, but it does not. So I think the problem is likely to be related to the location of the nest.


Last edited 4 months ago by ggxmy

comment:15 Changed 4 months ago by ggxmy

I messed up u-cf897 before committing it, so I created u-cf930 to replace it. This seems to be working in the same way as u-cf897 did.

Last edited 4 months ago by ggxmy

comment:16 Changed 4 months ago by grenville

Masaru

The rotated pole is way out in the Atlantic - the 0.0025 domain is only 1deg in longitude and 1.6deg in latitude - so it will all be sea points.

Grenville

comment:17 Changed 4 months ago by ggxmy

Hi Grenville,

That's what I meant. The nest is created west of Portugal without any problem, but not near the British Isles. That's the mystery.

Masaru

comment:18 Changed 4 months ago by grenville

Hi Masaru

I am not sure how best to proceed. I can only get a reasonable mask by playing with the size of the domain (the CAP behaves the same way on ARCHER2).

I'm sure ANTS has run on Monsoon - maybe that's the way to go.

Grenville

comment:19 Changed 4 months ago by ggxmy

I hear ANTS ran fine a few weeks ago but not now because the ANTS environment was corrupted and then removed. Other people in Leeds are also working on ANTS so I can ask them how it is going.

But I have a feeling that the corruption of the land-sea mask may be a separate issue from the ANTS problem, because when I run the suite with ANTS turned on, the land-sea mask is created (incorrectly) before the ANTS processes fail.

Also, the land-sea mask problem with the first nest has not been resolved, only avoided by finding a regional domain that does not suffer from it. The current issue with the third nest may be related to that initial problem.
