Opened 9 months ago

Closed 6 months ago

#2386 closed help (fixed)

EG_BICGSTAB_MIXED_PREC error in forecast stage of nesting suite

Reported by: nx902220 Owned by: willie
Priority: normal Component: UM Model
Keywords: convergence, BiCGstab Cc:
Platform: Monsoon2 UM Version: 10.5

Description

Hi,

I am running a nesting suite u-au206 and in the first nest at the ukv_um_fcst stage it fails with job.err:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 11
? Error from routine: EG_BICGSTAB_MIXED_PREC
? Error message: Convergence failure in BiCGstab, omg is too small
? Error from processor: 256
? Error number: 21
????????????????????????????????????????????????????????????????????????????????

I cannot find the cause of this issue (a colleague has also run the suite, experienced the same problem and could not find a solution). The suite runs for another case study date, 04-05-2016, but not for this one, 11-06-2015.

All I have changed between model runs is the case study date in rose-app.conf and the paths to the reanalysis data in retrieve_ukv_data

If you could investigate this for me I would be very appreciative.

Best wishes,

Lewis

Change History (26)

comment:1 Changed 8 months ago by willie

Hi Lewis,

This is failing at the second time step.

Normally this type of error is corrected by reducing the time step. I have tried 30 sec and 15 sec and these haven't worked. I have also tried increasing the number of convection calls per time step to three, to no avail. I have used a MOCI tool to perturb the temperature field in the start dump and this has not succeeded. I then tried the previous day 2015-06-10, but it too fails after two time steps.

This may take some time to solve.

Regards
Willie

comment:2 Changed 8 months ago by willie

  • Keywords convergence, BICGSTAB added
  • Platform set to Monsoon2
  • UM Version changed from <select version> to 10.5

comment:3 Changed 8 months ago by willie

  • Keywords BiCGstab added; BICGSTAB removed

Hi Lewis,

I have devised a workaround. I created a new start dump by taking a global start dump for 20150611T0000Z and processing it with a more modern UM than the one you were using (yours was created at UM9.0 and stored in MASS). I used u-av666, which is UM10.6 PS37.

The new start dump can be found at ~frmy/cylc-run/u-av666/share/cycle/20150611T0000Z/ukv1_exp8.rcf on Monsoon.

I ran it in u-au206 for three model hours, so it gets past the BiCGstab problem. To do this I had to switch l_murk and l_autoconv_murk off, as the total aerosol field is not present in the start dump.

Regards
Willie

comment:4 Changed 8 months ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

comment:5 Changed 8 months ago by willie

Hi Lewis,

These are the changes I made in retrieve_ukv_data

#moo_get_if_needed $source $target
# replace above line with this
ln -s /home/d04/frmy/cylc-run/u-av666/share/cycle/20150611T0000Z/ukv1_exp8.rcf $target

You just need to change the start date in u-av666 and run it to get the new UKV start dump, then make a link to it as above.
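A minimal sketch of a more defensive version of that link step, assuming you want the task to stop early if the reconfigured dump is missing. The helper name link_start_dump and the scratch files are hypothetical stand-ins, not part of retrieve_ukv_data; in the real script the source would be the ukv1_exp8.rcf path above and the target would be $target.

```shell
#!/bin/sh
# Hypothetical helper (not part of retrieve_ukv_data): link a start dump
# into place only if the source file exists and is readable.
link_start_dump() {
    src=$1
    target=$2
    if [ ! -r "$src" ]; then
        echo "start dump not found or unreadable: $src" >&2
        return 1
    fi
    # -s makes a symbolic link, -f replaces a stale link from an earlier run
    ln -sf "$src" "$target"
}

# Demonstration with stand-in files in a scratch directory.
workdir=$(mktemp -d)
touch "$workdir/ukv1_exp8.rcf"
link_start_dump "$workdir/ukv1_exp8.rcf" "$workdir/ukv_t+1" && echo "linked"
link_start_dump "$workdir/missing.rcf" "$workdir/other" || echo "refused"
```

Failing here, rather than later in the reconfiguration, makes it obvious that the problem is the missing dump and not the model.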

regards
Willie

comment:6 Changed 8 months ago by nx902220

Hi Willie,

I copied your suite u-av666; in my area it is u-av871.
I ran it with no changes.

In lam_ukv_recon_ukv1_exp8 it fails with:

[WARN] file:STASHC: skip missing optional source: namelist:exclude_package(:)
[WARN] file:IOSCNTL: skip missing optional source: namelist:lustre_control
[WARN] file:IOSCNTL: skip missing optional source: namelist:lustre_control_custom_files
[WARN] file:IDEALISE: skip missing optional source: namelist:idealise
[WARN] file:RECONA: skip missing optional source: namelist:trans(:)
/bin/sh: um-recon: command not found
[FAIL] um-recon # return-code=127
2018-03-09T16:33:11Z CRITICAL - Task job script received signal EXIT

Please can you help me with this?

Best wishes,

Lewis


comment:7 Changed 7 months ago by nx902220

Hi Willie,

I tried running u-au206 using the UKV start dump from u-av666.

The change I made in /bin/retrieve_ukv_data was:

#----------------------------------------------------------------------
# UKV start dump
#----------------------------------------------------------------------
#source="moose:opfc/atm/ukv/rerun/${ym}.file/${ymd}_qwqv${hour}.T+1"
#source="moose:/opfc/atm/ukv/rerun/psuite35.file/${ymd}${hour}_ukv_t+1"
#source="moose:opfc/atm/ukv/rerun/${ym}.file/${ymd}T${hour}00Z_ukv_t+1" # use for 2016 dates
source="moose:/opfc/atm/ukv/rerun/${ym}.file/${ymd}${hour}_ukv_t+1" # use for 2015 dates

target=$ROSE_DATAC/ukv_t+1
#moo_get_if_needed $source $target
# replace above line with this
ln -s /home/d04/frmy/cylc-run/u-av666/share/cycle/20150611T0000Z/ukv1_exp8.rcf $target

So I included the symbolic link. I also set l_murk and l_autoconv_murk to false. Do I need to turn Total Aerosol stash request off as well?

At the moment u-au206 is failing in ukv_um_recon with error message:

NetCDF Files to be opened :
Processing Orography (stashcode 33)
Using Ancillary Orography
Vertical interpolation has been switched on due to a change in orography

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 30
? Error from routine: Rcf_Set_Data_Source
? Error message: Section 0 Item 57 : Required field is not in input dump!
? Error from processor: 0
? Error number: 2
????????????????????????????????????????????????????????????????????????????????

Best wishes,

Lewis

comment:8 Changed 7 months ago by willie

Lewis,
Did you switch off both l_murk and l_autoconv_murk?

Willie

comment:9 Changed 7 months ago by nx902220

I turned them from true to false. Should I have put 2 exclamation marks at the front instead?

comment:10 Changed 7 months ago by willie

Hi Lewis,

There are a number of Rose/cylc errors here; I don't know how they came about.

The l_murk is in Section 17 on the Murk Aerosol page: set this to false. There should be no errors on the page.

Search for l_autoconv_murk and set it to false. There should be no errors on this page either.

Try toggling the buttons in the GUI.

Regards
Willie

comment:11 Changed 7 months ago by nx902220

Hi Willie,

I was just setting l_murk and l_autoconv_murk to false in rose-app.conf.

I have repeated this, but toggling in the GUI this time. I could not see errors on the pages.

It fails at ukv_um_fcst with error:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 101
? Error from routine: INBOUNDA
? Error message: Boundary data starts after start of current boundary data interval
? Error from processor: 0
? Error number: 21
????????????????????????????????????????????????????????????????????????????????

I'm not sure why these changes would cause this error.

Best wishes,

Lewis

comment:12 Changed 7 months ago by willie

Hi Lewis,

It's important not to edit things outside the GUI: this defeats the checking. If you look at the l_murk page, toggling that variable influences the state of the other four variables on the page. Ditto for l_autoconv_murk. So now you have an inconsistent state and need to go back a few steps to get to the original, then toggle the variables from inside the GUI.

Regards
Willie

comment:13 Changed 7 months ago by nx902220

Hi Willie,

Before I switched them off in the GUI I changed everything back to its original state in the .conf file (I even did a UNIX diff on the rose-app.conf files between u-au206 and u-at199 to check the murk settings were the same as before I had messed with the .conf file). But as you say, I may not have done this properly and there may have been an inconsistent state. Because I could not be sure which thing in .conf was causing the problem, I decided to start again, take another copy of u-at199 and make the changes exactly as you suggested in u-au206. This suite is called u-aw051.

When I run this I get exactly the same error as in comment 11. Is there something I need to change in retrieve_ukv_data that you have forgotten to mention?

I know that T start is at 03 hours but isn't the start dump from 00 hours? Do I need to make a change in retrieve_ukv_data to the UKV frame part with the .gz file?

I followed your steps exactly in this new suite.
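The rose-app.conf comparison described above can be sketched roughly as follows. This is only an illustration: the two .conf files are simplified stand-ins written to a scratch directory, l_other is a made-up setting, and the key=.true./.false. layout is assumed. It narrows the diff to lines mentioning murk, so unrelated changes between the two suites do not hide the settings of interest.

```shell
#!/bin/sh
# Stand-in files; in practice these would be the rose-app.conf files
# from the two suites being compared.
workdir=$(mktemp -d)
printf '%s\n' 'l_murk=.true.' 'l_autoconv_murk=.true.' 'l_other=.false.' \
    > "$workdir/original.conf"
printf '%s\n' 'l_murk=.false.' 'l_autoconv_murk=.true.' 'l_other=.true.' \
    > "$workdir/edited.conf"

# Keep only the lines that mention murk, sorted so ordering does not matter.
grep -i 'murk' "$workdir/original.conf" | sort > "$workdir/original.murk"
grep -i 'murk' "$workdir/edited.conf" | sort > "$workdir/edited.murk"

# diff exits 0 when the two extracts are identical, non-zero otherwise.
if diff "$workdir/original.murk" "$workdir/edited.murk" > "$workdir/murk.diff"; then
    status=same
else
    status=differ
fi
echo "murk settings: $status"
cat "$workdir/murk.diff"
```

A check like this only confirms that the text of the settings matches. It does not replace the consistency checking done by the GUI.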

Best wishes,

Lewis

comment:14 Changed 7 months ago by willie

Hi Lewis,

Yes you need to match the start dump start and probably extend the runs to cover your period of interest. I haven't done that. I was just trying to get past the BiCGstab problem, which I think I have. Let me know if I'm wrong.

Regards
Willie

comment:15 Changed 7 months ago by nx902220

Hi Willie,

Thanks. I have got past the BiCGstab problem in the UKV.

I still cannot run my copy of your suite u-av666. I have got past my problem in comment 6 by turning on build in the jinja section as you suggested.

It now fails later in fcm_make_um_lam_ctrl. job.out:

[done] make preprocess-atmos# 15.8s
[init] make build-atmos # 2018-03-19T16:43:43Z
[info] sources: total=2248, analysed=2248, elapsed-time=8.5s, total-time=42.2s
============================= PBS epilogue =============================

error: process optcg used more than 4096000kB of memory on node shared100
error: job terminated

If I need to increase memory please can you tell me how to do this? I cannot find any sections that look like the right place in the GUI.

Best wishes,

Lewis

comment:16 Changed 7 months ago by willie

Hi Lewis,

It's in the suite.rc file. You need to change

 -l mem = 4000MB

in various places to 8000MB. I thought I had got them all.
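A rough sketch of one way to find and change every occurrence, assuming the directive appears exactly in the form shown above. The suite.rc written here is a cut-down stand-in with hypothetical task names. Listing the matches first and keeping a backup avoids changing lines blindly.

```shell
#!/bin/sh
old_mem=4000MB
new_mem=8000MB

# Stand-in suite.rc in a scratch directory; task names are hypothetical.
workdir=$(mktemp -d)
cat > "$workdir/suite.rc" <<'EOF'
[[task_a]]
    [[[directives]]]
        -l mem = 4000MB
[[task_b]]
    [[[directives]]]
        -l mem = 4000MB
[[task_c]]
    [[[directives]]]
        -l mem = 2000MB
EOF

# Show every matching line with its line number before changing anything.
grep -n -e "-l mem *= *${old_mem}" "$workdir/suite.rc"

# Keep a backup, then rewrite only the lines that request the old value.
cp "$workdir/suite.rc" "$workdir/suite.rc.orig"
sed "s/\(-l mem *= *\)${old_mem}/\1${new_mem}/" "$workdir/suite.rc.orig" \
    > "$workdir/suite.rc"

# Count what is left; zero means every occurrence was updated.
remaining=$(grep -c -e "-l mem *= *${old_mem}" "$workdir/suite.rc")
echo "lines still requesting ${old_mem}: ${remaining}"
```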

Regards
Willie

comment:17 Changed 7 months ago by nx902220

Hi Willie,

It now fails with job.out in lam_um_recon_ukv1_exp8:

c_io ( 11):Open: File=/home/d04/lblunn/cylc-run/u-av871/share/cycle/20150611T0000Z/glm_t+0
c_io ( 11):c_io_unix: open: ERROR: cannot create a new file in read-only mode.
c_io ( 11):Open: ERROR: file open failed (/home/d04/lblunn/cylc-run/u-av871/share/cycle/20150611T0000Z/glm_t+0)
c_io ( 11):OPEN: WARNING: failed to open file /home/d04/lblunn/cylc-run/u-av871/share/cycle/20150611T0000Z/glm_t+0

.
.
.

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 2
? Error from routine: io:file_open
? Error message: Failed to open file /home/d04/lblunn/cylc-run/u-av871/share/cycle/20150611T0000Z/glm_t+0
? Error from processor: 0
? Error number: 1
????????????????????????????????????????????????????????????????????????????????

I can see that glm_t+0 does not exist. Do I need to change read/write settings?

In retrieve_startdata.sh I can see ISO_SWITCH_DATE=20150825T0600Z. Should it be ISO_SWITCH_DATE=20150611T0000Z?

Best wishes

Lewis

comment:18 Changed 7 months ago by willie

Hi Lewis,

I compared (diff -r ) your av871 with av666 and they seemed almost identical. I won't be able to look at this until next week. In the meantime you could look at the job error files to see if there is anything amiss.

Regards
Willie

comment:19 Changed 7 months ago by willie

Hi Lewis,

Is this still an issue?

Regards
Willie

comment:20 Changed 7 months ago by nx902220

Hi Willie,

Using the start dump from the global model that you gave me I can run the nesting suite all the way through without errors i.e. it solves the EG_BICGSTAB_MIXED_PREC problem.

However I cannot create my own start dumps for different case study dates using my copy of your global suite av871. I cannot find the problem.

I would still be appreciative if you could help with this.

Best wishes,

Lewis

comment:21 Changed 6 months ago by willie

Hi Lewis,

What is the suite-id? What happens when you run it?

Regards
Willie

comment:22 Changed 6 months ago by nx902220

Hi Willie,

The suite name is av871. The error is in comment 17 above.

Cheers,

Lewis

comment:23 Changed 6 months ago by willie

Hi Lewis,

If you look in the install_cold job.err file, you will see that it has failed because it is trying to archive to my area in MASS. This means it doesn't install the rest of the files it needs. You need to change this in the suite.rc file.
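As a sketch of how to spot which task's job.err reports a failure, assuming each task writes a job.err somewhere under one log directory: the directory tree and task names below are stand-ins built in a scratch area, and the sample [FAIL] line is copied from comment 6 purely as test data.

```shell
#!/bin/sh
# Stand-in log tree; task_a and task_b are hypothetical task names.
logroot=$(mktemp -d)
mkdir -p "$logroot/task_a/01" "$logroot/task_b/01"
printf '%s\n' '[FAIL] um-recon # return-code=127' > "$logroot/task_a/01/job.err"
: > "$logroot/task_b/01/job.err"

# find collects every job.err; grep -l prints only the names of the
# files that contain at least one [FAIL] line.
failed=$(find "$logroot" -name job.err -exec grep -l -e '\[FAIL\]' {} +)
echo "$failed"
```

Checking the earliest failing task first matters here, because a failure in an install step can leave later tasks without the files they need.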

Regards
Willie


comment:24 Changed 6 months ago by nx902220

Hi Willie,

Thank you, I can now create the dump file. As well as changing the archiving, I had to remove commenting from retrieve_startdata.sh.

Thank you for persevering with me. All the best,

Lewis

comment:25 Changed 6 months ago by willie

  • Status changed from accepted to new

comment:26 Changed 6 months ago by willie

  • Resolution set to fixed
  • Status changed from new to closed