Opened 6 months ago

Closed 4 months ago

#2392 closed help (answered)

NaNs in error term in BiCGstab

Reported by: s.varma13
Owned by: um_support
Priority: high
Component: UM Model
Keywords: BiCGstab error
Cc:
Platform: Monsoon2
UM Version: 10.8

Description

Hi, I am almost at the end of my suite run u-au329 and the following error has occurred:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 1
? Error from routine: EG_BICGSTAB
? Error message: NaNs in error term in BiCGstab after 1 iterations
? This is a common point for the model to fail if it
? has ingested or developed NaNs or infinities
? elsewhere in the code.
? See the following URL for more information:
? https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints

I have looked at the wiki information and just wanted to run through what I should do.

The wiki says to "run the model with output diagnostics set to high ([env]PRINT_STATUS=PrStatus_Diag). This will identify if a NaN has been generated by a physics scheme and allows you to narrow down where the problem is."

PRINT_STATUS appears in um > env > runtime controls > atmosphere only but there is no option for PrStatus_Diag.

There is an option for all information messages and an option for extra diagnostic messages. Should I turn on "all information", save the suite and then restart, either from the GUI (Trigger - run now) or with rose suite-run --restart?

Do you perhaps have a better solution for this issue?

Many thanks

Sunil

Change History (12)

comment:1 Changed 6 months ago by s.varma13

Hi, just to let you know this error has occurred in three other simulations (u-au333, u-au341 and u-au342), which I am running from the same template as u-au329. All but u-au342 have stopped automatically, so running rose sgc on them returns a fail message. u-au342 has also failed but the suite is still running, if you need to see an open suite.

Many thanks

Sunil

comment:2 Changed 6 months ago by s.varma13

Hi, just wondering if anyone had a chance to look at this.

Many thanks

Sunil

comment:3 Changed 6 months ago by ros

For the record, the following was advised in an offline email:

In um > env > runtime controls > atmosphere only set PRINT_STATUS to "Extra diagnostic messages" and restart with rose suite-run --restart. This sets the value of PRINT_STATUS=PrStatus_Diag.

comment:4 Changed 6 months ago by s.varma13

Thanks a lot Ros. I just made that change, saved it and then restarted run u-au329. It failed instantly but the error log has no more information than when I was just running normal diagnostics. Thanks. Sunil

An example:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 1
? Error from routine: EG_BICGSTAB
? Error message: NaNs in error term in BiCGstab after 1 iterations
? This is a common point for the model to fail if it
? has ingested or developed NaNs or infinities
? elsewhere in the code.
? See the following URL for more information:
? https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
? Error from processor: 397
? Error number: 35
????????????????????????????????????????????????????????????????????????????????

[397] exceptions: An non-exception application exit occured.
[397] exceptions: whilst in a serial region
[397] exceptions: Task had pid=40108 on host nid04173
[397] exceptions: Program is "/home/d04/suvar/cylc-run/u-au329/share/fcm_make_um/build-atmos/bin/um-atmos.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
? See the following URL for more information:
? https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
? Error from processor: 332
? Error number: 35
????????????????????????????????????????????????????????????????????????????????

[332] exceptions: An non-exception application exit occured.
[332] exceptions: whilst in a serial region
[332] exceptions: Task had pid=31475 on host nid04165
[332] exceptions: Program is "/home/d04/suvar/cylc-run/u-au329/share/fcm_make_um/build-atmos/bin/um-atmos.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
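
Because many processors write to the same file, the reports interleave as in the example above. A minimal sketch, assuming only the report layout shown in this log, for listing which processors reported an error and from which routine. The function name and regular expressions are illustrative and not part of the UM or Rose tooling.

```python
import re
from collections import defaultdict

# Patterns are assumptions based on the report layout shown in this ticket.
ROUTINE_RE = re.compile(r"Error from routine:\s*(\S+)")
PROC_RE = re.compile(r"Error from processor:\s*(\d+)")


def processors_by_routine(log_text):
    """Map each reporting routine to the processors that raised an error there.

    Interleaved output can lose the routine line for some reports;
    those processors are grouped under the key "unknown".
    """
    result = defaultdict(list)
    routine = "unknown"
    for line in log_text.splitlines():
        match = ROUTINE_RE.search(line)
        if match:
            routine = match.group(1)
            continue
        match = PROC_RE.search(line)
        if match:
            result[routine].append(int(match.group(1)))
            routine = "unknown"  # each report names its routine at most once
    return dict(result)
```

Pass it the text of the job error file, for example `processors_by_routine(open(path).read())`, where `path` is wherever your suite writes the atmosphere task's error log.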

comment:5 Changed 6 months ago by willie

Hi Sunil,

You could try halving, or even quartering the time step. This type of problem can be difficult to solve, but that is a good start.

Regards
Willie

comment:6 Changed 6 months ago by s.varma13

Thanks Willie. How do I do that?

Sunil

comment:7 Changed 6 months ago by willie

Hi Sunil,

You'll find it under um → namelist → Top Level … → Model Domain and Timestep.

Doubling steps_per_periodim will do the trick.
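
As a rough illustration of why doubling works: the timestep length is the period length divided by the number of steps in it. The sketch below assumes a period of 86400 seconds (one day) and a starting value of 72 steps; both numbers are assumptions, so check the values in your own suite.

```python
# Sketch: timestep length implied by the number of steps per period.
# SECS_PER_PERIOD = 86400 and the starting value of 72 steps are assumptions.

def timestep_seconds(secs_per_period, steps_per_period):
    """Return the length of one model timestep in seconds."""
    if steps_per_period <= 0:
        raise ValueError("steps_per_period must be positive")
    return secs_per_period / steps_per_period


SECS_PER_PERIOD = 86400  # assumed: one day per period

for steps in (72, 144, 288):
    dt = timestep_seconds(SECS_PER_PERIOD, steps)
    print(f"steps_per_periodim={steps:4d} -> timestep {dt:.0f} s")
```

Under these assumptions, each doubling of steps_per_periodim halves the timestep.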

Willie

comment:8 Changed 6 months ago by s.varma13

Hi Willie

I stopped the suite, doubled the time steps to 144 and then raised them to 216, saved it both times and then did rose suite-run --restart

It failed immediately and I received the same error message.

Any other suggestions?

Many thanks

Sunil

comment:9 Changed 5 months ago by willie

Hi Sunil,

Did you manage to solve this?

Regards
Willie

comment:10 Changed 5 months ago by s.varma13

Hi Willie

No, I did not. I had to start a new set of simulations (u-av309, u-av362 and u-av363) for the same length of time, as my monthly output from this simulation was not correct, so I ended up not needing to continue this run.

However, all three of the new simulations also stopped at exactly the same time - Jan 2013. I followed your instructions above and again, the error log gave me no further information. I need to have them run for two more years. What do you suggest?

Many thanks

Sunil

comment:11 Changed 5 months ago by willie

Hi Sunil,

I had a look at u-au329, which started this ticket, and you seem to have gotten past the BiCGstab problem. If the methods used for u-au329 haven't worked then I'm out of ideas at the moment.

Regards
Willie

comment:12 Changed 4 months ago by willie

  • Resolution set to answered
  • Status changed from new to closed