Opened 2 months ago

Last modified 3 days ago

#2624 new help

Archer run failing

Reported by: admg26 Owned by: um_support
Priority: normal Component: UKCA
Keywords: BiCGstab Cc:
Platform: ARCHER UM Version: 10.9



This is a run that was ported from Monsoon to Archer and was running fine. See ticket

Platform: Archer
Username: admg26
Suite: u-ba626 on Archer (Similar to suite u-az337 on Monsoon that is working okay)
Code: um.x_br/dev/alisonming/vn10.9_volcanic_emissions

Error message after 3 months:

???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 1
? Error from routine: EG_BICGSTAB
? Error message: NaNs in error term in BiCGstab after 1 iterations
? This is a common point for the model to fail if it
? has ingested or developed NaNs or infinities
? elsewhere in the code.
? See the following URL for more information:
? Error from processor: 19
? Error number: 43

Many thanks,

Change History (10)

comment:1 Changed 2 months ago by willie

Hi Alison,

It's failing with the error above at the cycle 19881101T0000Z after running for 421 time steps (just over five days). The URL suggests setting the UM print status to "Extra diagnostic messages" in um → Run time controls → Atmosphere only and repeating the run from this cycle. This will provide further information about the problem.

Also, in your suite the UM metadata (under um) is wrong: it is set to um-atmos/HEAD. It should be um-atmos/vn10.9.


comment:2 Changed 2 months ago by admg26


I have changed the UM metadata and rerun the model with Extra diagnostic messages.

Many thanks,


comment:3 Changed 7 weeks ago by willie

  • Keywords BiCGstab added; Archer removed

Hi Alison,

I took a copy of your job, removed one STASH duplicate, re-indexed and then switched to daily dumps. All the dumps are NaN-free. The run still fails with the BiCGstab error, but a bit earlier than yours, in 19881001.

I then switched from 72 time steps per day to 144. This also fails at the same point with the BiCGstab error.

I'll try increasing the number of processors …


comment:4 Changed 7 weeks ago by admg26


Thanks! I am away next week so no rush. I was meaning to make a copy of the suite from an earlier point (because it was running at some point in the past after I ported it from MonSooN) and then see which change broke the run.


comment:5 Changed 7 weeks ago by willie

Hi Alison,

I tried 24x28 processors and that failed too. The failure point (time step) varies slightly between runs. For the last three runs:

Atm_Step: Timestep     3616   Model time:   1988-10-21 05:20:00
Atm_Step: Timestep     7345   Model time:   1988-10-22 00:10:00
Atm_Step: Timestep     3673   Model time:   1988-10-22 00:20:00

This is a bit odd given that I'm only making small changes.
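
As a rough cross-check, the three fail points can be converted from time step counts to model time. The sketch below assumes a 360-day calendar, a run start of 1988-09-01 00:00, and 72, 144 and 72 time steps per day for the three runs. None of these is stated in the log lines; they are simply the values that reproduce the logged model times. The function name model_time is illustrative only.

```python
def model_time(timestep, steps_per_day, start=(1988, 9, 1)):
    """Convert a time step count to (year, month, day, hour, minute).

    Assumes a 360-day calendar (12 months of 30 days), a start at
    00:00 on the given date, and that steps_per_day divides 86400.
    The default start date is an assumption, not taken from the suite.
    """
    seconds_per_step = 86400 // steps_per_day
    total_seconds = timestep * seconds_per_step

    # Split elapsed time into whole days plus hours and minutes.
    days, rem = divmod(total_seconds, 86400)
    hour, rem = divmod(rem, 3600)
    minute = rem // 60

    # Work in a day index on the 360-day calendar, then unpack it.
    year, month, day = start
    day_index = year * 360 + (month - 1) * 30 + (day - 1) + days
    year, day_of_year = divmod(day_index, 360)
    month0, day0 = divmod(day_of_year, 30)
    return (year, month0 + 1, day0 + 1, hour, minute)


# The three fail points, with the assumed steps per day for each run.
for step, per_day in [(3616, 72), (7345, 144), (3673, 72)]:
    print(step, per_day, model_time(step, per_day))
```

Under those assumptions the three runs fail within 19 hours of model time of each other, on 21 and 22 October 1988.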

You're also getting

?  Warning code: -10
?  Warning from routine: ANCIL_CHECK_GRID_STAGGER
?  Warning message: Ancil file mismatch in fixed header(9) grid stagger value
?          Model grid stagger = 6
?          Ancil file grid stagger = 3
?          Ancil file path = /work/n02/n02/ukca/ancil/n96e/sstice/sice_clim_1996-2005_360d.n96e
?          PLEASE READ - this warning will be converted to an error
?          in future. Please update ancil file to specify the correct
?          grid stagger value.

on the SST/ICE ancillary - I don't know if that's important.


comment:6 Changed 4 weeks ago by willie

Hi Alison,

Is this still an issue?


comment:7 Changed 4 weeks ago by admg26


Yes, it is. I have been trying various things, including rolling back to earlier versions. Now I have another error message that I am trying to sort out:

sys-2 : UNRECOVERABLE error on system request 
  No such file or directory

Encountered during an OPEN of unit 14
Fortran unit 14 is not connected
Application 32743620 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 0 starting:
ATP Stack walkback for Rank 0 done
Process died with signal 6: 'Aborted'
Forcing core dump of rank 0
View application merged backtrace tree with: stat-view
You may need to: module load stat

I am away for the next 2.5 weeks. Could the ticket remain open please?


comment:8 Changed 4 weeks ago by willie

Hi Alison,

I'll keep the ticket open. It looks like you are missing an aerosol file needed by UKCA.


comment:9 Changed 4 weeks ago by admg26


Thanks! I will check all the input files.
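
A minimal sketch of such a check, assuming the input file paths have already been collected into a list (for example by hand from the suite configuration). find_missing is a hypothetical helper, not part of the UM or Rose, and the example path is the sea ice ancillary named in the warning above.

```python
import os


def find_missing(paths):
    """Return the paths that do not point at an existing regular file."""
    return [p for p in paths if not os.path.isfile(p)]


# Example: the list of inputs is an assumption and would need to be
# filled in with every file the suite actually opens.
inputs = [
    "/work/n02/n02/ukca/ancil/n96e/sstice/sice_clim_1996-2005_360d.n96e",
]
for path in find_missing(inputs):
    print("MISSING:", path)
```

This only confirms that each file exists on the machine where it is run; it does not check that the file is readable by the job or that its contents are valid.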


comment:10 Changed 3 days ago by admg26


I found my missing file and the model is now running for a month again and crashing with the same BICGSTAB error.

I think I am going to start again with a working suite on Monsoon and port that to Archer. I cannot find what is wrong with this suite, having rolled back to the first version that was meant to be working.

