Opened 3 years ago

Closed 2 years ago

#2624 closed help (answered)

Archer run failing

Reported by: admg26
Owned by: um_support
Component: UKCA
Keywords: BiCGstab
Cc:
Platform: ARCHER
UM Version: 10.9



This is a run that was ported from Monsoon to Archer and was running fine. See ticket

Platform: Archer
Username: admg26
Suite: u-ba626 on Archer (Similar to suite u-az337 on Monsoon that is working okay)
Code: um.x_br/dev/alisonming/vn10.9_volcanic_emissions

Error message after 3 months:

???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 1
? Error from routine: EG_BICGSTAB
? Error message: NaNs in error term in BiCGstab after 1 iterations
? This is a common point for the model to fail if it
? has ingested or developed NaNs or infinities
? elsewhere in the code.
? See the following URL for more information:
? Error from processor: 19
? Error number: 43

Many thanks,

Change History (11)

comment:1 Changed 3 years ago by willie

Hi Alison,

It's failing with the error above at the cycle 19881101T0000Z after running for 421 time steps (just over five days). The URL suggests setting the UM print status to "Extra diagnostic messages" in um → Run time controls → Atmosphere only and repeating the run from this cycle. This will provide further information about the problem.
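When the same failure has to be compared across several reruns, it can help to pull the fields out of the error banner programmatically. The following is only a minimal sketch, assuming the banner lines keep the `? key: value` shape shown in the report; `parse_um_banner` and `SAMPLE` are hypothetical names for illustration and are not part of the UM or Rose tooling.

```python
import re

# Hypothetical sample, abridged from the banner quoted in this ticket.
SAMPLE = """\
? Error code: 1
? Error from routine: EG_BICGSTAB
? This is a common point for the model to fail if it
? Error from processor: 19
? Error number: 43
"""

# Accept only keys that start with "Error" or "Warning", so that
# free-text continuation lines in the banner are skipped.
_FIELD = re.compile(r"^\?\s+((?:Error|Warning)[A-Za-z ]*):\s*(.*)$")


def parse_um_banner(text):
    """Return a dict of the 'key: value' fields found in a banner."""
    fields = {}
    for line in text.splitlines():
        match = _FIELD.match(line)
        if match:
            fields[match.group(1).strip()] = match.group(2).strip()
    return fields


if __name__ == "__main__":
    for key, value in parse_um_banner(SAMPLE).items():
        print(f"{key} = {value}")
```

Keeping the result as a plain dictionary makes it easy to diff the routine, processor and error number between runs.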

Also, in your suite the UM metadata (under um) is wrong: it is set to um-atmos/HEAD but should be um-atmos/vn10.9.


comment:2 Changed 3 years ago by admg26


I have changed the UM metadata and rerun the model with Extra diagnostic messages.

Many thanks,


comment:3 Changed 2 years ago by willie

  • Keywords BiCGstab added; Archer removed

Hi Alison,

I took a copy of your job, removed one STASH duplicate and re-indexed and then switched to daily dumps. All the dumps are NaN free. This fails with the BiCGstab error but a bit earlier than yours, in 19881001.

I then switched from 72 time steps per day to 144. This also fails at the same point with the BiCGstab error.

I'll try increasing the number of processors …


comment:4 Changed 2 years ago by admg26


Thanks! I am away next week so no rush. I was meaning to make a copy of the suite from an earlier point (because it was running at some point in the past after I ported it from MonSooN) and then see which change broke the run.


comment:5 Changed 2 years ago by willie

Hi Alison,

I tried 24x28 processors and that failed too. The fail point (time step) varies slightly between the runs. For the last three runs:

Atm_Step: Timestep     3616   Model time:   1988-10-21 05:20:00
Atm_Step: Timestep     7345   Model time:   1988-10-22 00:10:00
Atm_Step: Timestep     3673   Model time:   1988-10-22 00:20:00

This is a bit odd given that I'm only making small changes.
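Because the runs use different step lengths, the raw time step numbers are easier to compare once converted to model time. The sketch below assumes a 360-day calendar, a run start of 1988-09-01 00:00, and 72, 144 and 72 time steps per day for the three runs respectively; none of these is stated explicitly in the log lines, but they are consistent with the model times printed above. The function name `model_time` is hypothetical.

```python
def model_time(step, steps_per_day, start=(1988, 9, 1)):
    """Convert a time step count to (year, month, day, hour, minute).

    Assumes a 360-day calendar (12 months of 30 days) and a run that
    starts at 00:00 on the `start` date; both are assumptions.
    """
    seconds_per_step = 86400 // steps_per_day
    days, rem = divmod(step * seconds_per_step, 86400)
    hour, rem = divmod(rem, 3600)
    minute = rem // 60

    # Work in whole days since year 0 of the 360-day calendar.
    year0, month0, day0 = start
    day_index = year0 * 360 + (month0 - 1) * 30 + (day0 - 1) + days
    year, rest = divmod(day_index, 360)
    month, day = divmod(rest, 30)
    return year, month + 1, day + 1, hour, minute


if __name__ == "__main__":
    # (time step, assumed steps per day) for the three failures.
    for step, per_day in [(3616, 72), (7345, 144), (3673, 72)]:
        print(step, per_day, model_time(step, per_day))
```

Under these assumptions the three failures land within about 19 hours of model time of each other, even though the step counts differ by a factor of two.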

You're also getting

?  Warning code: -10
?  Warning from routine: ANCIL_CHECK_GRID_STAGGER
?  Warning message: Ancil file mismatch in fixed header(9) grid stagger value
?          Model grid stagger = 6
?          Ancil file grid stagger = 3
?          Ancil file path = /work/n02/n02/ukca/ancil/n96e/sstice/sice_clim_1996-2005_360d.n96e
?          PLEASE READ - this warning will be converted to an error
?          in future. Please update ancil file to specify the correct
?          grid stagger value.

on the SST/ICE ancillary - I don't know if that's important.


comment:6 Changed 2 years ago by willie

Hi Alison,

Is this still an issue?


comment:7 Changed 2 years ago by admg26


Yes, it is. I have been trying various things, including rolling back to earlier versions. Now I have another error message that I am trying to sort out:

sys-2 : UNRECOVERABLE error on system request 
  No such file or directory

Encountered during an OPEN of unit 14
Fortran unit 14 is not connected
Application 32743620 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 0 starting:
ATP Stack walkback for Rank 0 done
Process died with signal 6: 'Aborted'
Forcing core dump of rank 0
View application merged backtrace tree with: stat-view
You may need to: module load stat

I am away for the next 2.5 weeks. Could the ticket remain open please?


comment:8 Changed 2 years ago by willie

Hi Alison,

I'll keep the ticket open. It looks like you are missing an aerosol file needed by UKCA.


comment:9 Changed 2 years ago by admg26


Thanks! I will check all the input files.


comment:10 Changed 2 years ago by admg26


I found my missing file, and the model is now running for a month again before crashing with the same BiCGstab error.

I think I am going to start again with a working suite on Monsoon and port that to Archer. I cannot find what is wrong with this suite, having rolled back to the first version that was meant to be working.


comment:11 Changed 2 years ago by willie

  • Resolution set to answered
  • Status changed from new to closed

Hi Alison,

I'll close this ticket now. If you are still having problems with the new suite you can create a new ticket then.

