Opened 2 months ago

Last modified 3 days ago

#2624 new help

Archer run failing

Reported by: admg26
Owned by: um_support
Priority: normal
Component: UKCA
Keywords: BiCGstab
Cc:
Platform: ARCHER
UM Version: 10.9

Description

Hello,

This is a run that was ported from Monsoon to Archer and was running fine. See ticket http://cms.ncas.ac.uk/ticket/2460.

Platform: Archer
Username: admg26
Suite: u-ba626 on Archer (Similar to suite u-az337 on Monsoon that is working okay)
Code: um.x_br/dev/alisonming/vn10.9_volcanic_emissions

Error message after 3 months:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 1
? Error from routine: EG_BICGSTAB
? Error message: NaNs in error term in BiCGstab after 1 iterations
? This is a common point for the model to fail if it
? has ingested or developed NaNs or infinities
? elsewhere in the code.
? See the following URL for more information:
? https://code.metoffice.gov.uk/trac/um/wiki/KnownUMFailurePoints
? Error from processor: 19
? Error number: 43
????????????????????????????????????????????????????????????????????????????????

Many thanks,
Alison

Change History (10)

comment:1 Changed 2 months ago by willie

Hi Alison,

It's failing with the error above at the cycle 19881101T0000Z after running for 421 time steps (just over five days). The URL suggests setting the UM print status to "Extra diagnostic messages" in um → Run time controls → Atmosphere only and repeating the run from this cycle. This will provide further information about the problem.

Also, in your suite, the UM metadata (under um) is wrong: it is set to um-atmos/HEAD. It should be um-atmos/vn10.9.

Regards,
Willie

comment:2 Changed 2 months ago by admg26

Hello,

I have changed the UM metadata and rerun the model with Extra diagnostic messages.

Many thanks,
Alison

Last edited 2 months ago by admg26

comment:3 Changed 7 weeks ago by willie

  • Keywords BiCGstab added; Archer removed

Hi Alison,

I took a copy of your job, removed one STASH duplicate, re-indexed, and then switched to daily dumps. All the dumps are NaN free. The run still fails with the BiCGstab error, but a bit earlier than yours, in the 19881001 cycle.

I then switched from 72 time steps per day to 144. This also fails at the same point with the BiCGstab error.

I'll try increasing the number of processors …

Regards
Willie

comment:4 Changed 7 weeks ago by admg26

Hi,

Thanks! I am away next week so no rush. I was meaning to make a copy of the suite from an earlier point (because it was running at some point in the past after I ported it from MonSooN) and then see which change broke the run.

Cheers,
Alison

comment:5 Changed 7 weeks ago by willie

Hi Alison,

I tried 24x28 processors and that failed too. The fail point (time step) varies slightly between the runs. For the last three runs:

Atm_Step: Timestep     3616   Model time:   1988-10-21 05:20:00
Atm_Step: Timestep     7345   Model time:   1988-10-22 00:10:00
Atm_Step: Timestep     3673   Model time:   1988-10-22 00:20:00

This is a bit odd given that I'm only making small changes.
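
One way to compare these fail points is to convert each timestep count into elapsed model time. The following is a minimal sketch rather than anything taken from the suite: the helper `elapsed` is hypothetical, and the steps-per-day value paired with each run (72, 144, 72) is an assumption inferred from the logged model times.

```python
from datetime import timedelta

SECONDS_PER_DAY = 86400


def elapsed(timestep: int, steps_per_day: int) -> timedelta:
    """Return the model time covered by `timestep` equal-length steps."""
    if SECONDS_PER_DAY % steps_per_day != 0:
        raise ValueError("steps_per_day must divide 86400 exactly")
    return timedelta(seconds=timestep * (SECONDS_PER_DAY // steps_per_day))


# (timestep, assumed steps per day) for the three Atm_Step lines above.
runs = [(3616, 72), (7345, 144), (3673, 72)]
times = [elapsed(step, per_day) for step, per_day in runs]

for (step, per_day), t in zip(runs, times):
    print(f"timestep {step:5d} at {per_day:3d} steps/day -> {t}")

# Spread between the earliest and latest fail point.
print("spread:", max(times) - min(times))
```

Under these assumptions the three runs fail after 50 days 5 h 20 min, 51 days 10 min and 51 days 20 min of model time, a spread of 19 hours. That would match the logged model times if the run started at 1988-09-01 00:00.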

You're also getting

?  Warning code: -10
?  Warning from routine: ANCIL_CHECK_GRID_STAGGER
?  Warning message: Ancil file mismatch in fixed header(9) grid stagger value
?          Model grid stagger = 6
?          Ancil file grid stagger = 3
?          Ancil file path = /work/n02/n02/ukca/ancil/n96e/sstice/sice_clim_1996-2005_360d.n96e
?          PLEASE READ - this warning will be converted to an error
?          in future. Please update ancil file to specify the correct
?          grid stagger value.

on the SST/ICE ancillary - I don't know if that's important.
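
To see which ancillary files carry the older stagger value, the fixed header can be inspected directly. This is a rough sketch under stated assumptions: it assumes the file begins with a 256-word fixed header of 64-bit big-endian integers, and it takes word 9 (1-based) as the grid stagger value because that is the position quoted in the warning. The function name `read_grid_stagger` is hypothetical.

```python
import struct

FIXED_HEADER_WORDS = 256  # assumed length of the fixed header, in words
GRID_STAGGER_INDEX = 9    # 1-based position quoted in the warning


def read_grid_stagger(path: str) -> int:
    """Return word 9 of the fixed header, assuming 64-bit big-endian integers."""
    with open(path, "rb") as handle:
        raw = handle.read(FIXED_HEADER_WORDS * 8)
    if len(raw) < FIXED_HEADER_WORDS * 8:
        raise ValueError(f"{path}: file too short to hold a fixed header")
    header = struct.unpack(f">{FIXED_HEADER_WORDS}q", raw)
    return header[GRID_STAGGER_INDEX - 1]
```

Calling `read_grid_stagger` on each ancillary path and comparing the result with the model value (6 in the warning above) would list the files that still need updating. If the byte order or header length assumption does not hold for a given file, the returned number will not be meaningful.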

Regards,
Willie

comment:6 Changed 4 weeks ago by willie

Hi Alison,

Is this still an issue?

Willie

comment:7 Changed 4 weeks ago by admg26

Hi,

Yes it is. I have been trying various things including rolling back to earlier versions. Now I have another error message I am trying to sort out.

sys-2 : UNRECOVERABLE error on system request 
  No such file or directory

Encountered during an OPEN of unit 14
Fortran unit 14 is not connected
Application 32743620 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 0 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  main@um_main.F90:20
  main@um_main.F90:20
  um_shell_@um_shell.F90:652
  u_model_4a_@u_model_4A.F90:370
  atm_step_4a_@atm_step_4A.F90:5496
  ukca_main1$ukca_main1_mod_@ukca_main1-ukca_main1.F90:1179
  ukca_read_aerosol$ukca_read_aerosol_mod_@ukca_read_aerosol.F90:201
  _OPEN@0x2cfb52e
  __OPN@0x2cfb390
  _f_open@0x2ccd25b
  _ferr@0x2d040fb
  abort@abort.c:92
  raise@pt-raise.c:42
ATP Stack walkback for Rank 0 done
Process died with signal 6: 'Aborted'
Forcing core dump of rank 0
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

I am away for the next 2.5 weeks. Could the ticket remain open please?

Cheers,
Alison

comment:8 Changed 4 weeks ago by willie

Hi Alison,

I'll keep the ticket open. It looks like you are missing an aerosol file needed by UKCA.

Willie

comment:9 Changed 4 weeks ago by admg26

Hi,

Thanks! I will check all the input files.
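
A quick way to check a list of input files before resubmitting is sketched below. It is only an illustration: `missing_files` is a hypothetical helper, and the paths in the example are placeholders, not files from this suite.

```python
from pathlib import Path
from typing import Iterable, List


def missing_files(paths: Iterable[str]) -> List[str]:
    """Return the paths that do not point at an existing regular file."""
    return [p for p in paths if not Path(p).is_file()]


if __name__ == "__main__":
    # Hypothetical placeholders; replace with the suite's real input paths.
    candidates = ["/path/to/aerosol_input", "/path/to/another_input"]
    for path in missing_files(candidates):
        print("missing:", path)
```

Running this on the login node only confirms that the files exist there; it does not check that they are readable from the compute nodes or that their contents are valid.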

A

comment:10 Changed 3 days ago by admg26

Hi,

I found my missing file. The model now runs for a month again and then crashes with the same BiCGstab error.

I think I am going to start again with a working suite on Monsoon and port that to Archer. I cannot find what is wrong with this suite, having rolled back to the first version that was meant to be working.

Cheers,
Alison
