Opened 12 days ago

Last modified 3 days ago

#2401 new help

Possible compiler bug on Archer

Reported by: apm
Owned by: um_support
Priority: normal
Component: NEMO/CICE
Keywords:
Cc: acc@…
Platform: ARCHER
UM Version:


I have been struggling to complete my integrations of the GO6 version of NEMO under Rose/Cylc on Archer. I get intermittent and unpredictable failures at the start of the annual run cycle where the model simply stops during the first timestep.

I started adding write statements to the extracted code and finally narrowed it down to a particular NEMO source file, lbclnk.F90. To my surprise, once I had added enough write statements the code ran without a problem. Andrew Coward suggested that this might be because my write statements changed the optimisation level, and that this was likely to point to a bug in the compiler. In support of this, the same NEMO configuration runs perfectly on Monsoon.

Is there anything I can do about this? I am reluctant to downgrade the optimisation if it leads to a significant penalty in performance. Is there a more recent version of the compiler I could try? It's not obvious in the Rose job which compiler is actually used. I would guess that changing compiler might lead to inconsistencies elsewhere.



Change History (7)

comment:1 Changed 9 days ago by willie

Hi Alex,

What suite id and what computer are you running on?


comment:2 Changed 5 days ago by apm

Hi Willie,

Thanks for the reply. The suite id is u-ao882, and I am running on Archer from Rose on Puma.


comment:3 Changed 5 days ago by willie

  • Component changed from UM Model to NEMO/CICE
  • Platform set to ARCHER

Hi Alex,

You seem to be still working on this. If you can do a rebuild and run from fresh, this will get the essential files back to PUMA and we'll have a look.


comment:4 Changed 5 days ago by apm

Hi Willie,

Maybe u-ao882 isn't the best one to pass over - I have been progressively adding write statements to the extracted code for this suite until it runs reasonably reliably, but it seems to stop at one particular point during the cycle, possibly because of a different problem. If I rebuild it I will lose all my write statements.

I have made a copy of it, which is now called u-av351.



comment:5 Changed 3 days ago by willie

Hi Alex,

Could you run u-av351 please? I can't see your ARCHER files since you're in n01. This will at least get the files back to PUMA.

I was trying to get an idea of what the intermittent and unpredictable failures were. Are these still occurring? If so could you paste the error messages here.


comment:6 Changed 3 days ago by willie

  • UM Version <select version> deleted

comment:7 Changed 3 days ago by apm

Hi Willie,

I have just submitted u-av351: compilation is now complete and it is waiting to run.

The failure mode is for the job simply to stop. There is no error in the model output files, but there is an MPI_Abort error in the job.err file, such as the following:

Rank 0 [Tue Oct 31 20:23:38 2017] [c7-2c1s1n0] application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
_pmiu_daemon(SIGCHLD): [NID 04484] [c7-2c1s1n0] [Tue Oct 31 20:23:38 2017] PE RANK 0 exit signal Aborted
[NID 04484] 2017-10-31 20:23:38 Apid 28924547: initiated application termination
[FAIL] run_nemo_cice # return-code=137
Received signal ERR
cylc (scheduler - 2017-10-31T20:23:50Z): CRITICAL Task job script received signal ERR at 2017-10-31T20:23:50Z
cylc (scheduler - 2017-10-31T20:23:50Z): CRITICAL failed at 2017-10-31T20:23:50Z
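
When comparing several failed cycles it may help to pull the same few facts out of each job.err. The following is a minimal Python sketch, not part of the suite: the function name summarise_job_err is hypothetical, and the patterns are based only on the excerpt above. It relies on the usual shell convention that an exit status above 128 means 128 plus a signal number, so 137 would correspond to signal 9. That does not by itself say what sent the signal, and the log separately reports "exit signal Aborted" for rank 0.

```python
import re
import signal

# Lines copied from the job.err excerpt above.
JOB_ERR = """\
Rank 0 [Tue Oct 31 20:23:38 2017] [c7-2c1s1n0] application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
_pmiu_daemon(SIGCHLD): [NID 04484] [c7-2c1s1n0] [Tue Oct 31 20:23:38 2017] PE RANK 0 exit signal Aborted
[FAIL] run_nemo_cice # return-code=137
"""


def summarise_job_err(text):
    """Return (ranks that called MPI_Abort, return code, signal name or None)."""
    # Ranks that reported calling MPI_Abort, one "Rank N ..." line each.
    ranks = sorted({int(r) for r in re.findall(r"^Rank (\d+) .*MPI_Abort", text, re.M)})

    # Return code reported by the task wrapper, if present.
    match = re.search(r"return-code=(\d+)", text)
    code = int(match.group(1)) if match else None

    # Shell convention (assumed to apply here): status > 128 means 128 + signal.
    sig = None
    if code is not None and code > 128:
        try:
            sig = signal.Signals(code - 128).name
        except ValueError:
            sig = None  # not a signal number known on this platform
    return ranks, code, sig


if __name__ == "__main__":
    print(summarise_job_err(JOB_ERR))
```

Running the same function over the job.err of each failed cycle would show whether it is always the same rank and return code, which is the kind of pattern worth pasting into the ticket.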

