Opened 6 months ago

Closed 4 months ago

#2401 closed help (answered)

Possible compiler bug on Archer

Reported by: apm
Owned by: willie
Priority: normal
Component: NEMO/CICE
Keywords:
Cc: acc@…
Platform: ARCHER
UM Version:

Description

I have been struggling to complete my integrations of the GO6 version of NEMO under Rose/Cylc on Archer. I get intermittent and unpredictable failures at the start of the annual run cycle where the model simply stops during the first timestep.

I started adding write statements to the extracted code and finally narrowed it down to a particular NEMO source file lbclnk.F90, and found, to my surprise, that once I had added enough write statements the code ran without a problem. Andrew Coward suggested that this might be because my write statements changed the optimisation level, and that this was likely to point to a bug in the compiler. In support of this, the same NEMO configuration runs perfectly on Monsoon.

Is there anything I can do about this? I am reluctant to downgrade the optimisation if it leads to a significant penalty in performance. Is there a more recent version of the compiler I could try? It's not obvious in the Rose job which compiler is actually used. I would guess that changing compiler might lead to inconsistencies elsewhere.

Thanks,

Alex

Change History (13)

comment:1 Changed 6 months ago by willie

Hi Alex,

What suite id and what computer are you running on?

Regards
Willie

comment:2 Changed 6 months ago by apm

Hi Willie,

Thanks for the reply. The suite id is u-ao882, and I am running on Archer from Rose on Puma.

Alex

comment:3 Changed 6 months ago by willie

  • Component changed from UM Model to NEMO/CICE
  • Platform set to ARCHER

Hi Alex,

You seem to be still working on this. If you can do a rebuild and run from fresh, this will get the essential files back to PUMA and we'll have a look.

Regards
Willie

comment:4 Changed 6 months ago by apm

Hi Willie,

Maybe u-ao882 isn't the best one to pass over - I have been progressively adding write statements to the extracted code for this suite until it runs reasonably reliably, but it seems to stop at one particular point during the cycle, possibly because of a different problem. If I rebuild it I will lose all my write statements.

I have made a copy of it, which is now called u-av351.

Regards,

Alex

comment:5 Changed 6 months ago by willie

Hi Alex,

Could you run u-av351 please? I can't see your ARCHER files since you're in n01. This will at least get the files back to PUMA.

I was trying to get an idea of what the intermittent and unpredictable failures were. Are these still occurring? If so could you paste the error messages here.

Regards
Willie

comment:6 Changed 6 months ago by willie

  • UM Version <select version> deleted

comment:7 Changed 6 months ago by apm

Hi Willie,

I have just submitted u-av351: compilation is now complete and it is waiting to run.

The failure mode is for the job simply to stop. There is no error in the model output files, and there is an MPI_Abort error in the job.err file such as the following:

Rank 0 [Tue Oct 31 20:23:38 2017] [c7-2c1s1n0] application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
_pmiu_daemon(SIGCHLD): [NID 04484] [c7-2c1s1n0] [Tue Oct 31 20:23:38 2017] PE RANK 0 exit signal Aborted
[NID 04484] 2017-10-31 20:23:38 Apid 28924547: initiated application termination
[FAIL] run_nemo_cice # return-code=137
Received signal ERR
cylc (scheduler - 2017-10-31T20:23:50Z): CRITICAL Task job script received signal ERR at 2017-10-31T20:23:50Z
cylc (scheduler - 2017-10-31T20:23:50Z): CRITICAL failed at 2017-10-31T20:23:50Z
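
One way to read the return-code=137 in the log above is through the common shell convention that an exit status above 128 means "128 plus a signal number". The following is a minimal sketch of that decoding in Python, assuming the convention applies to run_nemo_cice here (nothing in this ticket confirms that). The helper name decode_return_code is hypothetical, and the signal names are those reported by Python on a Linux system.

```python
import signal


def decode_return_code(code):
    """Return the signal name implied by a 128+N exit status, or None.

    Assumes the shell convention that a status above 128 means the
    process was terminated by signal number (status - 128).
    """
    if code > 128:
        try:
            return signal.Signals(code - 128).name
        except ValueError:
            # No signal with that number is known on this platform.
            return None
    return None


# The value reported by "[FAIL] run_nemo_cice # return-code=137".
print(decode_return_code(137))
```

Under that assumption, 137 corresponds to signal 9 (SIGKILL), which a process cannot catch or handle. On its own this does not say what sent the signal or why.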

Alex


comment:8 Changed 6 months ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Alex,

u-av351 appears to have completed 19580101T0000Z successfully. There were no compile problems and the tasks completed successfully. Is it still running?

Regards
Willie

comment:9 Changed 6 months ago by apm

Hi Willie,

Yes, it's now in its second year. The trouble with this bug is that you can't predict when it is going to hit!

Alex

comment:10 Changed 6 months ago by willie

Hi Alex,

Let's keep an eye on it. Do you know if the run time gets anywhere near the 24 hour queue limit? If it reaches that, it can guillotine the job and no error message appears.

Regards
Willie

comment:11 Changed 6 months ago by apm

No, it shouldn't get near 24 hours: right now it is doing about a month per hour, so I expect just over 12 hours per cycle.
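
The estimate above can be written out as a short calculation. This is only a sketch of the arithmetic, assuming a 12-month cycle and taking "about a month per hour" as exactly 1.0; the function and variable names are illustrative.

```python
QUEUE_LIMIT_HOURS = 24  # the queue limit mentioned in comment:10


def cycle_hours(months_per_cycle, months_per_hour):
    """Estimated wall-clock hours needed to run one cycle."""
    return months_per_cycle / months_per_hour


# Annual cycle at roughly one model month per wall-clock hour.
estimate = cycle_hours(months_per_cycle=12, months_per_hour=1.0)
headroom = QUEUE_LIMIT_HOURS - estimate

print(f"estimated hours per cycle: {estimate:.1f}")
print(f"headroom before queue limit: {headroom:.1f}")
```

At that rate the cycle would need to run at about half its current speed before reaching the limit.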

Alex

comment:12 Changed 4 months ago by willie

  • Status changed from accepted to new

Hi Alex,

I'm going to close this ticket now, since the bug hasn't re-appeared. If it does occur again, you can re-open the ticket or create a new one. Be sure not to modify the suite or output files in any way so we can get a more detailed view of the issue.

Regards
Willie

comment:13 Changed 4 months ago by willie

  • Resolution set to answered
  • Status changed from new to closed