Opened 9 years ago

Closed 9 years ago

#776 closed help (fixed)

Phase 3 rerun of Phase 2b job now generates NaNs

Reported by: kjp
Owned by: willie
Component: UM Model
Keywords: compiler optimisation
Cc:
Platform:
UM Version: 6.1

Description

My job xghkw is a phase 3 rerun of a phase 2b job xghke: 4km LAM with PV tracers. While it ran fine on 2b, it now produces NaNs in both the tracer output fields and in variables that have no dependency on the tracer values. A 12km version of the same thing (xghkz) runs fine on phase 3, as does a 4km version without the tracer calculations (xghkx). I've tried versions with reduced timesteps, optimised 4km parameters, an increased number of cores, and with the tracers removed from the STASH output, but to no avail.

Change History (9)

comment:1 Changed 9 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Kevin,

This looks like a compiler optimisation problem. We are now using the Cray compiler rather than Pathscale. You need to add a compiler override file for the model: Sub-model Independent > Compilation and Modification > User defined compile option. The override file should contain one line:

@fort FCOM_OPTIM=-O0

This reduces the optimisation, making the model run more slowly. If this works, we could then look into narrowing down the decks that need lower optimisation.

Regards,

Willie

comment:2 Changed 9 years ago by kjp

This seems to have worked. xgvuw (4km with PV and no optimisation) now produces numerical output. I was rather concerned that the optimiser was fierce enough to cause this, so I reran a 12km standard job with no tracers. The outputs with and without the optimiser (xgvuy v xgvuv), while similar, do differ in detail, and I was wondering what implications that has for bit reproducibility.

comment:3 Changed 9 years ago by willie

  • Keywords compiler optimisation added

Hi Kevin,

The results are unlikely to bit compare between Phase 2b and Phase 3, since different computer chips and code are involved. On Phase 3, changing the optimisation level could also destroy bit reproducibility of the results, because different machine code is generated.

Regards,

Willie
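One general reason why a change of optimisation level can alter the low-order bits, as described in comment:3, is that floating-point addition is not associative, so an optimiser that regroups a sum can change the rounded result. The sketch below illustrates that effect in Python with made-up values; it is not a diagnosis of what the Cray compiler actually did to the UM code.

```python
# Floating-point addition is not associative: regrouping a sum,
# as an optimising compiler may do, can change the last bits.
# The values below are illustrative only, not taken from the UM.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c   # grouping as written
regrouped = a + (b + c)       # mathematically equal regrouping

# Compare the two results exactly and show their bit patterns.
print(left_to_right == regrouped)
print(left_to_right.hex(), regrouped.hex())
```

In a long model integration such last-bit differences can grow, which is consistent with two runs being "similar" but not bit identical.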

comment:4 Changed 9 years ago by kjp

I'm afraid I spoke too soon. Switching off the optimisation improves but doesn't solve the issue. xgvuw does produce output for the first couple of model hours now, but then generates NaNs in both the tracer fields and variables like u and v. The otherwise identical job without the tracer code (xgvuu) runs to completion at 3 days.
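To pin down when the output first goes bad, the values from each output time can be scanned for NaNs. The following is a minimal sketch, assuming the field values have already been read into plain Python lists, one list per time step; the function name first_nan_step and the demo data are hypothetical, and reading the actual UM output files is not shown.

```python
import math

def first_nan_step(fields_by_step):
    """Return the index of the first time step containing a NaN, or None."""
    for step, values in enumerate(fields_by_step):
        if any(math.isnan(v) for v in values):
            return step
    return None

# Hypothetical data: one short list of field values per time step.
demo = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.1],
    [1.2, float("nan"), 3.2],
    [float("nan"), float("nan"), float("nan")],
]

print(first_nan_step(demo))  # index of the first step with a NaN
```

Running the same scan separately on a tracer field and on a field such as u would show whether both go bad at the same step.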

comment:5 Changed 9 years ago by willie

Hi Kevin,

I see in the .leave file that at time step 25 the message "RHS zero so GCR( 2 ) not needed" appears. This indicates that the model is unstable. You could try halving the time step from 5 minutes to 2.5 minutes and running again (Atmosphere > Scientific Sections > Model time stepping).

Regards,

Willie

comment:6 Changed 9 years ago by kjp

I did try that, at Jeffrey's suggestion, in job xgvua and also with the optimised 4km settings in job xgvub before switching off the optimiser. As far as I understand it, the tracer code should just be calculating and storing PV from the model fields after various processes occur. The model fields themselves should be unaffected by the mod. This seems to be the case examining plots by eye before they go NaN. Since xgvuu runs, I can't see how xgvuw can fail unless there's a memory leak out of the PV arrays perhaps. I will try this again however and see what happens.

comment:7 Changed 9 years ago by kjp

No joy there I'm afraid. In fact, it fails earlier. Looking through the output files that are OK, the two cases produce similar but not identical numbers. Any differences are normally very small (5th SF), but the minimum x-wind at z=10, t=0.2083, for example, is -23.1742 v -23.3405. If the code is bit comparable with and without PV tracers, I can't see how that can happen. Is there a way to test for arrays written out of bounds with a debug option?
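To put the two x-wind minima quoted above on the same footing as the "5th SF" differences, the agreement can be expressed as a relative difference and a rough count of matching significant figures. This is a small sketch; the helper names relative_difference and agreeing_sig_figs are made up for illustration, and the figure count is only an order-of-magnitude estimate.

```python
import math

def relative_difference(a, b):
    """Return |a - b| scaled by the larger magnitude (0.0 if both are zero)."""
    scale = max(abs(a), abs(b))
    return 0.0 if scale == 0.0 else abs(a - b) / scale

def agreeing_sig_figs(a, b):
    """Rough count of leading significant figures on which a and b agree."""
    rel = relative_difference(a, b)
    if rel == 0.0:
        return math.inf  # identical values agree to all figures
    return max(0, math.floor(-math.log10(rel)))

# The two minimum x-wind values quoted in this comment.
rel = relative_difference(-23.1742, -23.3405)
figs = agreeing_sig_figs(-23.1742, -23.3405)
print(rel, figs)
```

For these two values the relative difference is about 0.7%, so they agree to roughly two significant figures rather than five.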

comment:8 Changed 9 years ago by willie

Hi Kevin,

The significant differences between xghkw and xghke are:

  • phase 2b vs phase 3
  • the start data and LBCs differ

To get the sharpest possible test, you could do a run on phase 3 with exactly the same data as for phase 2b.

As for array bounds checking, you could add "-R b" to the FCOM_OPTIM line in the override file, for example:

@fort FCOM_OPTIM=-O0 -R b

Regards,

Willie

comment:9 Changed 9 years ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed