Opened 3 months ago

Closed 3 months ago

#2258 closed help (wontfix)

N216 job crashes when IAU is on

Reported by: sam89
Owned by: willie
Priority: normal
Component: UM Model
Keywords:
Cc:
Platform: Monsoon2
UM Version: 8.2

Description

Created on behalf of Sam Clark:

I have taken a standard N216 forecast job (xnpec) and switched IAU on. It now crashes with the following message:

Rank 32 [Thu Aug 31 07:08:27 2017] [c1-0c2s0n1] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 32
Application 1214082 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 32 starting:
  start_thread@pthread_create.c:301
  _new_slave_entry@0x15dee0b
  coex2__cray$mt$p0001@coex2.f90:126
  ereport64$ereport_mod_@ereport_mod.f90:102
  gc_abort_@gc_abort.F90:136
  mpl_abort_@mpl_abort.F90:46
  pmpi_abort@0x15ebafc
  MPI_Abort@0x16157a4
  MPID_Abort@0x163f8b1
  abort@abort.c:92
  raise@pt-raise.c:42
ATP Stack walkback for Rank 32 done
Process died with signal 6: 'Aborted'

Run id is xnpic
Leave file is /home/d02/saclar/output/xnpic000.xnpic.d17243.t070618.leave

Change History (3)

comment:1 Changed 3 months ago by sam89

Hi Willie

Thanks for creating this ticket. I've been trying to investigate this issue. I searched previous tickets and saw that it can be related to the packing of the STASH, so I switched it all to unpacked, but this has not fixed the issue. I believe it can also be related to NaNs in the data. I looked at the .astart file and cannot see a problem with it, and the data in the IAU file I am using seems fine from what I can see, so it must be that including the IAU file is causing output with strange data for some reason.
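As a rough illustration of the kind of NaN check mentioned above, here is a minimal Python sketch. It assumes the field values have already been read into a flat sequence of floats; reading the .astart or IAU file itself is not shown, and the function name and sample values are hypothetical.

```python
import math

def summarise_field(values):
    """Count NaN and infinite entries and report the finite min/max."""
    n_nan = 0
    n_inf = 0
    finite_min = None
    finite_max = None
    for v in values:
        if math.isnan(v):
            n_nan += 1
        elif math.isinf(v):
            n_inf += 1
        else:
            finite_min = v if finite_min is None else min(finite_min, v)
            finite_max = v if finite_max is None else max(finite_max, v)
    return {"nan": n_nan, "inf": n_inf, "min": finite_min, "max": finite_max}

# Hypothetical U-wind values; a real check would use data read from the file.
sample = [3.2, -7.5, float("nan"), 12.0, float("inf")]
report = summarise_field(sample)
print(report)
```

A field with no NaN or infinite entries but an implausible min or max would also be worth a closer look.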

Since you created this ticket, all I have done is delete all the STASH, so the job now just outputs U and V unpacked, and I also dump out at 3 hrs unpacked. I also ran it without the IAU file and it runs fine, so it is definitely an issue with the IAU file, as you have identified.

I am wondering if it is an issue with using a Global start dump when the IAU file is at N216 resolution. I am unsure whether this would cause issues, though, since we reconfigured the Global start dump to N216 resolution before running the job with the IAU file included.
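One way to rule out a resolution mismatch is to compare the grid dimensions of the reconfigured dump and the IAU file. The sketch below assumes the common convention that an N-number global grid has 2N columns and 1.5N+1 rows (New Dynamics) or 1.5N rows (ENDGame). The function names are hypothetical, and the shapes would have to be taken from the real file headers, which is not shown here.

```python
def expected_grid(n_number, endgame=False):
    """Return (rows, columns) for a global N-number grid.

    Assumes 2N columns and 1.5N+1 rows (New Dynamics) or 1.5N rows (ENDGame).
    """
    columns = 2 * n_number
    rows = (3 * n_number) // 2 + (0 if endgame else 1)
    return rows, columns

def grids_match(shape_a, shape_b):
    """True if two (rows, columns) shapes are identical."""
    return tuple(shape_a) == tuple(shape_b)

# Hypothetical shapes; real values would come from the dump and IAU headers.
dump_shape = expected_grid(216)
iau_shape = (325, 432)
print(dump_shape, grids_match(dump_shape, iau_shape))
```

If the two shapes differ, the increments and the start dump are not on the same grid, whatever the job settings say.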

Since this is an ensemble perturbation IAU file and there are 24 members, I tried another member's file. The same problem arose, so I don't think it's an issue with the individual file itself.

Again, though, the pe_output files are not very helpful.

Thanks again for your help,

Sam

comment:2 Changed 3 months ago by willie

Hi Sam,

I've run various experiments (xnpx) to try to get to the bottom of this. If IAU is switched off, the job is successful. When it is on, the job crashes because the model becomes unstable at time step 3, and this causes the WGDOS packing error (you need to set "flush buffer if run fails" in Section 13 to see this). I then experimented with faster time steps, including speeding up the radiation time steps. This had no real effect on the problem. Sometimes it reports "mid conv went to the top of the model", but this is fundamentally caused by the instability at time step 3. Adding IAU diagnostics gave no further clue, but I am not very familiar with these.
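To pin down the time step at which a run goes unstable, one option is to track a per-step diagnostic such as the maximum absolute wind speed. The sketch below is only an illustration: it assumes such a list of per-step maxima is available, and the function name, the 300.0 limit and the growth factor are hypothetical choices, not UM settings.

```python
import math

def first_unstable_step(max_abs_by_step, limit, growth_factor=10.0):
    """Return the 1-based step where the diagnostic first looks unstable.

    A step is flagged if the value is non-finite, exceeds `limit`, or has
    grown by more than `growth_factor` since the previous step.
    Returns None if no step is flagged.
    """
    previous = None
    for step, value in enumerate(max_abs_by_step, start=1):
        if not math.isfinite(value) or value > limit:
            return step
        if previous is not None and previous > 0 and value / previous > growth_factor:
            return step
        previous = value
    return None

# Hypothetical per-step maximum wind speeds in m/s.
winds = [85.0, 92.0, 4.1e3, float("inf")]
print(first_unstable_step(winds, limit=300.0))
```

With these made-up values the third step is the first one flagged, because 4.1e3 exceeds the limit.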

Have you had similar jobs work in the past? This seems similar to the unresolved #2234. If you can find a similar working job, then perhaps the same method would work here. Perhaps Adam Clayton at the Met Office could provide further advice, now that an N216 job is being used. I am out of ideas on this, but it would be great to know what the answer is.

Regards
Willie

comment:3 Changed 3 months ago by willie

  • Resolution set to wontfix
  • Status changed from new to closed