Opened 6 months ago

Closed 6 months ago

#3352 closed help (fixed)

Technical failure possibly?

Reported by: charlie
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: NEXCS
UM Version: 10.7

Description

Hi,

Sorry to bother you, but one of my suites (bv963) failed over the weekend, giving me the error below. I am fairly certain (at least I hope) that this is a technical, rather than scientific, failure, because this particular suite has so far run for well over 100 years without any problems.

_pmiu_daemon(SIGCHLD): [NID 07624] [c11-2c2s2n0] [Mon Aug 24 12:11:58 2020] PE RANK 1009 exit signal Aborted
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
[NID 07624] 2020-08-24 12:11:58 Apid 113542160: initiated application termination
[FAIL] run_model # return-code=137
2020-08-24T12:12:08Z CRITICAL - failed/EXIT

What does this error mean?

Thanks,

Charlie
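
As a rough aid to reading the log above: by common shell convention, a return code above 128 means the process was ended by signal number (code minus 128). The sketch below decodes a code on that assumption. The function name `describe_return_code` is made up for illustration, and the mapping from number to name is the one used on Linux. For 137 this gives SIGKILL, which would fit the launcher killing the job after rank 1009 aborted. That reading is an interpretation, not something the log states, and it says nothing about the underlying cause.

```python
import signal


def describe_return_code(code: int) -> str:
    """Interpret a shell-style return code.

    Assumes the common convention that codes above 128 mean
    'terminated by signal (code - 128)'.
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_return_code(137))
```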

Change History (5)

comment:1 Changed 6 months ago by dcase

Charlie,

I think that your model has crashed. Look at the bottom of the ocean output file here: /home/d05/cwilliams/cylc-run/u-bv963/work/23950101T0000Z/coupled/ocean.output
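
A minimal sketch of one way to do that check from Python rather than by eye. It assumes that NEMO flags problems in ocean.output with the spaced-out marker `E R R O R`. That marker string and the helper name `tail_with_errors` are assumptions for illustration, so adjust them to whatever the file actually contains.

```python
from collections import deque
from pathlib import Path


def tail_with_errors(path, n_tail=40, marker="E R R O R"):
    """Return (last n_tail lines, 1-based line numbers containing marker)."""
    tail = deque(maxlen=n_tail)  # keeps only the most recent n_tail lines
    hits = []
    with open(path, "r", errors="replace") as fh:
        for lineno, line in enumerate(fh, start=1):
            tail.append(line.rstrip("\n"))
            if marker in line:
                hits.append(lineno)
    return list(tail), hits


# Path taken from the ticket; the block is skipped if the file is not present.
log = Path("/home/d05/cwilliams/cylc-run/u-bv963/work/23950101T0000Z/coupled/ocean.output")
if log.exists():
    last_lines, error_lines = tail_with_errors(log)
    print("marker found on lines:", error_lines)
    print("\n".join(last_lines))
```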

You can probably get it going with the usual methods (see here: https://code.metoffice.gov.uk/trac/moci/wiki/tips_CRgeneral).

Good luck,

Dave

comment:2 Changed 6 months ago by charlie

Hi Dave,

Sorry for the delay. I have now restarted it with the perturbation to the atmosphere, as you suggested, so hopefully it will get past the blow-up point. Rather than sticking with the same suite, I copied it to a new one (u-bx150) and restarted from the previous year, so as not to miss out on the output from the problematic year.

However, I'm still a bit unclear as to why this happened. As I said, this run had gone for over 100 years, 117 to be precise, before this blow-up, and had been perfectly stable and without any problems until this point. Why would it suddenly now experience a grid point storm, which is what the error message suggests? Moreover, according to the error message the zonal velocity is greater than 20 m/s at point 118,289,10 (i,j,k), but I have just rebuilt the latest NEMO restart dump (at /home/d05/cwilliams/cylc-run/u-bv963/share/data/History_Data/NEMOhist/bv963o_23950201_restart.nc) and this particular point looks fine. There are no points in the zonal velocity field in this dump with a value of 3.5340E+06, as the error message suggests. Am I looking at the wrong dump?
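
One thing worth ruling out in a check like this is an indexing mismatch. The error message gives (i,j,k) in Fortran style, counting from 1, whereas an array read from a netCDF file in Python is typically ordered (k,j,i) and counted from 0. The sketch below uses plain nested lists to stay self-contained. In practice the field would come from the restart file through a netCDF reader. The function names are hypothetical, and it is an assumption that the indices in the message are global rather than local to one processor's subdomain.

```python
def fortran_to_python_index(i, j, k):
    """Map a 1-based (i, j, k) point to a 0-based (k, j, i) array index."""
    return (k - 1, j - 1, i - 1)


def points_exceeding(field, threshold=20.0):
    """Scan field[k][j][i] and return 1-based (i, j, k, value) where |value| > threshold."""
    hits = []
    for k, level in enumerate(field):
        for j, row in enumerate(level):
            for i, value in enumerate(row):
                if abs(value) > threshold:
                    hits.append((i + 1, j + 1, k + 1, value))
    return hits


# The point quoted in the ticket, expressed as a Python index into a (k, j, i) array
print(fortran_to_python_index(118, 289, 10))

# Tiny made-up field (2 levels, 2 rows, 3 columns) with one runaway value
demo = [
    [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2]],
    [[0.1, 0.2, 3.534e6], [0.0, 0.1, 0.2]],
]
print(points_exceeding(demo))
```

If a scan of the whole field finds nothing above the threshold, the dump is likely to hold the state from before the model became unstable, not the state at the failing time step.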

Charlie

comment:3 Changed 6 months ago by dcase

To be honest, I don't know why the models crash most of the time. They are numerical models, with lots of coupled approximations, and finite time-steps, so they can run for a long time and then suddenly fall over. You see the same points of failure each time, but this is just the subroutine which notices the error, not the piece of the model which first became numerically unstable.

There are things you can do which may increase the stability (decreasing the time step, perhaps), or you can give the model a tiny tweak and restart (as you're doing), but crashing is a bit of an occupational hazard in numerical solutions, and I have little to offer beyond work-arounds.
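
To illustrate why a shorter time step can help, a simplified picture is the Courant number for explicit advection, C = |u| * dt / dx, which has to stay below a scheme-dependent limit (often of order 1). The sketch below is only that textbook relation. The grid spacing and time step are invented for illustration and are not taken from this suite, and a real coupled model has many other routes to instability.

```python
def courant_number(speed_m_s, dt_s, dx_m):
    """Courant number C = |u| * dt / dx for a single grid cell."""
    return abs(speed_m_s) * dt_s / dx_m


def max_stable_dt(speed_m_s, dx_m, c_max=1.0):
    """Largest time step (seconds) keeping C <= c_max at the given speed."""
    return c_max * dx_m / abs(speed_m_s)


# Hypothetical values: 100 km grid spacing, 45-minute time step
dx = 100_000.0
dt = 2700.0
for u in (2.0, 20.0):
    print(f"u = {u} m/s -> C = {courant_number(u, dt, dx):.3f}")

# Halving the time step halves the Courant number at the same speed
print(courant_number(20.0, dt / 2, dx))
```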

If your model were unsound it would've crashed straight away, so I wouldn't worry too much if I were you.

comment:4 Changed 6 months ago by charlie

Okay, many thanks, understood. The perturbation appears to have worked, and it has now got well past the blowup, so hopefully it won't happen again. Thanks for your help, I will close the ticket.

Charlie

comment:5 Changed 6 months ago by charlie

  • Resolution set to fixed
  • Status changed from new to closed