Opened 4 months ago

Closed 2 weeks ago

Last modified 9 days ago

#3251 closed help (answered)

Restarting suite

Reported by: wcollins
Owned by: um_support
Component: Coupled model
Keywords: restart
Cc:
Platform: Monsoon2
UM Version: 11.2

Description

I have a suite u-bs981 that fails in "coupled". I've tried "Trigger" on the task, but it then fails in the same place.
If I stop the job (stop after killing) and then run rose suite-run --restart, it takes me straight to the failed state without trying to start the task again.
Any suggestions on how to get it restarted would be welcome.

Rank 655 [Fri Apr 24 18:59:32 2020] [c10-2c1s14n2] application called MPI_Abort(comm=0xC4000003, 1) - process 655
Application 103371920 is crashing. ATP analysis proceeding…

atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 07418] [c10-2c1s14n2] [Fri Apr 24 19:03:33 2020] PE RANK 648 exit signal Aborted
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
[NID 07418] 2020-04-24 19:03:33 Apid 103371920: initiated application termination
[FAIL] run_model # return-code=137
2020-04-24T19:03:53Z CRITICAL - failed/EXIT

Change History (5)

comment:1 Changed 3 months ago by grenville

The ocean model has failed with:

===>>> : E R R O R

===========

stpctl: the zonal velocity is larger than 20 m/s

This is a common error trap for NEMO. Did you change a working configuration?

Grenville

comment:2 Changed 3 months ago by wcollins

Hi Grenville,
Thanks, I hadn't spotted the ocean error message. It isn't in any of the log files. How do I find it?

I haven't changed anything. This is 32 years into a 40-year run, and it just suddenly stopped. The parallel run with changed methane (u-bs983), which I started at the same time as u-bs981, sailed happily through this point and finished all 40 years.

Any ideas on how to restart? It doesn't have to be bit comparable with anything. In the old days I would change the number of convection calls to get round a grid point storm.
Thanks,
Bill.

comment:3 Changed 3 months ago by grenville

Bill

Sorry for the delay. NEMO output is in /home/d02/wcolli/cylc-run/u-bs981/work/20461001T0000Z/coupled/ocean.output
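To pull the error out of ocean.output without paging through the whole file, something like the following rough sketch may help. It assumes NEMO marks errors with an "E R R O R" banner, as in the message quoted in comment:1; the helper name find_nemo_errors and the number of context lines are illustrative choices, not part of NEMO or Rose.

```python
from pathlib import Path
import sys


def find_nemo_errors(path, context=4):
    """Return (line_number, lines) for each 'E R R O R' banner in the file.

    Hypothetical helper: 'context' is how many lines after the banner to keep,
    so the message that follows the banner is captured as well.
    """
    lines = Path(path).read_text(errors="replace").splitlines()
    hits = []
    for i, line in enumerate(lines):
        if "E R R O R" in line:
            # Line numbers are reported 1-based, as an editor would show them.
            hits.append((i + 1, lines[i:i + 1 + context]))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_nemo_errors.py <path to ocean.output>
    for lineno, block in find_nemo_errors(sys.argv[1]):
        print(f"--- line {lineno} ---")
        print("\n".join(block))
```

Run it with the path to the ocean.output file in the failed cycle's work directory as the only argument.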

You could try changing the number of convection calls: make the change in the suite and save it. Then, at the Monsoon command line, run

rose suite-run --restart

and retrigger the failed task. If that fails, you might perturb the atmosphere start file.

Grenville

comment:4 Changed 2 weeks ago by grenville

  • Resolution set to answered
  • Status changed from new to closed

Bill

I'm assuming this was fixed.

Grenville

comment:5 Changed 9 days ago by wcollins

Yes, thank you. Changing the number of convection calls worked.
Bill
