Opened 2 years ago

Closed 2 years ago

#2272 closed help (fixed)

Restarting Rose suites after PUMA ssh failure

Reported by: pmcjs
Owned by: willie
Component: Rose/Cylc
Keywords: ssh-agent
Cc:
Platform: ARCHER
UM Version: 10.7

Description

Hi CMS,

Overnight my ssh-agent connection from PUMA to ARCHER stopped working, and I had to restart the agent this morning (by removing ~/.ssh/environment.puma, logging out and back in, and running ssh-add). PUMA now connects to ARCHER fine.
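
One way to confirm that the restarted agent is usable before touching any suites is to check the exit code of `ssh-add -l`. The helper below is a minimal sketch, not a CMS-documented procedure: `agent_status` is a hypothetical function name, and it relies only on the documented exit codes of `ssh-add -l` (0 when keys are loaded, 1 when the agent holds no keys, 2 when no agent can be contacted).

```shell
# Hypothetical helper: report the state of the ssh-agent for this login session.
agent_status() {
    rc=0
    # Capture the exit code without letting a failure abort the calling shell.
    ssh-add -l >/dev/null 2>&1 || rc=$?
    case "$rc" in
        0) echo "agent running with keys loaded" ;;
        1) echo "agent running but no keys loaded" ;;
        *) echo "no agent reachable" ;;
    esac
}
```

After removing ~/.ssh/environment.puma, logging back in and running ssh-add as described above, calling `agent_status` should report that keys are loaded; any other answer suggests the agent still needs attention.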

I had jobs running on ARCHER, which appear to have completed to the end of the cycle point without being resubmitted. I followed the instructions on this page: http://cms.ncas.ac.uk/wiki/Puma/ThingsToDoAfterReboot (point 2), but the suite is failing to shut down and restart. Here is my command-line sequence from PUMA:

pmcjs@puma:/home/pmcjs> cd roses/u-aq138
pmcjs@puma:/home/pmcjs/roses/u-aq138> gcylc
pmcjs@puma:/home/pmcjs/roses/u-aq138> rose suite-run --restart
[FAIL] Suite "u-aq138" has running processes on: puma.nerc.ac.uk
[FAIL] Try "rose suite-shutdown --name=u-aq138" first?
pmcjs@puma:/home/pmcjs/roses/u-aq138> rose suite-shutdown --name=u-aq138
Really shutdown u-aq138 at puma.nerc.ac.uk? [y or n (default)] y
pmcjs@puma:/home/pmcjs/roses/u-aq138> rose suite-run --restart
[FAIL] Suite "u-aq138" has running processes on: puma.nerc.ac.uk
[FAIL] Try "rose suite-shutdown --name=u-aq138" first?
pmcjs@puma:/home/pmcjs/roses/u-aq138> ls /home/pmcjs/.cylc/ports/
u-aq138  u-aq174  u-aq175  u-aq176  u-aq177  u-aq178
pmcjs@puma:/home/pmcjs/roses/u-aq138> less /home/pmcjs/.cylc/ports/u-aq138
pmcjs@puma:/home/pmcjs/roses/u-aq138> rm /home/pmcjs/.cylc/ports/u-aq138
pmcjs@puma:/home/pmcjs/roses/u-aq138> rose suite-run --restart
[FAIL] Suite "u-aq138" has running processes on: localhost
[FAIL] Try "rose suite-shutdown --name=u-aq138" first?
pmcjs@puma:/home/pmcjs/roses/u-aq138> rose suite-shutdown --name=u-aq138
Really shutdown u-aq138 at localhost? [y or n (default)] y
"Port file '/home/pmcjs/.cylc/ports/u-aq138' not found - suite not running?."
[FAIL] cylc shutdown u-aq138 --force --host=localhost # return-code=1
pmcjs@puma:/home/pmcjs/roses/u-aq138> rose suite-run --restart
[FAIL] Suite "u-aq138" has running processes on: localhost
[FAIL] Try "rose suite-shutdown --name=u-aq138" first?

I know I could manually shut down the jobs from gcylc and restart them in Rose from an intermediate start dump, but I have 6 such jobs, so I wonder if there's an easier way?

Thanks,
Chris
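
For the repetition across several suites, the shutdown and restart commands from the session above can be wrapped in a loop. This is a minimal sketch under stated assumptions: `restart_suites` and the `RUN` variable are hypothetical names, and each suite is assumed to live under ~/roses/<name>, as u-aq138 does. By default the function only prints the commands so they can be reviewed first. It saves typing only: if `rose suite-shutdown` fails as it did above, the restart step will still report running processes.

```shell
# Hypothetical helper: print (or, with RUN=1, execute) the shutdown/restart
# sequence for each suite name given as an argument.
restart_suites() {
    for suite in "$@"; do
        if [ "${RUN:-0}" = "1" ]; then
            # Same commands as in the session above; shutdown still prompts for confirmation.
            rose suite-shutdown --name="$suite"
            (cd "$HOME/roses/$suite" && rose suite-run --restart)
        else
            # Dry run: show what would be executed.
            echo "rose suite-shutdown --name=$suite"
            echo "(cd ~/roses/$suite && rose suite-run --restart)"
        fi
    done
}
```

For example, `restart_suites u-aq138 u-aq174` lists the four commands for those two suites, and `RUN=1 restart_suites u-aq138 u-aq174` runs them.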

Change History (6)

comment:1 Changed 2 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Chris,

PUMA is still up and running, so there is no need to follow the reboot procedure. All that's happened is that your SSH agent died, and you've correctly restarted it. You could leave the jobs to complete naturally if you wish.

Regards
Willie

comment:2 Changed 2 years ago by pmcjs

Hi Willie,

Thanks - but the jobs aren't being resubmitted: they're not showing as queued or running on ARCHER. I am using monthly CRUNs for my jobs with a wallclock time of 1 hour. The gcylc screen for u-aq174 says the job is still running, but it started at 2:32am, so it has obviously lost communication with ARCHER.

I will stop and manually restart these jobs from the latest start dump, but this is something to note for cases when ssh-agent crashes.

Cheers,
Chris

comment:3 Changed 2 years ago by grenville

Chris

Before doing anything drastic - have you tried simply restarting gcylc? Your ssh-agent has no bearing on how gcylc communicates.

Grenville

comment:4 Changed 2 years ago by pmcjs

Hi Grenville,

Yes, I did - the tasks are still showing as they did before (i.e. as running when they aren't, in the case of u-aq174). What I suspect has happened is that when the ssh-agent died, the CRUNs could not be sent from PUMA to ARCHER and were therefore never submitted. On the return leg, the run status information has not been updated because PUMA couldn't communicate with ARCHER through the failed ssh-agent, so the tasks in gcylc still show the status they had when the agent died.

Cheers,
Chris

comment:5 Changed 2 years ago by pmcjs

I just restarted my runs manually in the end. You can close the ticket now.

comment:6 Changed 2 years ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed