Opened 6 months ago

Closed 6 months ago

#3359 closed help (answered)

UM runs stuck on ARCHER

Reported by: pmcguire
Owned by: um_support
Component: UM Model
Keywords: UM
Cc:
Platform: ARCHER
UM Version: 11.5

Description

Hi CMS Helpdesk
My UM runs seem to be stuck on ARCHER. I have 3 runs going.

For one of them, u-bw963, the Cylc GUI showed that my atmos_main run for 1992 had finished 2 days ago, but the icon was still green as if it were running. The log file said it had finished, so I manually changed the state from running to succeeded. That didn't help much: it tried to submit postproc, but the submit failed.

So I stopped the job, and then did a rose suite-run --restart, but I get an error:

[FAIL] ssh -oBatchMode=yes login7.archer.ac.uk bash --login -c \'ROSE_VERSION=2016.11.1\ rose\ suite-run\ -v\ -v\ --name=u-bw963\ --run=restart\ --remote=uuid=dc8f63a3-fe23-46ab-898d-4e76af47812d,root-dir=$DATADIR\' # return-code=255, stderr=

any suggestions?
Patrick

Change History (11)

comment:1 Changed 6 months ago by dcase

The first things that I'd check would be ssh connection and disk space (on both computers).

Presumably these are all ok?
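
A quick way to look at disk usage from the command line is sketched below. This is only an illustration: the function names are made up for this example, and `df` reports how full a filesystem is, not a per-user quota, so it does not replace the quota check on SAFE.

```shell
# Extract the "Capacity" column (percent used) from `df -P` output on stdin.
# `df -P` prints one header line followed by one line per filesystem.
parse_df_capacity() {
    awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Report the percent used on the filesystem that holds a given directory.
# (Hypothetical helper name, not part of Rose or Cylc.)
disk_use_percent() {
    df -P "$1" | parse_df_capacity
}

# Example (not run automatically):
#   disk_use_percent "$HOME"
```

Running this on both puma and ARCHER would show whether either filesystem is close to full.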

comment:2 Changed 6 months ago by pmcguire

I checked the disk space a couple of days ago.
As for login7, we can't normally ssh from puma→login7.
Patrick

comment:3 Changed 6 months ago by pmcguire

Yes, I just checked my disk space quota again on SAFE → ARCHER, and everything is fine there.
Patrick

comment:4 Changed 6 months ago by pmcguire

How do I check the ssh connection to login7, if we're not supposed to be able to ssh from puma→login7 anyways?
Patrick

comment:5 Changed 6 months ago by ros

Hi Patrick,

Just ssh from puma to login7.archer.ac.uk in the normal way. If your ssh isn't set up correctly you'll get a permission denied message. If it's ok you'll see the message "Command rejected - not on allowed list" or similar - I can't remember the exact message.

Cheers,
Ros.
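
The check described above can be sketched as a small script. It is a rough illustration only: the function names are invented for this example, and the matched strings are taken from the messages quoted in this ticket, so other systems may word them differently. It uses `-oBatchMode=yes`, as Rose does, so ssh will not prompt for a passphrase.

```shell
# Classify the text printed by an ssh connection test.
# Matching is case-insensitive; the patterns are assumptions based on
# the messages quoted in this ticket.
classify_ssh_output() {
    output=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case "$output" in
        *"permission denied"*)
            echo "auth-failed" ;;
        *"rejected by policy"*|*"not on allowed list"*)
            echo "auth-ok-command-restricted" ;;
        *)
            echo "unknown" ;;
    esac
}

# Run a non-interactive connection test against a host.
# BatchMode disables passphrase prompts, so the key must already be
# available without user input (for example via an ssh agent).
check_batch_ssh() {
    output=$(ssh -oBatchMode=yes -oConnectTimeout=10 "$1" true 2>&1)
    classify_ssh_output "$output"
}

# Example (not run automatically):
#   check_batch_ssh login7.archer.ac.uk
```

A result of `auth-ok-command-restricted` would mean the key was accepted and only the command was refused, while `auth-failed` points at the key or agent setup.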

comment:6 Changed 6 months ago by ros

P.S. You'll see error messages in log/job/err.

comment:7 Changed 6 months ago by pmcguire

Thanks, Ros:
These are the error messages that I get:
pmcguire@puma:~> ssh login7.archer.ac.uk
Enter passphrase for key '/home/pmcguire/.ssh/id_rsa_archerum':
PTY allocation request failed on channel 0
Comand rejected by policy. Not in authorised list
Connection to login7.archer.ac.uk closed.

comment:8 Changed 6 months ago by pmcguire

Hi Ros:
So I guess that means it's OK? (see the last comment).
The error message when I do a rose suite-run --restart with ssh -oBatchMode=yes login7.archer.ac.uk is 'Permission Denied'.
Patrick

comment:9 Changed 6 months ago by pmcguire

Hi Ros:
It seems to be working now.
Not sure exactly what I changed.
But I did restart my ssh agent for archerum.

But I was getting the message

PTY allocation request failed on channel 0
Comand rejected by policy. Not in authorised list 

even before I restarted my ssh agent.
Patrick

comment:10 Changed 6 months ago by grenville

Patrick

You'll need the archerum key loaded in an ssh agent for Rose/Cylc.

Grenville
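
One way to see whether an agent is reachable and holding keys is to look at the exit status of `ssh-add -l`. The sketch below is only an illustration; `describe_agent_status` is a made-up helper name, and the key path in the example is the one shown earlier in this ticket.

```shell
# Interpret the exit status of `ssh-add -l`:
#   0 = agent reachable and holding at least one key
#   1 = agent reachable but holding no keys
#   2 = no agent could be contacted
describe_agent_status() {
    case "$1" in
        0) echo "agent has keys loaded" ;;
        1) echo "agent is running but holds no keys" ;;
        2) echo "no agent reachable" ;;
        *) echo "unexpected status $1" ;;
    esac
}

# Example (not run automatically):
#   ssh-add -l; describe_agent_status $?
#   ssh-add ~/.ssh/id_rsa_archerum   # load the key if it is missing
```

If the agent is empty or unreachable, the non-interactive ssh that Rose runs has no way to ask for the passphrase, which would fit the 'Permission Denied' error reported above.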

comment:11 Changed 6 months ago by grenville

  • Resolution set to answered
  • Status changed from new to closed