Opened 4 months ago

Closed 2 weeks ago

#3427 closed help (completed)

Cylc with Remote Access from PumaTest

Reported by: luciana
Owned by: ros
Component: Rose/Cylc
Keywords: Cylc, Remote Access, PumaTest, Archer2
Cc:
Platform: ARCHER2
UM Version:

Description

Good afternoon.

I'm still struggling to make a cylc suite work with remote access.

From PumaTest:

----

$ hostname
pumatest.nerc.ac.uk
$ cylc --version
7.8.1

Suite copied from: /home/fcm/cylc-7.8.1/tests/remote/basic
Testing directory: /home/luciana/test-pumatest
export CYLC_TEST_TASK_HOST="xfer1.jasmin.ac.uk"
export CYLC_TEST_TASK_OWNER="lucy"

----

$ cylc register test-pumatest suite.rc

2020-11-17T14:44:22Z WARNING - deprecated items were automatically upgraded in 'user config':
2020-11-17T14:44:22Z WARNING - * (6.11.0) [state dump rolling archive length] - DELETED (OBSOLETE)
2020-11-17T14:44:22Z WARNING - * (7.0.0) [pyro][base port] - DELETED (OBSOLETE)
2020-11-17T14:44:22Z WARNING - * (7.0.0) [pyro][maximum number of ports] → [communication][maximum number of ports] - DELETED (OBSOLETE)
2020-11-17T14:44:22Z WARNING - * (7.0.0) [pyro][ports directory] - DELETED (OBSOLETE)
2020-11-17T14:44:22Z WARNING - * (7.0.0) [pyro] - DELETED (OBSOLETE)
2020-11-17T14:44:22Z WARNING - * (7.0.0) [authentication][hashes] - DELETED (OBSOLETE)
2020-11-17T14:44:22Z WARNING - * (7.0.0) [authentication][scan hash] - DELETED (OBSOLETE)
2020-11-17T14:44:22Z WARNING - * (7.0.0) [execution polling intervals] → [hosts][localhost][execution polling intervals] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.0.0) [submission polling intervals] → [hosts][localhost][submission polling intervals] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.3.1) [hosts][localhost][remote shell template] → [hosts][localhost][ssh command] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.3.1) [hosts][login\w*.archer.ac.uk][remote shell template] → [hosts][login\w*.archer.ac.uk][ssh command] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.3.1) [hosts][jasmin-xfer\d*.ceda.ac.uk][remote shell template] → [hosts][jasmin-xfer\d*.ceda.ac.uk][ssh command] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.3.1) [hosts][jasmin-sci\d*.ceda.ac.uk][remote shell template] → [hosts][jasmin-sci\d*.ceda.ac.uk][ssh command] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.3.1) [hosts][eddie3.ecdf.ed.ac.uk][remote shell template] → [hosts][eddie3.ecdf.ed.ac.uk][ssh command] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.3.1) [hosts][dtn02.rdf.ac.uk][remote shell template] → [hosts][dtn02.rdf.ac.uk][ssh command] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.3.1) [hosts][localhost][remote copy template] → [hosts][localhost][scp command] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.3.1) [hosts][login\w*.archer.ac.uk][remote copy template] → [hosts][login\w*.archer.ac.uk][scp command] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.3.1) [hosts][jasmin-xfer\d*.ceda.ac.uk][remote copy template] → [hosts][jasmin-xfer\d*.ceda.ac.uk][scp command] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.3.1) [hosts][jasmin-sci\d*.ceda.ac.uk][remote copy template] → [hosts][jasmin-sci\d*.ceda.ac.uk][scp command] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.3.1) [hosts][eddie3.ecdf.ed.ac.uk][remote copy template] → [hosts][eddie3.ecdf.ed.ac.uk][scp command] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.3.1) [hosts][dtn02.rdf.ac.uk][remote copy template] → [hosts][dtn02.rdf.ac.uk][scp command] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.6.0) [hosts][localhost][remote tail command template] - DELETED (OBSOLETE)
2020-11-17T14:44:22Z WARNING - * (7.6.0) [hosts][login\w*.archer.ac.uk][remote tail command template] - DELETED (OBSOLETE)
2020-11-17T14:44:22Z WARNING - * (7.6.0) [hosts][jasmin-xfer\d*.ceda.ac.uk][remote tail command template] - DELETED (OBSOLETE)
2020-11-17T14:44:22Z WARNING - * (7.6.0) [hosts][jasmin-sci\d*.ceda.ac.uk][remote tail command template] - DELETED (OBSOLETE)
2020-11-17T14:44:22Z WARNING - * (7.6.0) [hosts][eddie3.ecdf.ed.ac.uk][remote tail command template] - DELETED (OBSOLETE)
2020-11-17T14:44:22Z WARNING - * (7.6.0) [hosts][dtn02.rdf.ac.uk][remote tail command template] - DELETED (OBSOLETE)
2020-11-17T14:44:22Z WARNING - * (7.6.0) [hosts][localhost][local tail command template] → [hosts][localhost][tail command template] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.6.0) [hosts][login\w*.archer.ac.uk][local tail command template] → [hosts][login\w*.archer.ac.uk][tail command template] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.6.0) [hosts][jasmin-xfer\d*.ceda.ac.uk][local tail command template] → [hosts][jasmin-xfer\d*.ceda.ac.uk][tail command template] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.6.0) [hosts][jasmin-sci\d*.ceda.ac.uk][local tail command template] → [hosts][jasmin-sci\d*.ceda.ac.uk][tail command template] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.6.0) [hosts][eddie3.ecdf.ed.ac.uk][local tail command template] → [hosts][eddie3.ecdf.ed.ac.uk][tail command template] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.6.0) [hosts][dtn02.rdf.ac.uk][local tail command template] → [hosts][dtn02.rdf.ac.uk][tail command template] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.8.0) [suite host scanning] → [suite servers] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.8.0) [suite servers][hosts] → [suite servers][scan hosts] - value unchanged
2020-11-17T14:44:22Z WARNING - * (7.8.0) [suite logging][roll over at start-up] - DELETED (OBSOLETE)
2020-11-17T14:44:22Z ERROR - bad user config /home/luciana/.cylc/global.rc
Illegal item: [task events]reset timer
$

----

Kind regards.
Luciana.

Change History (21)

comment:1 Changed 4 months ago by ros

Hi Luciana,

Please remove the file /home/luciana/.cylc/global.rc and try again.
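A cautious variation on the step above is to move the file aside rather than delete it, so the old settings can still be inspected later. This is only a sketch: the `.bak` suffix is an arbitrary choice, not something prescribed in this ticket.

```shell
# Move the personal Cylc config out of the way instead of deleting it.
# The ".bak" name is an assumption made for this example.
rc="$HOME/.cylc/global.rc"
if [ -f "$rc" ]; then
    mv "$rc" "$rc.bak"
fi
```

With the file out of the way, cylc should fall back to the site configuration, which is the effect the removal is meant to achieve.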

Regards,
Ros.

comment:2 Changed 4 months ago by ros

P.S. The environment variables are CYLC_TEST_HOST and CYLC_TEST_OWNER, not CYLC_TEST_TASK_*. The suite works fine for me.
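For reference, the corrected settings would look something like this, reusing the host and account values from the original description:

```shell
# Corrected variable names; values taken from the original report.
export CYLC_TEST_HOST="xfer1.jasmin.ac.uk"
export CYLC_TEST_OWNER="lucy"
```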

comment:3 Changed 4 months ago by luciana

Dear Ros.

Thank you for your support. It seems to be working, and now I can carry on further tests.

Kind regards.

Luciana.

comment:4 Changed 3 months ago by luciana

Dear Ros.

I proceeded with my tests and they're not working. I'm using pumatest.nerc.ac.uk and trying to access xfer1.jasmin.ac.uk as a remote server. I do have access from one machine to the other without a passphrase.

The job appears on xfer1 (task [[foo]]), but the suite doesn't finish and the task [[bar]] is not completed anywhere. I've tried keeping just the task [[foo]] (test s6), but the suite keeps running forever.

The simplified test in /home/fcm/cylc-7.8.1/tests/tutorial/oneoff doesn't even appear on the remote host (test s3).

I've added some information that might be useful to you in helping me.

Kind regards.

Luciana.

----

$ hostname
pumatest.nerc.ac.uk

Directory: test-pumatest
(Suite copied from /home/fcm/cylc-7.8.1/tests/remote/basic)

$ cylc scan
test-pumatest luciana@…:43046

$ pwd
/home/luciana/cylc-run/test-pumatest/log/suite

$ more log
2020-11-18T12:58:52Z INFO - Suite server: url=https://pumatest.nerc.ac.uk:43046/ pid=10577
2020-11-18T12:58:52Z INFO - Run: (re)start=0 log=1
2020-11-18T12:58:52Z INFO - Cylc version: 7.8.1
2020-11-18T12:58:52Z INFO - Run mode: live
2020-11-18T12:58:52Z INFO - Initial point: 1
2020-11-18T12:58:52Z INFO - Final point: 1
2020-11-18T12:58:52Z INFO - Cold Start 1
2020-11-18T12:58:54Z INFO - [foo.1] -submit-num=1, owner@host=lucy@…
2020-11-18T12:58:56Z INFO - [foo.1] -(current:ready) submitted at 2020-11-18T12:58:56Z
2020-11-18T12:58:56Z INFO - [foo.1] -health check settings: submission timeout=None

$ pwd
/home/luciana/cylc-run/test-pumatest/log/job/1/foo/01

$ more job-activity.log
[jobs-submit ret_code] 0
[jobs-submit out] 2020-11-18T12:58:56Z|1/foo/01|0|17909
(lucy@…) 2020-11-18T12:58:56Z [STDOUT] 17909

----

[lucy@xfer1 ~]$ hostname
xfer1.jasmin.ac.uk

[lucy@xfer1 test-pumatest]$ pwd
/home/users/lucy/cylc-run/test-pumatest

[lucy@xfer1 01]$ more job.err
2020-11-18T12:58:56Z WARNING - Message send failed, try 1 of 7: Cannot connect: https://pumatest.nerc.ac.uk:43046/put_messages: <urlopen error [Errno 113] No route to host>

….

retry in 5.0 seconds, timeout is 30.0

2020-11-18T12:59:58Z WARNING - Message send failed, try 7 of 7: Cannot connect: https://pumatest.nerc.ac.uk:43046/put_messages: <urlopen error [Errno 113] No route to host>

[lucy@xfer1 01]$ more job.out
Suite : test-pumatest
Task Job : 1/foo/01 (try 1)
User@Host: lucy@…
2020-11-18T12:58:56Z INFO - started
2020-11-18T12:59:27Z INFO - succeeded

[lucy@xfer1 01]$ more job.status
CYLC_BATCH_SYS_NAME=background
CYLC_BATCH_SYS_JOB_ID=17909
CYLC_BATCH_SYS_JOB_SUBMIT_TIME=2020-11-18T12:58:56Z
CYLC_JOB_PID=17909
CYLC_JOB_INIT_TIME=2020-11-18T12:58:56Z
CYLC_JOB_EXIT=SUCCEEDED
CYLC_JOB_EXIT_TIME=2020-11-18T12:59:27Z

comment:5 Changed 3 months ago by luciana

Hello.

I'm rereading section 7.18 (Remote Tasks), which says:

A task remote account must satisfy several requirements:

Network settings must allow communication back from the remote task job to the suite, either by network ports or ssh, unless the last-resort one way task polling communication method is used.

It looks like xfer1 is trying to connect with pumatest, without success. When I try to establish that connection in the terminal, I fail. I've copied the ssh keys, started the ssh agent, but when I try to add it, it says I have a bad passphrase (which is not the case because it works from my computer to puma).

[lucy@xfer1 .ssh]$ ssh -Y luciana@…
Permission denied (publickey).

I'm back to just guessing now.

Kind regards.

Luciana.

comment:6 Changed 3 months ago by ros

Hi Luciana,

Addressing your last comment first. You don't need to be able to ssh from xfer1 to pumatest. Those communication warnings can be ignored. Cylc is set up on pumatest so that it employs the last-resort method stated above, i.e. the one-way task polling communication method from pumatest to xfer1.

I think the suite problem is a mismatch of cylc versions; your environment is setting up the path to cylc-6.11.4 on xfer1, which certainly won't work.

Please comment out the 2 PATH exports in your ~/.bash_profile on JASMIN and replace with the path to cylc-7.8.1:

export PATH=/home/users/rshatcher/software/bin:$PATH

Log out of xfer1 and back in again and check that you've got the correct version in your environment by running cylc --version.
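The version check can be sketched as a simple string comparison. The helper name versions_match is made up for this example, and the two version strings are hard-coded from this ticket; in practice each would come from running cylc --version on the machine in question.

```shell
# Hypothetical helper: succeeds only if the two version strings are identical.
versions_match() {
    [ "$1" = "$2" ]
}

local_version="7.8.1"     # reported by `cylc --version` on pumatest
remote_version="6.11.4"   # version the xfer1 environment was picking up

if versions_match "$local_version" "$remote_version"; then
    echo "versions match"
else
    echo "version mismatch: local=$local_version remote=$remote_version"
fi
```

With the values above this prints the mismatch message; once the PATH change has taken effect on xfer1, both strings should read 7.8.1.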

Give that a try.
Cheers,
Ros.

comment:7 Changed 3 months ago by luciana

Dear Ros.

It didn't work. And now I have one extra problem: cylc stop is not killing the suites.

Both machines now have cylc 7.8.1 (xfer1 had cylc 7.8.6). The outcome of my last test, s7, is the same as the old ones.

Kind regards.

Luciana.

comment:8 Changed 3 months ago by ros

Hi Luciana,

The foo task for s7 has worked; see /home/users/lucy/cylc-run/s7/log/job/1/foo/01/job.status on xfer1. You will also see that the host.txt file has been created correctly in the corresponding task work directory.

rshatcher@xfer1$ ls /home/users/lucy/cylc-run/s7/work/1/foo
host.txt

The suite s7 works fine for me and updates to succeeded in the GUI automatically for me. How long are you leaving the suite running before you're trying to stop it? From the suite log file it's looking like the stop command is being issued after only a couple of minutes. Cylc only polls the remote tasks every 5 minutes so it can be up to 5 minutes after the remote task has finished before cylc status/GUI will update on pumatest. Either try leaving the suite running for 5 minutes or manually poll the task.
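Manually polling a task means asking the suite to check the job's status on the remote host straight away instead of waiting for the next polling interval. A minimal sketch, assuming the Cylc 7 command form cylc poll REG [TASK_GLOB ...]; the wrapper name poll_task is made up for this example, and cylc poll --help is the authority on the exact syntax.

```shell
# Hypothetical wrapper around `cylc poll`; run it on the suite host (pumatest).
poll_task() {
    if [ "$#" -ne 2 ]; then
        echo "usage: poll_task SUITE TASK.CYCLE" >&2
        return 2
    fi
    cylc poll "$1" "$2"
}

# Example for the suite discussed here (task foo at cycle point 1):
#   poll_task s7 foo.1
```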

Regards,
Ros.


comment:9 Changed 3 months ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

P.S. In answer to your cylc stop question: by default, cylc stop will only shut down the suite after the currently active tasks have finished running. I believe your suite still has running tasks when you issue the stop command, so it will wait until those tasks have finished before shutting the suite down. See cylc stop --help for the full list of options.
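The three shutdown behaviours can be sketched as follows, assuming Cylc 7's --kill and --now options (the long forms of the -k and -n flags used later in this ticket). The wrapper name stop_suite and its mode names are made up for this example.

```shell
# Hypothetical wrapper showing three ways to stop a Cylc 7 suite.
stop_suite() {
    suite="$1"
    mode="${2:-wait}"
    case "$mode" in
        wait) cylc stop "$suite" ;;          # default: wait for active tasks to finish
        kill) cylc stop --kill "$suite" ;;   # kill active tasks, then shut down
        now)  cylc stop --now "$suite" ;;    # shut down without waiting for active tasks
        *)    echo "unknown mode: $mode" >&2
              return 2 ;;
    esac
}

# Example: stop_suite s1 kill
```

The default mode is the safest choice; the other two are mainly useful when a task is stuck and the suite would otherwise wait indefinitely.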

comment:10 Changed 3 months ago by luciana

Dear Ros.

I think we need to take a step back.

The suite s7 works fine for me and updates to succeeded in the GUI automatically for me.

⇒ It doesn't work for me in the command line. I'm not using GUI; I'm not supposed to.

As I mentioned before:

The job appears in xfer1 (task [[foo]]), but the suite doesn't finish and the task [[bar]] is not completed anywhere. I've tried to keep just the task [[foo]] (test s6), but the suite keeps running forever.

The outcome of my last test, s7, is the same as the old ones.

So I'm aware [[foo]] is there. Where is [[bar]] for the suites that have [[bar]]? Why don't the suites without [[bar]] finish?

How long are you leaving the suite running before you're trying to stop it?

⇒ The suites s1 to s6 were running for more than 24h, and the suite test-pumatest has been running since 2020-11-25T16:11:21Z. None of the suites do anything other than the communication (I added a touch a.txt to see something happening), so I still have no clue why they haven't finished.

I was able to stop the suites s1 and s2 using cylc stop -k and cylc stop -n, respectively. But I want to know why the suites are not finishing properly. The suites that have a [[bar]] task never get to this task. In the suite s7, I removed this task and, even then, the suite is still running when I call cylc scan.

Either try leaving the suite running for 5 minutes or manually poll the task.

What do you mean by manually polling the task?

----

Kind regards.

Luciana.

comment:11 Changed 3 months ago by ros

Hi Luciana,

I've just tweaked the central config file, as the automatic task polling wasn't working properly. That is why none of your task statuses were being updated and the suites were getting stuck.

Please try running ONE of your suites again and let me know which one so I can check to make sure it is now polling at 5 minute intervals correctly.

Regards,
Ros.

comment:12 Changed 3 months ago by luciana

Hi Ros.

It worked! Thank you! :D

Kind regards.

Luciana.

comment:13 Changed 3 months ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed

comment:14 Changed 8 weeks ago by luciana

  • Resolution fixed deleted
  • Status changed from closed to reopened

Hello.

I'm back to working with Cylc and remote access.

It's working from pumatest to xfer1.jasmin.ac.uk (simple example).

It's NOT working from puma to xfer1.jasmin.ac.uk (the suite never finishes).
Login and remote access are NOT working from puma/pumatest to archer/archer2 (permission denied, command rejected by policy, not in authorised list, authenticated with partial success, etc.).

Are some of these machines open to working together (other than pumatest and jasmin)? Is there another server in jasmin that I can play with, at least?

Kind regards.

Luciana.

comment:15 Changed 8 weeks ago by ros

Hi Luciana,

Switching between puma and pumatest is not trivial. As you are aware, the two machines run different versions of cylc, and you need to ensure that the matching version is picked up on JASMIN. You have things working from pumatest, so please stick to running from this machine.

pumatest → Archer/Archer2 will be possible when Archer have fixed an issue that is preventing submission of pure cylc suites at the moment. I will chase them up today.

Interactive login from puma/pumatest to Archer/Archer2 is not possible. Only a subset of commands can be run, which includes all commands run by Rose/Cylc suites. See http://cms.ncas.ac.uk/wiki/ArcherSshAgent point 5 for the response you should get when you attempt to log in to Archer/Archer2.

Regards,
Ros.

comment:16 Changed 8 weeks ago by luciana

Dear Ros.

I'm happy to keep working from pumatest. I just wanted more external options than just Jasmin. I'll wait for Archer to be ready.

About pumatest, is there a way to improve the terminal options? We don't have autocomplete, tabs, home/end, nothing seems to work there.

Kind regards.

Luciana.

comment:17 Changed 8 weeks ago by ros

Hi Luciana,

Totally understand. I'll let you know when Archer/2 is fixed.

As for terminal options, my knowledge is very limited. Autocomplete using tab works for me ok. I've not set anything special up for that to my knowledge. Otherwise, from a quick google, I think you may be able to configure the setup you want in a ~/.inputrc file.

Cheers,
Ros.

comment:18 Changed 5 weeks ago by ros

Hi Luciana,

Just to let you know that pure cylc suites should now submit ok to ARCHER2.

Before submitting a suite, check that your connection to ARCHER2 is correct (see point 5 on page http://cms.ncas.ac.uk/wiki/Archer2/SshAgentSetup).

Cheers,
Ros.

comment:19 Changed 3 weeks ago by ros

  • Resolution set to answered
  • Status changed from reopened to closed

comment:20 Changed 3 weeks ago by luciana

  • Keywords PumaTest, Archer2 added; PumaTest removed
  • Platform changed from Other to ARCHER2
  • Resolution answered deleted
  • Status changed from closed to reopened

Hi Ros.

Unfortunately, it's still not working for me. I'm getting Permission denied (publickey) from PUMATEST to ARCHER2. I would guess I didn't get my permissions set in PUMATEST, just in PUMA. But in PUMA, pure Cylc doesn't work. :/

The Cylc suite I tested is in /home/luciana/s4 and I'm copying here the whole ssh -vvv command.

Kind regards.

Luciana.

----
[CMS edit: ssh trace removed for security]

Last edited 2 weeks ago by ros

comment:22 Changed 2 weeks ago by ros

  • Resolution set to completed
  • Status changed from reopened to closed

Moved to #3476 as this is an old ticket and the problem is different to the original.
