Opened 6 months ago

Last modified 4 days ago

#2020 reopened help

ssh failures on archiving

Reported by: simon.tett
Owned by: willie
Priority: normal
Component: Archiving
Keywords:
Cc:
Platform: ARCHER
UM Version: 8.5

Description

Dear helpdesk,

I get an occasional ssh failure (roughly 1 in 10 archive attempts) when archiving data. My current workaround is for the archiver to ignore errors and continue. When this happens, the archiver seems to produce partial conversions but no output. Any suggestions?

ta
Simon

Attachments (2)

test_tett3 (1.3 KB) - added by willie 5 months ago.
Example script
out.log (124 bytes) - added by willie 5 months ago.
test results

Change History (16)

comment:1 Changed 6 months ago by grenville

Simon

I'd like to get to the root of the ssh problem - I'm in contact with ARCHER to try to do that.

Grenville

comment:2 Changed 6 months ago by simon.tett

Dear Grenville,

thanks. The current case on ARCHER (xldsx) is a good example: in 2 years of running I would expect it to produce 60 files. Of those, 9 failed due to, I think, an ssh error.

Failed to archive xldsxa.pa1990oct.ff with status 255.
Failed to archive xldsxa.pm1990dec.ff with status 255.
Failed to archive xldsxa.pm1991feb.ff with status 255.
Failed to archive xldsxa.pm1991aug.ff with status 255.
Failed to archive xldsxa.pm1991sep.ff with status 255.
Failed to archive xldsxa.pm1991oct.ff with status 255.
Failed to archive xldsxa.ps1991son.ff with status 255.
Failed to archive xldsxa.pm1992mar.ff with status 255.
Failed to archive xldsxa.pm1992apr.ff with status 255.

I can archive the failures by hand, so I can work around it, and my archive code keeps going despite such errors. Still a pain!

Simon
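
One possible way to automate the manual re-archiving mentioned above is to retry a transfer when it fails with status 255, which is the status ssh returns for its own errors. The following is a minimal sketch only: the function name retry_on_255 and the RETRY_DELAY variable are hypothetical and are not part of the UM archiving scripts.

```shell
# Hypothetical helper (not part of the UM archiving scripts): re-run a command
# while it exits with status 255, the value ssh uses for its own errors.
# Usage: retry_on_255 MAX_TRIES command [args...]
retry_on_255() {
    max_tries=$1
    shift
    try=1
    while :; do
        if "$@"; then
            return 0
        else
            status=$?
        fi
        # Any other failure status is passed straight back to the caller.
        if [ "$status" -ne 255 ]; then
            return "$status"
        fi
        # Give up once the allowed number of attempts has been used.
        if [ "$try" -ge "$max_tries" ]; then
            return 255
        fi
        try=$((try + 1))
        sleep "${RETRY_DELAY:-5}"   # pause between attempts, in seconds
    done
}
```

It would be wrapped around whatever ssh command the archive script already issues, for example retry_on_255 3 ssh ... with the existing arguments. Note that a remote command that itself exits with 255 would also be retried, so this hides the symptom rather than fixing the underlying ssh problem.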

Changed 5 months ago by willie

Example script

Changed 5 months ago by willie

test results

comment:3 Changed 5 months ago by willie

Hi Simon,

I have had a look at your setup and can confirm that it is an SSH problem and not anything to do with the UM or archiving. I have reduced your setup to the attached simple test case, which shows the effect on ARCHER. Just copy the script to /work and run:

test_tett3 > out.log 2>&1

I have put in a query to the ARCHER help desk.

Regards
Willie

comment:4 Changed 5 months ago by simon.tett

Hi Willie,

good to know it is not me! I had this problem with the standard archiving system too. I suspect you could simplify the test case even more.

I'll await a response from the ARCHER team then.

Simon

comment:5 Changed 5 months ago by ros

  • Owner changed from um_support to willie
  • Status changed from new to assigned

comment:6 Changed 5 months ago by willie

Hi Simon,

ARCHER have replied:

I have done some investigation and discussed with the sysadmin team
re: the termination of your PP jobs.  The reason is that the only
modes of access to the PP nodes that are supported are (from
https://www.archer.ac.uk/documentation/user-guide/connecting.php#sec-2.1.2)

1. Via the serial queues
2. Via direct interactive SSH

As a result, processes running on the PP nodes which are not from an
interactive SSH session or a current serial batch job may be
terminated, and this is what is happening in your case.

However, they are aware of the need to solve this issue and will be in further contact with me.

Regards
Willie

comment:7 Changed 4 months ago by willie

Hi Simon,

I now have a solution from ARCHER. It is as simple as changing the SSH options. In your qsserver script you have

export SSHOPT="-n -o UserKnownHostsFile=/dev/null \
-o StrictHostKeyChecking=no -o NumberOfPasswordPrompts=0 \
-c arcfour -i $HOME/.ssh/um_arch -q"

If you replace this with,

export SSHOPT="-t -i $HOME/.ssh/um_arch -q"

the -t option associates a terminal with the process launched on the post-processing nodes, and this is enough to prevent the ARCHER security poll (every five minutes) from terminating the process.

I have tried this in my test harness and in a three month run of your job - see my xnbkd with the archive on /nerc/n02/n02/wmcginty/archive - and both have been successful.

Regards,
Willie

comment:8 Changed 4 months ago by simon.tett

Hi Willie,

thanks. I'm done for the moment running simulations but will be doing some more soon. I'll modify my archive script then!

Simon

comment:9 Changed 4 months ago by willie

  • Resolution set to fixed
  • Status changed from assigned to closed

comment:10 Changed 3 months ago by simon.tett

  • Resolution fixed deleted
  • Status changed from closed to reopened

Hi,

I've modified my archive script as suggested and still have a roughly 1 in 10 failure rate on archiving.

Simon

comment:11 Changed 3 months ago by willie

Hi Simon,

Yes, my idealized test script is failing too. I've asked ARCHER if anything has changed.

Regards
Willie

comment:12 Changed 2 months ago by simon.tett

Hi Willie,

Any update on this? Worse still, I get the occasional job just killed, which then brings down the UM and is a moderate pain to fix!

Simon

comment:13 Changed 2 months ago by willie

Simon,

There's no update yet. There are a few others with the same issue. Perhaps after the ARCHER maintenance is complete there will be a reply.

Regards
Willie

comment:14 Changed 4 days ago by willie

Hi Simon,

There is now a much more reliable solution available. It just involves replacing the "-t" SSH option with "-tt".

Regards
Willie
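
As a sketch of the change described above, the SSHOPT line from comment:7 with "-t" replaced by "-tt" would look like the following. The ssh manual says that -t forces pseudo-terminal allocation and that repeating it forces allocation even when ssh itself has no local terminal. That may be why "-tt" is more reliable from a batch job, but this explanation is an assumption and is not confirmed anywhere in this ticket.

```shell
# SSH options from comment:7, with "-t" replaced by "-tt" as suggested above.
# The key path $HOME/.ssh/um_arch is taken from the original qsserver script.
export SSHOPT="-tt -i $HOME/.ssh/um_arch -q"

# Optional check (the host name is a placeholder, not a real ARCHER node name):
# running "tty" remotely prints a device such as /dev/pts/N when a terminal
# is attached, and "not a tty" otherwise.
#   ssh $SSHOPT <pp-node> tty
```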
