#2020 closed help (fixed)
ssh failures on archiving
| Reported by: | simon.tett | Owned by: | willie |
|---|---|---|---|
| Component: | Archiving | Keywords: | |
| Cc: | | Platform: | ARCHER |
| UM Version: | 8.5 | | |
Description
Dear helpdesk,
I get the occasional ssh failure (roughly 1 in 10 archive attempts) when archiving data. My current workaround is for the archiver to ignore errors and continue. When this happens, the archiver seems to produce partial conversions but no output. Any suggestions?
ta
Simon
Attachments (2)
Change History (18)
comment:1 Changed 4 years ago by grenville
comment:2 Changed 4 years ago by simon.tett
Dear Grenville,
thanks. The current case on ARCHER (xldsx) is a good example: in 2 years of running I would expect it to produce 60 files. Of those, 9 failed due to, I think, an ssh error.
Failed to archive xldsxa.pa1990oct.ff with status 255.
Failed to archive xldsxa.pm1990dec.ff with status 255.
Failed to archive xldsxa.pm1991feb.ff with status 255.
Failed to archive xldsxa.pm1991aug.ff with status 255.
Failed to archive xldsxa.pm1991sep.ff with status 255.
Failed to archive xldsxa.pm1991oct.ff with status 255.
Failed to archive xldsxa.ps1991son.ff with status 255.
Failed to archive xldsxa.pm1992mar.ff with status 255.
Failed to archive xldsxa.pm1992apr.ff with status 255.
I can hand-archive the failures, so I can work around it, and my archive code keeps going despite such errors… Still a pain!
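Status 255 is the exit status ssh returns for its own errors, as opposed to the exit status of the remote command. A minimal sketch of a retry wrapper around a single archive step follows; `retry_on_255`, the attempt count, and the delay are hypothetical and are not part of the UM archiving scripts.

```shell
#!/bin/sh
# Hypothetical helper (not part of the UM archiving system):
# re-run a command while it exits with status 255, the status ssh
# reports for its own errors, up to MAX_ATTEMPTS times.
# Usage: retry_on_255 MAX_ATTEMPTS DELAY_SECONDS command [args...]
retry_on_255() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        status=0
        "$@" || status=$?
        # Any status other than 255 (including success) is returned as-is.
        if [ "$status" -ne 255 ]; then
            return "$status"
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts (status 255)" >&2
            return 255
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}
```

It would be called as, for example, `retry_on_255 3 30 some_archive_command args`, where `some_archive_command` stands for whatever the archive script runs over ssh. Retrying only hides the symptom; it does not address why the connection is dropped.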
Simon
comment:3 Changed 4 years ago by willie
Hi Simon,
I have had a look at your setup and can confirm that it is an SSH problem and not anything to do with the UM or archiving. I have reduced your setup to the attached simple problem which shows the effect on ARCHER. Just copy the script to /work and run
test_tett3 > out.log 2>&1
I have put in a query to the ARCHER help desk.
Regards
Willie
comment:4 Changed 4 years ago by simon.tett
Hi Willie,
good to know it is not me! I had this problem with the standard archiving system too. I suspect you could simplify the test case even more.
I'll await response from ARCHER team then.
Simon
comment:5 Changed 4 years ago by ros
- Owner changed from um_support to willie
- Status changed from new to assigned
comment:6 Changed 4 years ago by willie
Hi Simon,
ARCHER have replied:
I have done some investigation and discussed with the sysadmin team re: the termination of your PP jobs. The reason is that the only modes of access to the PP nodes that is supported are (from https://www.archer.ac.uk/documentation/user-guide/connecting.php#sec-2.1.2):
1. Via the serial queues
2. Via direct interactive SSH
As a result, processes running on the PP nodes which are not from an interactive SSH session or a current serial batch job may be terminated, and this is what is happening in your case.
However, they are aware of the need to solve this issue and will be in further contact with me.
Regards
Willie
comment:7 Changed 4 years ago by willie
Hi Simon,
I now have a solution from ARCHER. It is as simple as changing the SSH options. In your qsserver script you have
export SSHOPT="-n -o UserKnownHostsFile=/dev/null \
-o StrictHostKeyChecking=no -o NumberOfPasswordPrompts=0 \
-c arcfour -i $HOME/.ssh/um_arch -q"
If you replace this with,
export SSHOPT="-t -i $HOME/.ssh/um_arch -q"
the -t option associates a terminal with the process launched on the post processing nodes and this is enough to prevent the ARCHER security poll (every five minutes) from terminating the process.
I have tried this in my test harness and in a three month run of your job - see my xnbkd with the archive on /nerc/n02/n02/wmcginty/archive - and both have been successful.
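A check of this kind can be sketched as follows. This is not the attached test_tett3 script, whose contents are not shown in this ticket; `run_long_remote`, `SSH_CMD`, and the host name `pp-node.example` are hypothetical stand-ins. The idea is simply to start a remote command that outlives the five-minute poll and see whether ssh still returns status 0.

```shell
#!/bin/sh
# Hypothetical stand-ins: the real post-processing host name and the
# contents of test_tett3 are not given in this ticket.
SSH_CMD=${SSH_CMD:-ssh}
REMOTE_HOST=${REMOTE_HOST:-pp-node.example}
SSHOPT="-t -i $HOME/.ssh/um_arch -q"

# Start a remote command lasting SECONDS and report ssh's exit status.
# Usage: run_long_remote SECONDS
run_long_remote() {
    seconds=$1
    status=0
    # SSHOPT is deliberately unquoted so it splits into separate options.
    $SSH_CMD $SSHOPT "$REMOTE_HOST" "sleep $seconds; echo survived" || status=$?
    echo "ssh exit status: $status"
    return "$status"
}
```

Calling `run_long_remote 360` keeps the remote process alive for six minutes, which is longer than one poll interval. A status of 255 would suggest the session was cut off rather than the remote command failing.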
Regards,
Willie
comment:8 Changed 4 years ago by simon.tett
Hi Willie,
thanks. I'm done for the moment running simulations but will be doing some more soon. I'll modify my archive script then!
Simon
comment:9 Changed 4 years ago by willie
- Resolution set to fixed
- Status changed from assigned to closed
comment:10 Changed 4 years ago by simon.tett
- Resolution fixed deleted
- Status changed from closed to reopened
Hi,
I've modified my archive script as suggested and still have a roughly 1 in 10 failure rate on archiving…
Simon
comment:11 Changed 4 years ago by willie
Hi Simon,
Yes, my idealized test script is failing too. I've asked ARCHER if anything has changed.
Regards
Willie
comment:12 Changed 4 years ago by simon.tett
Hi Willie,
any update on this? Worse still, I get the occasional job just killed, which then brings down the UM and is a moderate pain to fix!
Simon
comment:13 Changed 4 years ago by willie
Simon,
There's no update yet. There are a few others with the same issue. Perhaps after the ARCHER maintenance is complete there will be a reply.
Regards
Willie
comment:14 Changed 4 years ago by willie
Hi Simon,
There is now a much more reliable solution available. It just involves replacing the "-t" SSH option with "-tt".
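Applied to the SSHOPT line suggested earlier in this ticket, the change would look like the sketch below. According to the ssh manual, a single -t requests a pseudo-terminal, while repeating it forces allocation even when ssh itself has no local terminal. That is normally the situation inside a batch job, and it is one plausible reason (not confirmed in this ticket) why -tt behaves more reliably. The `has_local_tty` helper is hypothetical and only illustrates the distinction.

```shell
#!/bin/sh
# Same options as the earlier suggestion, with -t doubled.
export SSHOPT="-tt -i $HOME/.ssh/um_arch -q"

# Hypothetical helper: report whether this shell has a terminal on stdin.
# A batch job normally does not, which is the case where a single -t
# is not enough for ssh to allocate a pseudo-terminal.
has_local_tty() {
    if [ -t 0 ]; then
        echo yes
    else
        echo no
    fi
}
```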
Regards
Willie
comment:15 Changed 4 years ago by willie
- Resolution set to fixed
- Status changed from reopened to closed
comment:16 Changed 3 years ago by simon.tett
Just to say I modified my code to use -tt and all ran sweetly. No more archive failures!
Thanks for your help.
Simon
Simon
I'd like to get to the root of the ssh problem - I'm in contact with ARCHER to try to do that.
Grenville