Opened 3 months ago

Closed 3 months ago

#3244 closed help (fixed)

pptransfer broken pipe during rsync

Reported by: m.couldrey
Owned by: um_support
Component: Archiving
Keywords: pptransfer, rsync, jasmin
Cc:
Platform: NEXCS
UM Version:

Description

Hi CMS

I've had no luck restarting a failing PPTRANSFER job for a few days now (suite u-bq244).
My job.err at
~/cylc-run/u-bq244/log/job/19171001T0000Z/pptransfer/NN/job.err
shows the following:

[WARN] [SUBPROCESS]: Command: rsync -av --stats --rsync-path=mkdir -p /gws/nopw/j04/rdf_migrate_vol2/mpc18/NEXCS_OUTPUT/u-bq244/19171001T0000Z && rsync /projects/nexcs-n02/macou/u-bq244/19171001T0000Z/ jasmin-xfer2.ceda.ac.uk:/gws/nopw/j04/rdf_migrate_vol2/mpc18/NEXCS_OUTPUT/u-bq244/19171001T0000Z
[SUBPROCESS]: Error = 255:

Access to this system is monitored and restricted to
authorised users. If you do not have authorisation
to use this system, you should not proceed beyond
this point and should disconnect immediately.

Unauthorised use could lead to prosecution.

(See also - http://www.stfc.ac.uk/aup)

Write failed: Broken pipe
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(641) [sender=3.0.4]

[WARN] Transfer command failed: rsync -av --stats --rsync-path="mkdir -p /gws/nopw/j04/rdf_migrate_vol2/mpc18/NEXCS_OUTPUT/u-bq244/19171001T0000Z && rsync" /projects/nexcs-n02/macou/u-bq244/19171001T0000Z/ jasmin-xfer2.ceda.ac.uk:/gws/nopw/j04/rdf_migrate_vol2/mpc18/NEXCS_OUTPUT/u-bq244/19171001T0000Z
[ERROR] transfer.py: Unknown Error - Return Code=255

The rsync command keeps failing. The beginning of the issue coincided with the nexcs-n02 projects space filling up last week, but I've cleared out a couple of TB since then and the job never makes it through. Not sure what's up, any help would be much appreciated!
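
For readers unfamiliar with the `--rsync-path` option in the failing command: it replaces the program that is run on the remote side, so `mkdir -p <dir> && rsync` creates the target directory before the remote rsync starts. The following is a minimal Python sketch of how such a command could be assembled. `build_rsync_command` is a hypothetical helper written for illustration and is not the code used in transfer.py.

```python
import shlex


def build_rsync_command(source_dir, remote_host, remote_dir):
    """Return an argv list shaped like the command reported in job.err.

    Hypothetical helper for illustration only; not taken from transfer.py.
    """
    # The remote "rsync program" is really a small shell command:
    # create the target directory, then hand over to the real rsync.
    remote_cmd = "mkdir -p {} && rsync".format(shlex.quote(remote_dir))
    return [
        "rsync",
        "-av",
        "--stats",
        "--rsync-path=" + remote_cmd,
        # A trailing slash copies the directory contents, not the directory itself.
        source_dir.rstrip("/") + "/",
        "{}:{}".format(remote_host, remote_dir),
    ]


if __name__ == "__main__":
    cmd = build_rsync_command(
        "/projects/nexcs-n02/macou/u-bq244/19171001T0000Z/",
        "jasmin-xfer2.ceda.ac.uk",
        "/gws/nopw/j04/rdf_migrate_vol2/mpc18/NEXCS_OUTPUT/u-bq244/19171001T0000Z",
    )
    print(" ".join(shlex.quote(part) for part in cmd))
```

Return code 255 is the value ssh itself uses when an error occurs, which is consistent with the "Broken pipe" messages pointing at the connection rather than at the paths in the command.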

Change History (8)

comment:1 Changed 3 months ago by grenville

Matt

Don't you have your own gws now?

Grenville

comment:2 Changed 3 months ago by m.couldrey

Hi Grenville

Yes, this experiment was started before that GWS was available and is still finishing its final years. I'm in the process of transferring output from my experiments over to the fafmip GWS at the moment, but this one hadn't finished running yet (given that it's at year 1917, it only has 2 more years to run, a day or two of compute time). Some of my output has moved across, so you'll see my usage of rdf-migrate2 shrink over the coming days, and I'll start transferring this simulation across as well.

Cheers
Matt

comment:3 Changed 3 months ago by grenville

Have you tried transferring directly to the new gws?

comment:4 Changed 3 months ago by m.couldrey

Thanks for the tip. I've switched the transfer directory to be fafmip_output_vol1 and reloaded and retriggered the suite. Let's see if that goes through. Cheers Grenville!

comment:5 Changed 3 months ago by m.couldrey

Hi Grenville

I reloaded the transfer to the fafmip_output gws and it began ok yesterday, but also stopped again with a similar 'broken pipe' error.

bq244a.p719171211.pp
bq244a.p819170921.pp
bq244a.p819171001.pp
Write failed: Broken pipe
rsync: writefd_unbuffered failed to write 4 bytes [sender]: Broken pipe (32)
rsync: connection unexpectedly closed (696 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(641) [sender=3.0.4]

[WARN] Transfer command failed: rsync -av --stats --rsync-path="mkdir -p /gws/nopw/j04/fafmip_output_vol1/mpc18/NEXCS_OUTPUT/u-bq244/19171001T0000Z && rsync" /projects/nexcs-n02/macou/u-bq244/19171001T0000Z/ jasmin-xfer2.ceda.ac.uk:/gws/nopw/j04/fafmip_output_vol1/mpc18/NEXCS_OUTPUT/u-bq244/19171001T0000Z
[ERROR] transfer.py: Unknown Error - Return Code=255
[FAIL] Command Terminated

I've tried to retrigger the task for now in case it's just an intermittent connection problem, but I'm not sure what else I can try here. Thanks!
Matt

comment:6 Changed 3 months ago by grenville

Matt

Still clutching at straws - please try using xfer3

change xfer2 to xfer3 here:
app/postproc/rose-app.conf:remote_host=jasmin-xfer2.ceda.ac.uk

make sure your .ssh/config understands xfer3

Grenville
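
For anyone making the same change across several suites, the host swap can be scripted. This is a rough sketch that assumes the setting appears as a plain `remote_host=...` line in the file, as in the path quoted above. `set_remote_host` is a hypothetical helper, not part of Rose or the postproc app.

```python
import re


def set_remote_host(conf_text, new_host):
    """Return conf_text with the value of every remote_host line replaced.

    Hypothetical helper for illustration; assumes a plain 'remote_host=value' line.
    """
    # Capture the key and '=' so only the value after it is rewritten.
    pattern = re.compile(r"^(remote_host[ \t]*=[ \t]*).*$", re.MULTILINE)
    new_text, count = pattern.subn(lambda m: m.group(1) + new_host, conf_text)
    if count == 0:
        raise ValueError("no remote_host setting found")
    return new_text


if __name__ == "__main__":
    path = "app/postproc/rose-app.conf"
    with open(path) as handle:
        original = handle.read()
    with open(path, "w") as handle:
        handle.write(set_remote_host(original, "jasmin-xfer3.ceda.ac.uk"))
```

As with the earlier change of transfer directory, the suite needs to be reloaded and the failed task retriggered before the new host takes effect.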

comment:7 Changed 3 months ago by m.couldrey

Thanks for the suggestion, Grenville. That seems to have worked: a couple of cycles have now sent output to the GWS via xfer3.
Thank you very much!

comment:8 Changed 3 months ago by m.couldrey

  • Resolution set to fixed
  • Status changed from new to closed