Opened 11 months ago

Closed 9 months ago

#3107 closed help (fixed)

error in PPtransfer

Reported by: xd904476
Owned by: ros
Component: UM Model
Keywords: pptransfer
Cc:
Platform: ARCHER
UM Version:

Description

Hi,
I am having issues transferring data from the RDF to JASMIN.
The error is not very specific, and I have seen it before:

error: globus_ftp_client: the server responded with an error
500 500-Command failed. : globus_xio: System error in close: Input/output error
500-globus_xio: A system call failed: Input/output error
500 End.

[WARN] Transfer command failed: globus-url-copy -vb -cd -cc 4 -sync file:///nerc/n02/n02/dflocco/archive/u-bo976/ensemble_0/20150101T0000Z/ sshftp://jasmin-xfer2.ceda.ac.uk/gws/nopw/j04/ncas_climate_vol1/users/xd904476/jasmin_roses/u-bo976/ensemble_0/20150101T0000Z/
[ERROR] transfer.py: Unknown Error - Return Code=1
[FAIL] Command Terminated
[FAIL] Terminating PostProc
[FAIL] transfer.py # return-code=1
Received signal ERR
cylc (scheduler - 2019-12-11T12:35:34Z): CRITICAL Task job script received signal ERR at 2019-12-11T12:35:34Z
cylc (scheduler - 2019-12-11T12:35:34Z): CRITICAL failed at 2019-12-11T12:35:34Z

Previously, when this happened, I could manually copy the files that could not be read and get on with my work. When there were too many files, I would just remove them all on JASMIN and retrigger pptransfer from the GUI. Now this method is not working and I don't know why. The amount of data is rather large, but still within the limits I was given.

In the past I was told to remove the -p4 from this command in transfer.py on dtn02, so that I have (line 233):

globus_cmd = 'globus-url-copy -vb -cd -cc 4 -sync'

I've run out of ideas. Could you help, please?

Thanks,
Dani

Change History (15)

comment:1 Changed 11 months ago by ros

Hi Dani,

Looks like a possible problem with globus. If retrying doesn't work, try switching from gridftp to rsync and see if that works OK. You can select rsync in the pptransfer panel in the GUI and then do a reload.

Cheers,
Ros.

comment:2 Changed 11 months ago by xd904476

Hi Ros, my GUI only allows me to switch gridftp to false. Will it then use rsync automatically? I have no panel to choose from.
Thanks,
Dani

comment:3 Changed 11 months ago by ros

Hi Dani,

Yes, sorry, I didn't look at the GUI and misremembered! :-( Yes, just turn gridftp off.

Cheers,
Ros.

comment:4 Changed 11 months ago by xd904476

Thanks. I thought I better ask before messing it up!

just submitted, fingers crossed!

comment:5 Changed 11 months ago by xd904476

Hi, I am getting into trouble with this transfer, I believe:

[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:nemocicepp.nl: skip missing optional source: namelist:moose_arch
[WARN] [SUBPROCESS]: Command: rsync -av --stats --rsync-path=mkdir -p /gws/nopw/j04/ncas_climate_vol1/users/xd904476/jasmin_roses/u-bo976/ensemble_8/20150101T0000Z && rsync /nerc/n02/n02/dflocco/archive/u-bo976/ensemble_8/20150101T0000Z/ jasmin-xfer2.ceda.ac.uk:/gws/nopw/j04/ncas_climate_vol1/users/xd904476/jasmin_roses/u-bo976/ensemble_8/20150101T0000Z
[SUBPROCESS]: Error = 12:

Access to this system is monitored and restricted to
authorised users. If you do not have authorisation
to use this system, you should not proceed beyond
this point and should disconnect immediately.

Unauthorised use could lead to prosecution.

(See also - http://www.stfc.ac.uk/aup)

sending incremental file list
./
bo976a.da20151201_00
bo976a.p42015apr.pp
bo976a.p42015aug.pp
bo976a.p42015feb.pp
bo976a.p42015jan.pp
bo976a.p42015jul.pp
bo976a.p42015jun.pp
rsync: connection unexpectedly closed (1874147 bytes received so far) [sender]
rsync error: error in rsync protocol data stream (code 12) at io.c(600) [sender=3.0.6]

[WARN] Transfer command failed: rsync -av --stats --rsync-path="mkdir -p /gws/nopw/j04/ncas_climate_vol1/users/xd904476/jasmin_roses/u-bo976/ensemble_8/20150101T0000Z && rsync" /nerc/n02/n02/dflocco/archive/u-bo976/ensemble_8/20150101T0000Z/ jasmin-xfer2.ceda.ac.uk:/gws/nopw/j04/ncas_climate_vol1/users/xd904476/jasmin_roses/u-bo976/ensemble_8/20150101T0000Z
[ERROR] transfer.py: System Error: Failed to make transfer directory (ReturnCode=12)
[FAIL] Command Terminated
[FAIL] Terminating PostProc
[FAIL] transfer.py # return-code=1
Received signal ERR
cylc (scheduler - 2019-12-11T22:15:00Z): CRITICAL Task job script received signal ERR at 2019-12-11T22:15:00Z
cylc (scheduler - 2019-12-11T22:15:00Z): CRITICAL failed at 2019-12-11T22:15:00Z

any suggestions?

thanks,
dani
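
Editor's note: the transfer.py message above says "Failed to make transfer directory", but rsync's own output reports code 12 as an error in the protocol data stream, which usually points to a connection that dropped part-way through. The minimal Python sketch below decodes a subset of the exit codes listed in the rsync man page. The names RSYNC_EXIT_CODES and describe_rsync_exit are hypothetical helpers for illustration and are not part of transfer.py.

```python
# Subset of the exit codes documented in the rsync man page.
# Hypothetical helper for illustration; not part of transfer.py.
RSYNC_EXIT_CODES = {
    0: "Success",
    1: "Syntax or usage error",
    2: "Protocol incompatibility",
    3: "Errors selecting input/output files, dirs",
    5: "Error starting client-server protocol",
    10: "Error in socket I/O",
    11: "Error in file I/O",
    12: "Error in rsync protocol data stream",
    23: "Partial transfer due to error",
    24: "Partial transfer due to vanished source files",
    30: "Timeout in data send/receive",
    35: "Timeout waiting for daemon connection",
}


def describe_rsync_exit(code):
    """Return a readable description of an rsync exit code."""
    return RSYNC_EXIT_CODES.get(code, f"Unrecognised rsync exit code {code}")


print(describe_rsync_exit(12))  # Error in rsync protocol data stream
```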

comment:6 Changed 11 months ago by ros

  • Keywords pptransfer added
  • Owner changed from um_support to ros
  • Platform set to ARCHER
  • Status changed from new to accepted

Hi Dani,

The RDF is down for maintenance today, so it is probably worth just retrying when it's back. The fact that both gridftp and rsync are failing suggests an environment issue of some kind.

If they still fail, try running the rsync command on the dtn02 command line and see what happens, for example:

rsync /nerc/n02/n02/dflocco/archive/u-bo976/ensemble_10/20150101T0000Z/ jasmin-xfer2.ceda.ac.uk:/gws/nopw/j04/ncas_climate_vol1/users/xd904476/jasmin_roses/u-bo976/ensemble_10/20150101T0000Z

Cheers,
Ros.

comment:7 Changed 11 months ago by xd904476

Hi Ros,
I waited for the RDF to come back up and ran the suite again with just 3 ensemble members to check on the perturbation, but I got the same error in the middle of the rsync:

http://puma.nerc.ac.uk/rose-bush/view/xd904476/u-bo976?&no_fuzzy_time=0&path=log/job/20150101T0000Z/pptransfer_ensemble0/01/job.err

The data volume is therefore smaller than with the 23 ensemble members, but I keep getting this "environment" error. Could it be that there are too many files or that the folder becomes too big? It doesn't always happen at the same destination folder size, though.

If I try manually, I can ssh into dtn02 without any problems, but I get this:

xd904476@puma:/home/xd904476>
xd904476@puma:/home/xd904476> ssh dflocco@…
Last login: Sat Dec 14 23:17:50 2019 from puma.nerc.ac.uk
[dflocco@dtn02 ~]$
[dflocco@dtn02 ~]$ rsync /nerc/n02/n02/dflocco/archive/u-bo976/ensemble_10/20150101T0000Z/ jasmin-xfer2.ceda.ac.uk:/gws/nopw/j04/ncas_climate_vol1/users/xd904476/jasmin_roses/u-bo976/ensemble_0/20150101T0000Z

Access to this system is monitored and restricted to
authorised users. If you do not have authorisation
to use this system, you should not proceed beyond
this point and should disconnect immediately.

Unauthorised use could lead to prosecution.

(See also - http://www.stfc.ac.uk/aup)

skipping directory .
[dflocco@dtn02 ~]$

Any ideas? thanks

comment:8 Changed 11 months ago by ros

Hi Dani,

Please check that you can ssh from dtn02 to jasmin-xfer2 OK.

Sorry, the rsync command should also have been rsync -av --stats /nerc.....

Cheers,
Ros.
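
Editor's note: without -a (or -r), rsync does not recurse into directories, which is why the manual run in comment:7 only printed "skipping directory .". The Python sketch below assembles the corrected command in the same shape as the one shown in the pptransfer log, including the --rsync-path trick that creates the remote directory first. The function name build_rsync_cmd and its parameters are hypothetical, and nothing here is taken from transfer.py itself.

```python
import shlex


def build_rsync_cmd(source_dir, host, dest_dir, make_remote_dir=True):
    """Build an rsync argument list for an archive-mode directory copy.

    Hypothetical helper: mirrors the command seen in the pptransfer log.
    """
    cmd = ["rsync", "-av", "--stats"]
    if make_remote_dir:
        # Run "mkdir -p" on the remote side before starting the remote rsync.
        cmd.append(f"--rsync-path=mkdir -p {dest_dir} && rsync")
    # A trailing slash copies the directory's contents rather than the directory.
    cmd.append(source_dir.rstrip("/") + "/")
    cmd.append(f"{host}:{dest_dir}")
    return cmd


cmd = build_rsync_cmd(
    "/nerc/n02/n02/dflocco/archive/u-bo976/ensemble_10/20150101T0000Z",
    "jasmin-xfer2.ceda.ac.uk",
    "/gws/nopw/j04/ncas_climate_vol1/users/xd904476/jasmin_roses/u-bo976/ensemble_10/20150101T0000Z",
)
# Print a copy-and-paste form of the command; running it is left to the user,
# e.g. with subprocess.run(cmd) on dtn02.
print(shlex.join(cmd))
```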

comment:9 Changed 11 months ago by xd904476

Hi Ros,

I've done all of this, but pptransfer doesn't seem to work properly in any way. I can log in from puma to dtn02 and from there to JASMIN with no problems.

I have therefore copied the suite to create u-bp862 and run it for 1 month only, to see whether something was stuck. I have set it up as in the attached, but I must have done something wrong because it is running the second coupled task (February), and there are only 2 files in the RDF archive. All of this is only to check whether the perturbation is working well, so I would like to run only a short simulation.

comment:10 Changed 11 months ago by ros

Hi Dani,

Please look at your setup: you have the run length set to 1 month but the cycling period is 1 day. So you have run only 1 day, which is why there is very little to archive. The second coupled task is dated 2nd Jan, not 1st Feb!
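
Editor's note: the effect of the cycling period on the date of the second coupled task can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the period strings "P1D" and "P1M" are ISO 8601 durations chosen for the example, a Gregorian calendar is assumed (the suite may use a different one), and second_cycle_point is a hypothetical helper, not a cylc or Rose function.

```python
from datetime import date, timedelta


def second_cycle_point(start, cycling_period):
    """Return the date of the second cycle for a given cycling period.

    Only the two periods discussed here are handled. A Gregorian calendar
    is assumed, and for "P1M" the start day must exist in the next month.
    """
    if cycling_period == "P1D":
        return start + timedelta(days=1)
    if cycling_period == "P1M":
        year = start.year + start.month // 12  # roll over after December
        month = start.month % 12 + 1
        return date(year, month, start.day)
    raise ValueError(f"Unsupported cycling period: {cycling_period}")


start = date(2015, 1, 1)  # first cycle point seen in the logs: 20150101T0000Z
print(second_cycle_point(start, "P1D"))  # 2015-01-02
print(second_cycle_point(start, "P1M"))  # 2015-02-01
```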

So when you ran the transfer command rsync -av --stats /nerc.... on the dtn02 command line, are you saying this didn't work? If so, can you try rsync'ing to another directory on JASMIN please? I'm wondering if /gws/nopw/j04/ncas_climate_vol1/users/xd904476/jasmin_roses/u-bo976 has become locked.

Cheers,
Ros.

comment:11 Changed 11 months ago by xd904476

Hi Ros,
I'll set the cycling to 1 month for the new suite.

About the rsync, yes: it seems to work for a bit and then it stops. It even says that the checksum has succeeded, but I can see that the files in the 2 directories are not identical. Perhaps something has gone wrong after so many retriggers.

thanks
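
Editor's note: one way to confirm whether two directories really differ is to build a manifest (relative path, size, checksum) on each side and compare the two. The Python sketch below does this for directories that are visible locally; for the RDF and JASMIN copies, the manifest would have to be built on each system and the results compared afterwards. The function names are hypothetical, and SHA-256 is an arbitrary choice here, not the checksum that pptransfer uses.

```python
import hashlib
from pathlib import Path


def build_manifest(directory):
    """Map each file's relative path to (size_in_bytes, sha256_hex)."""
    root = Path(directory)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large model output files fit in memory.
                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
                    digest.update(chunk)
            key = path.relative_to(root).as_posix()
            manifest[key] = (path.stat().st_size, digest.hexdigest())
    return manifest


def compare_manifests(source, dest):
    """Report files missing from dest, extra in dest, or differing in content."""
    common = set(source) & set(dest)
    return {
        "missing": sorted(set(source) - set(dest)),
        "extra": sorted(set(dest) - set(source)),
        "differing": sorted(name for name in common if source[name] != dest[name]),
    }
```

An empty "missing" and "differing" list would indicate that every source file arrived intact; anything listed there is a candidate for a manual re-copy.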

comment:12 Changed 11 months ago by ros

Hi Dani,

Can you please try the rsync command again from the dtn02 command line and send me the command you ran and its output so I can take a look? (It is probably easier to send the output to me via email.)

Cheers,
Ros.

comment:13 Changed 10 months ago by ros

Hi Dani,

I presume you managed to get this going again?

Cheers,
Ros.

comment:14 Changed 10 months ago by xd904476

Hi Ros,

It looks like it is generally working now.
I'm experiencing a problem with pptransfer in u-bq062, but I think the problem is unrelated.
Thanks,
Dani

comment:15 Changed 9 months ago by xd904476

  • Resolution set to fixed
  • Status changed from accepted to closed