Opened 7 weeks ago

Closed 10 days ago

#3275 closed help (fixed)

pp transfer

Reported by: eelrm Owned by: um_support
Component: UM Model Keywords: pptransfer archer
Cc: Platform: JASMIN
UM Version:

Description

I'm trying to run a job on ARCHER, which has worked with the new workflow you sent yesterday, but the suite is set up to archive to JASMIN and has got stuck on the fcm_make2_pptransfer task.

Is there also a workaround to get the pp transfer to work?

Change History (24)

comment:1 Changed 7 weeks ago by grenville

Lauren

What is the suite id?
Please allow us read access to your JASMIN home directory

Grenville

comment:2 Changed 7 weeks ago by eelrm

Hi Grenville,

Thanks for making the ticket. The suite ID is u-bu651. I think I've changed the permissions on Jasmin. I also noticed that the fcm_make2_pptransfer appeared to have succeeded in the output files but was still reading as submitted in the rose GUI. I set this task to succeeded and it has now failed during pp transfer with 'Permission denied'.

Many thanks,

Lauren

comment:3 Changed 7 weeks ago by grenville

Hi Lauren

Please try this (this effectively mimics what you have just done on PUMA):

1) in site/archer.rc change jasmin-xfer2 to jasmin-xfer3
2) log in to jasmin-xfer3 and set up the .ssh/config file exactly as on PUMA (you will need to put your new ARCHER private key on JASMIN)
3) Add the ARCHER private key to the ssh-agent (I'm assuming you have an agent running on xfer3)
4) from xfer3 login to ARCHER
5) logout of ARCHER
6) reload the suite (rose suite-run --reload)
7) retrigger the transfer task

You should be OK to log out of JASMIN - the connection established in (1-4) should persist and enable the data transfer. The connection will need to be reset periodically (repeat 4)

Grenville

comment:4 Changed 7 weeks ago by grenville

Lauren

You don't need an ssh agent to be running - you can ignore (3).

Grenville

comment:5 Changed 7 weeks ago by eelrm

Hi Grenville,

I had trouble logging into ARCHER from xfer3 so recopied the ssh-agent start up script and changed xfer2 to xfer3 in my bashrc. Should I not have done this? I can now log onto ARCHER from xfer3.

I had some trouble reloading - it didn't respond so I stopped the suite and restarted. pptransfer has failed again with permission denied.

Thanks,

Lauren

comment:6 Changed 6 weeks ago by grenville

Lauren

Please go to jasmin-xfer3, login to ARCHER, logout of ARCHER, then run the command

ssh -O check login.archer.ac.uk

what is the output?

Grenville

comment:7 Changed 6 weeks ago by eelrm

Hi Grenville,

Think I may have found my problem - I was logging into login1 directly which gave: Control socket connect(/tmp/ssh-socket-eelrm@…): No such file or directory. Logging on to login.archer gave 'Master running…'. I retriggered the pptransfer task. It's still listing as 'submitted' in the GUI but appears to have been successful. Should I manually reset as succeeded? Is there any reason why it's not doing this automatically?

Thanks,

Lauren
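
Editorial note: the two results above are consistent with the control socket being tied to the host name given on the command line, so the check has to use the same name that was used to log in. This depends on the ControlPath setting in .ssh/config, which is not shown in this ticket, so treat it as an assumption. A minimal shell sketch of the check is below; check_master and SSH_CMD are hypothetical names introduced here for illustration, and only the ssh -O check call comes from the ticket.

```shell
# Hypothetical helper: report whether an SSH master connection exists for a host.
# SSH_CMD may be overridden for a dry run; by default the real ssh client is used.
check_master() {
    host=$1
    if "${SSH_CMD:-ssh}" -O check "$host" >/dev/null 2>&1; then
        echo "master running for $host"
    else
        echo "no master for $host: log in to $host from this machine to start one"
    fi
}

# Example call, using the host name from the ticket (run on jasmin-xfer3):
# check_master login.archer.ac.uk
```

Running the helper with login.archer.ac.uk rather than an individual login node keeps the check aligned with the name used when the connection was opened.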

comment:8 Changed 6 weeks ago by grenville

Lauren

We still can't see your JASMIN home directory. Do this:

chmod -R g+rX /home/users/eelrm

Sounds like it's worked. If so, try polling the task - or set to succeeded. I don't know why cylc has not updated the status; that's usually indicative of a comms problem.

Grenville
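
Editorial note: the capital X in the chmod command above adds execute (search) permission for the group only on directories, and on files that already have an execute bit, while r adds read permission everywhere. The sketch below shows the effect on a scratch directory rather than a real home directory; the directory layout and the file name data.pp are made up for the illustration.

```shell
# Build a throwaway tree that starts with owner-only access.
demo=$(mktemp -d)
mkdir "$demo/sub"
touch "$demo/sub/data.pp"
chmod -R go-rwx "$demo"

# Same form of command as in the ticket, applied to the scratch tree.
chmod -R g+rX "$demo"

# Record the resulting permission strings (first 10 characters of ls -ld).
dir_mode=$(ls -ld "$demo/sub" | cut -c1-10)
file_mode=$(ls -ld "$demo/sub/data.pp" | cut -c1-10)
echo "directory: $dir_mode"
echo "file:      $file_mode"

# Clean up the scratch tree.
rm -rf "$demo"
```

The directory gains group read and search access, while the plain file gains group read only, which is what support staff in the same group need in order to cd into the tree and read files.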

comment:9 Changed 6 weeks ago by eelrm

Hi Grenville, I reset the permissions to be able to add my archer key. I've re-run now. Ok, great. I've reset the state and it's moved onto housekeeping.

Thanks,

Lauren

comment:10 Changed 6 weeks ago by eelrm

Hi Grenville,

I've managed to get most of the pptransfers through but it's got stuck on the penultimate cycle and I'm not sure what has changed. I've logged in to archer and jasmin, but keep getting a critical failure.

Thanks,

Lauren

comment:11 Changed 6 weeks ago by grenville

I no longer have permissions

cd ~eelrm
-bash: cd: /home/users/eelrm: Permission denied

Grenville

comment:12 Changed 6 weeks ago by eelrm

Updated. I do now have the following when I log into xfer3: mux_client_request_session: read from master failed: Broken pipe

comment:13 Changed 6 weeks ago by grenville

Lauren

Check that the master connection is active from xfer3 to ARCHER - it should activate by logging into ARCHER from xfer3.

log out of xfer3, then login again and try :

ssh login.archer.ac.uk -n ls /work/n02/n02/eelrm/archive/u-bu651/21911001T0000Z/checksums

what happens?

Grenville

comment:14 Changed 6 weeks ago by eelrm

It lists: /work/n02/n02/eelrm/archive/u-bu651/21911001T0000Z/checksums

comment:15 Changed 6 weeks ago by grenville

Please retrigger the pptransfer task

comment:16 Changed 6 weeks ago by eelrm

Hi Grenville, it's been killed again

comment:17 Changed 6 weeks ago by grenville

hmm - it's failing to generate checksums

try this on ARCHER

cd /work/n02/n02/eelrm/archive/u-bu651/21911001T0000Z
rm checksums
md5sum * > checksums
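
Editorial note: after regenerating the file as above, the checksums can be verified by hand with md5sum -c. That verification step is an addition here, not something stated in the ticket, and it is not a description of what the pptransfer task does internally. The sketch runs in a scratch directory with two made-up files, a.pp and b.pp.

```shell
# Work in a throwaway directory so no real archive data is touched.
work=$(mktemp -d)
(
    cd "$work" || exit 1
    printf 'example data\n' > a.pp
    printf 'more data\n' > b.pp

    # Same commands as in the ticket: remove any old file, then regenerate.
    rm -f checksums
    md5sum * > checksums

    # Verify every file listed in checksums against its recorded hash.
    md5sum -c checksums
)
verify_status=$?
rm -rf "$work"
echo "verification exit status: $verify_status"
```

A non-zero exit status from md5sum -c would indicate a missing or altered file.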

comment:18 Changed 6 weeks ago by grenville

Lauren

It may be simplest and most expedient to switch off checksumming. The current ARCHER 2FA issues have thrown up some problems which will be resolved when the dust settles.

Grenville

comment:19 Changed 6 weeks ago by eelrm

Hi Grenville, okay I'll give it a go. Do I just need to untick the verify_chksums box under jasmin transfer and then reload the suite? Thanks

comment:20 Changed 6 weeks ago by grenville

Lauren

Yes - that should do it

Grenville

comment:21 Changed 4 weeks ago by eelrm

Hi Grenville,

The pptransfer seems to be working consistently now but still stays as 'submitted' in the GUI and so I have to manually reset to succeeded every cycle. Is there anything I can try to sort out the communication?

Thanks,

Lauren

comment:22 Changed 10 days ago by eelrm

Hi Grenville,

I'm still having problems with this. I am trying to run a copy of the suite above, ID is u-bv666. I could no longer log onto xfer3 and saw that the instructions for pptransfer had changed so updated my suite to instead push the data but I am still getting permission denied. I can log on to xfer2 from espp1/2 fine.

Thanks,
Lauren

comment:23 Changed 10 days ago by eelrm

Just realised that I had not changed the transfer_type and remote_host in the post processing panel. It is now running - please ignore the above! Thanks, Lauren

comment:24 Changed 10 days ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed

Good catch
