Opened 7 months ago

Closed 7 months ago

#3205 closed help (answered)

pptransfer not working on archer /work

Reported by: xd904476 Owned by: um_support
Component: Archiving Keywords:
Cc: Platform: ARCHER
UM Version:

Description

Hi, I have been following the instructions on this webpage for transferring suite results from the ARCHER /work filesystem, but my fcm_make2_pptransfer task always gets stuck in the submitted state, so the pptransfer never runs.
I don't get any errors, so I'm not sure what to do. An example is suite u-br869.
Any suggestions, please?

thank you

Change History (16)

comment:2 Changed 7 months ago by ros

Hi Dani,

I suspect it is a problem connecting to JASMIN, but I can't currently find an error message. Can you please change the permissions on your JASMIN home directory so that we can see what's going on in there:

chmod g+rx /home/users/<jasmin-username>

Cheers,
Ros.
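For anyone following along, here is a minimal local sketch of what that chmod does; "demo_home" stands in for /home/users/<jasmin-username> above:

```shell
# Grant the group read+execute on a directory so support staff can
# inspect its contents. "demo_home" is an illustrative stand-in for
# the JASMIN home directory.
mkdir -p demo_home
chmod g+rx demo_home
ls -ld demo_home   # group permission bits should now include r and x
```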

comment:3 Changed 7 months ago by xd904476

Hi Ros,
I've just changed the permissions.
thanks

comment:4 Changed 7 months ago by ros

Hi Dani,

The fcm_make2_pptransfer task has succeeded in all your suites.

The ones that have gone on to the actual transfer step have failed because you still have an ARCHER module load line in the suite.rc file for the pptransfer task.

Error messages are in the job.err files on JASMIN:
/home/users/xd904476/cylc-run/u-br867/log/job/20150101T0000Z/pptransfer/01/job: line 159: module: command not found

In the [[pptransfer]] section, remove the line pre-script = "module load anaconda"

I've adjusted the instructions to say you may need to look elsewhere for this line depending on the suite setup! :-)

Cheers,
Ros.
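For reference, the relevant part of the suite.rc might look like this after the fix (the exact nesting depends on the suite; section name taken from the comment above):

```
[runtime]
    [[pptransfer]]
        # The ARCHER-only line below must be deleted -- the "module"
        # command does not exist on the JASMIN transfer host:
        # pre-script = "module load anaconda"
```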

comment:5 Changed 7 months ago by xd904476

Hi Ros,
I was puzzled about that: I had deleted it from my suite.rc. But the pptransfer fails with a communication error anyway:

Warning: untrusted X11 forwarding setup failed: xauth key data not generated
Warning: No xauth data; using fake authentication data for X11 forwarding.
/usr/bin/xauth: error in locking authority file /home/n02/n02/dflocco/.Xauthority
ls: cannot access /work/n02/n02/dflocco/archive/u-bq739/ensemble_0/20150101T0000Z/checksums: No such file or directory

[WARN] [SUBPROCESS]: Command: ssh -oBatchMode=yes login.archer.ac.uk -n cd /work/n02/n02/dflocco/archive/u-bq739/ensemble_0/20150101T0000Z ; md5sum * > checksums
[SUBPROCESS]: Error = 137:

————————————————————————————————————————

This is a private computing facility. Access to this service is limited to those
who have been granted access by the operating service provider on behalf of the
contracting authority and use is restricted to the purposes for which access was
granted. All access and usage are governed by the terms and conditions of access
agreed to by all registered users and are thus subject to the provisions of the
Computer Misuse Act, 1990 under which unauthorised use is a criminal offence.

If you are not authorised to use this service you must disconnect immediately.


Warning: untrusted X11 forwarding setup failed: xauth key data not generated
Warning: No xauth data; using fake authentication data for X11 forwarding.
bash: line 1: 34431 Killed md5sum * > checksums

[WARN] Failed to generate checksums.
[ERROR] Checksum generation failed.
[FAIL] Command Terminated
[FAIL] Terminating PostProc
[FAIL] transfer.py <<'STDIN'
[FAIL]
[FAIL] 'STDIN' # return-code=1
Received signal ERR
2020-02-25T21:02:25Z CRITICAL - Task job script received signal ERR

I can log in from PUMA to ARCHER and JASMIN, and from JASMIN to ARCHER, without any password prompts. Am I missing some other link?

thanks,
dani

comment:6 Changed 7 months ago by ros

Hi Dani,

You've got X11 forwarding enabled from JASMIN to ARCHER, which doesn't work; however, these are only warnings, so they can be ignored. I'll need to see your ~/.ssh/config files at some point to work that one out. Don't post the contents on here.

The real problem is the failed generation of the checksums: bash: line 1: 34431 Killed md5sum * > checksums.
The command has been killed by ARCHER, so I will need to investigate.

Cheers,
Ros.
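As an aside, the X11 warnings can usually be silenced with an entry along these lines in ~/.ssh/config on JASMIN (illustrative; host name taken from the log above):

```
Host login.archer.ac.uk
    ForwardX11 no
```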

comment:7 Changed 7 months ago by ros

Hi Dani,

For now please switch off verify_chksums in the "postproc → common settings → JASMIN transfer" panel.

Cheers,
Ros.
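The verify_chksums step being switched off corresponds to the md5sum pipeline shown in the log above. A local sketch of what it does, with an illustrative file name:

```shell
# Generate per-file md5 checksums in the archive directory (as the
# remote ssh command in the log does), then verify them. "archive_dir"
# and "field1.nc" are illustrative names.
mkdir -p archive_dir
printf 'model output\n' > archive_dir/field1.nc
( cd archive_dir && md5sum field1.nc > checksums )
( cd archive_dir && md5sum -c checksums )   # prints "field1.nc: OK" on success
```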

comment:8 Changed 7 months ago by xd904476

Hi Ros, I tried that and now I get a different error, still on the communication side of things.
This is the new error: mux_client_request_session: session request failed: Session open refused by peer

Access to this system is monitored and restricted to
authorised users. If you do not have authorisation
to use this system, you should not proceed beyond
this point and should disconnect immediately.

Unauthorised use could lead to prosecution.

(See also - http://www.stfc.ac.uk/aup)

ControlSocket /tmp/ssh-socket-xd904476-xd904476@… already exists, disabling multiplexing
[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:nemocicepp.nl: skip missing optional source: namelist:moose_arch
[WARN] [SUBPROCESS]: Command: ssh -oBatchMode=yes login.archer.ac.uk -n ls -A /work/n02/n02/dflocco/archive/u-bq739/ensemble_2/20150101T0000Z | wc -l
[SUBPROCESS]: Error = 255:

ssh_exchange_identification: Connection closed by remote host

[FAIL] Failed: ssh failed whilst checking for files to transfer
[FAIL] Terminating PostProc
[FAIL] transfer.py <<'STDIN'
[FAIL]
[FAIL] 'STDIN' # return-code=1
Received signal ERR
2020-02-26T11:44:04Z CRITICAL - Task job script received signal ERR

any ideas?

comment:9 Changed 7 months ago by ros

Have you tried retriggering that task? Have other ensemble members worked ok?

Cheers,
Ros.

comment:10 Changed 7 months ago by xd904476

Hi,
I have made the change, restarted the agent for logging in to JASMIN, reloaded the suite u-bq739 and retriggered the pptransfer.

So far it has never worked.

comment:11 Changed 7 months ago by ros

I've just checked ensemble_0, ensemble_1, ensemble_7 & ensemble_13 for u-bq739 and they have all succeeded and data is on JASMIN. Try retriggering ensemble_2 pptransfer…..

comment:12 Changed 7 months ago by xd904476

They all appeared as just submitted, sorry. I must have checked some of the ones that did not work.
I'll retrigger them all.
Thanks
dani

comment:13 Changed 7 months ago by xd904476

  • Resolution set to fixed
  • Status changed from new to closed

comment:14 Changed 7 months ago by xd904476

  • Resolution fixed deleted
  • Status changed from closed to reopened

Hi, I am having further issues with pptransfer on suite u-bs361. I again get an error from the globus command:

[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:nemocicepp.nl: skip missing optional source: namelist:moose_arch
[WARN] [SUBPROCESS]: Command: globus-url-copy -vb -cd -cc 4 -sync sshftp://login.archer.ac.uk/work/n02/n02/dflocco/archive/u-bs361/20150101T0000Z/ file:///gws/nopw/j04/ncas_climate_vol1/users/xd904476/jasmin_roses/u-bs361/20150101T0000Z/
[SUBPROCESS]: Error = 1:

error: Unable to list url sshftp://login.archer.ac.uk/work/n02/n02/dflocco/archive/u-bs361/20150101T0000Z/:
globus_ftp_client: the server responded with an error
500 Server is not configured for SSHFTP connections.

[WARN] Transfer command failed: globus-url-copy -vb -cd -cc 4 -sync sshftp://login.archer.ac.uk/work/n02/n02/dflocco/archive/u-bs361/20150101T0000Z/ file:///gws/nopw/j04/ncas_climate_vol1/users/xd904476/jasmin_roses/u-bs361/20150101T0000Z/
[ERROR] transfer.py: Unknown Error - Return Code=1
[FAIL] Command Terminated
[FAIL] Terminating PostProc
[FAIL] transfer.py <<'STDIN'
[FAIL]
[FAIL] 'STDIN' # return-code=1
Received signal ERR
2020-03-09T15:13:18Z CRITICAL - Task job script received signal ERR

I have fixed this error in the past by deleting the "-p4" option, but now I don't know what to do to fix it.
Could you help, please?
Dani

comment:15 Changed 7 months ago by ros

Hi Dani,

Please change the suite's pptransfer to use rsync rather than GridFTP, as you are doing for u-bq739.

Cheers,
Ros.

comment:16 Changed 7 months ago by ros

  • Component changed from UM Model to Archiving
  • Platform set to ARCHER
  • Resolution set to answered
  • Status changed from reopened to closed