Opened 2 years ago
Closed 2 years ago
#2683 closed help (fixed)
TRANSFER app
Reported by: mvguarino
Owned by: ros
Component: Rose/Cylc
Platform: ARCHER
Description
Hello,
I have updated the settings of my suite u-ba937 to add the PPTRANSFER task following this: http://cms.ncas.ac.uk/wiki/Docs/PostProcessingAppArcherSetup .
When I try to run the suite I now get this error:
[FAIL] cylc validate -v --strict u-ba937 # return-code=1, stderr=
[FAIL] WARNING: deprecated items were automatically upgraded in 'suite definition':
[FAIL] * (6.11.0) [runtime][RETRIES][retry delays] -> [runtime][RETRIES][job][execution retry delays] - value unchanged
[FAIL] ERROR, bad graph node format:
[FAIL] coupled[-P1M] => coupled => POSTPROC:succeed-all => pptransfer => \
[FAIL] Correct format is NAME(<PARAMS>)([CYCLE-POINT-OFFSET])(:TRIGGER-TYPE)
What did I do wrong?
Thanks,
Vittoria
Change History (32)
comment:1 Changed 2 years ago by ros
- Component changed from Coupled model to Rose/Cylc
comment:2 Changed 2 years ago by mvguarino
Hi Ros,
Thanks! I could never have guessed that was the problem, now it works… but I got a different error:
[FAIL] ssh -oBatchMode=yes dtn02.rdf.ac.uk bash --login -c \'ROSE_VERSION=2016.11.1\ rose\ suite-run\ -v\ -v\ --name=u-ba937\ --run=restart\ --remote=uuid=85037dba-967c-4077-8a9f-678607b999e8\' # return-code=255, stderr=
[FAIL] Permission denied (publickey,password).
Note that I can login from PUMA to dtn02 using my ARCHER username without the need of password and/or passphrase.
Any idea what the problem could be?
Thank you,
Vittoria
comment:3 Changed 2 years ago by ros
Hi Vittoria,
Can you try adding the following to your ~/.ssh/config file on PUMA.
Host dtn02 dtn02.rdf.ac.uk
    Hostname dtn02.rdf.ac.uk
    User <your archer username>
    ForwardAgent no
Regards,
Ros.
comment:4 Changed 2 years ago by mvguarino
Hi Ros,
That didn't work, I still get the same error.
Vittoria
comment:5 Changed 2 years ago by ros
Hi Vittoria,
Can you remind me, have you used the transfer app before at all with any other suites?
Regards,
Ros.
comment:6 Changed 2 years ago by mvguarino
Hi Ros,
I have never used the transfer app before. Trying now for the first time with this suite.
Vittoria
comment:7 Changed 2 years ago by ros
- Owner changed from um_support to ros
- Status changed from new to accepted
Hi Vittoria,
Ok. Can you please copy my directory ~ros/roses/u-al624 on PUMA and then do a rose suite-run. This suite is setup to test connections between PUMA, DTN02 and JASMIN. Let me know how that goes.
Also have you set up your ssh between dtn02 and jasmin-xfer2.ceda.ac.uk so that you're not prompted for a passphrase?
Regards,
Ros.
comment:8 Changed 2 years ago by mvguarino
Mhhhh I got the same error again:
[INFO] symlink: /home/mvguarino/cylc-run/u-al624 <= /home/mvguarino/.cylc/u-al624
[FAIL] ssh -oBatchMode=yes dtn02.rdf.ac.uk bash --login -c \'ROSE_VERSION=2016.11.1\ rose\ suite-run\ -v\ -v\ --name=u-al624\ --run=run\ --remote=uuid=26a77bc5-2d9f-4430-81fb-f2e9acb77fa9\' # return-code=255, stderr=
[FAIL] Permission denied (publickey,password).
Is there a way I can tell the system to ssh using my archer username?
If I type in ssh -oBatchMode=yes vittoria@dtn02.rdf.ac.uk I successfully connect. I tried to change the ‘host’ in suite.rc but that didn’t work.
I did set up the ssh-agent between dtn02 and JASMIN, and again I can log in successfully using my JASMIN username.
comment:9 Changed 2 years ago by ros
Hi Vittoria,
Re using your archer username to ssh to dtn02 that was the bit I asked you to add to your ~/.ssh/config file on PUMA above - the ssh will then pick up your username from this file.
Once you've checked the entries can you verify that this is working as expected by running ssh without specifying your username on PUMA:
ssh dtn02.rdf.ac.uk
Similarly on dtn02 make sure you have the following in your /nerc/n02/n02/vittoria/.ssh/config:
Host jasmin-xfer2 jasmin-xfer2.ceda.ac.uk
    Hostname jasmin-xfer2.ceda.ac.uk
    User <your archer username>
    IdentityFile <path to your jasmin ssh-key>
    ForwardAgent no
If you can't run the sshs without specifying your usernames it implies there is something wrong with the ssh config files. If you can't see the problem, can you copy your .ssh/config files to somewhere I can see them please.
Ros.
comment:10 Changed 2 years ago by mvguarino
This is my config file on PUMA (I have also made a copy of it in /home/mvguarino, you should be able to see it)
Host login*.archer.ac.uk
    User vittoria
Host dtn02 dtno2.rdf.ac.uk
    Hostname dtn02.rdf.ac.uk
    User vittoria
    ForwardAgent no
And yes, ssh dtn02.rdf.ac.uk does not work.
I have added what you suggested on my config file on dtn02 and now I can ssh to xfer2 without using my username.
Vittoria
comment:11 Changed 2 years ago by mvguarino
Found the problem, just now that I have copied it to this ticket: typo in the second dtn02…
Sorry didn't see this earlier.
However, I now get this error:
[FAIL] ssh -oBatchMode=yes dtn02.rdf.ac.uk bash --login -c \'ROSE_VERSION=2016.11.1\ rose\ suite-run\ -v\ -v\ --name=u-ba937\ --run=restart\ --remote=uuid=acdf1cca-b45d-4b12-8ac5-b624e4cd7cf9\' # return-code=127, stderr=
[FAIL] bash: rose: command not found
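The typo found in this comment (a misspelt name on the Host line) is easy to miss by eye. A minimal sketch of a check, assuming a hypothetical helper name host_is_listed: it succeeds only if the exact name appears on some Host line of the given config file. It does plain string comparison only and does not evaluate wildcard patterns such as login*.archer.ac.uk.

```shell
# Hypothetical helper (not part of ssh, Rose or Cylc).
# Usage: host_is_listed NAME CONFIG_FILE
# Succeeds if NAME appears literally on a "Host" line of CONFIG_FILE.
host_is_listed() {
    awk -v want="$1" '
        tolower($1) == "host" {
            # Compare every alias on the Host line with the wanted name.
            for (i = 2; i <= NF; i++) if ($i == want) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$2"
}

# Example call (commented out so the block has no side effects):
# host_is_listed dtn02.rdf.ac.uk ~/.ssh/config || echo "no Host entry for dtn02.rdf.ac.uk"
```

With the config file as posted in comment 10, the check would fail for dtn02.rdf.ac.uk, because the second alias is spelt dtno2.rdf.ac.uk.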
comment:12 Changed 2 years ago by dcase
Vittoria,
your inability to run Rose may be similar to that in ticket #2022, which suggests exporting a variable into your ~/.profile:
export PATH=$PATH:$UMDIR/software/bin
then it can pick up the Rose executable.
Bear in mind too that ARCHER's filesystems are undergoing maintenance, which will affect any transfers that you're trying to perform. The status of the computers is shown here:
https://www.archer.ac.uk/status/
Dave
comment:13 Changed 2 years ago by ros
Hi Vittoria,
Please look in my .profile on dtn02 (username ros). I think from memory the path is /general/y07/umshared/software/bin. You'll need to set full path as $UMDIR is not set on DTN.
Ros
comment:14 Changed 2 years ago by ros
Back to my desk now so just looked in your .profile. You've got the correct PATH export there, you just need to uncomment it.
comment:15 Changed 2 years ago by mvguarino
Hi,
I had already added the path to my .profile on dtn02 following the instructions given here (http://cms.ncas.ac.uk/wiki/Docs/PostProcessingAppArcherSetup - last point). It turned out I had to add it to my .bash_profile to make it work (I then commented the one in .profile)
Suite is running now, however.. I can’t see the transfer app in the GUI, which may be a bad sign …
Vittoria
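For background on why the export worked from .bash_profile but not from .profile: a bash login shell (such as the "bash --login -c" in the error above) reads only the first file it finds out of ~/.bash_profile, ~/.bash_login and ~/.profile. The sketch below reports which one that would be; the function name login_startup_file is made up for illustration. Whether a ~/.bash_profile already existed on dtn02 is an assumption, as the ticket does not say.

```shell
# Hypothetical helper: print the per-user startup file that a bash login
# shell would read, i.e. the first readable one of the three candidates.
# Usage: login_startup_file [HOME_DIR]   (defaults to $HOME)
login_startup_file() {
    lsf_home="${1:-$HOME}"
    for lsf_name in .bash_profile .bash_login .profile; do
        if [ -r "$lsf_home/$lsf_name" ]; then
            printf '%s\n' "$lsf_home/$lsf_name"
            return 0
        fi
    done
    return 1    # none of the three files exists
}

# Example call (commented out so the block has no side effects):
# login_startup_file
```

If this prints a .bash_profile path, anything placed only in .profile is ignored by login shells unless .bash_profile sources it.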
comment:16 Changed 2 years ago by mvguarino
and actually, not sure things are related, but postproc_nemo and postproc_cice are now 'retrying' (suite u-ba937)
comment:17 Changed 2 years ago by ros
Hi Vittoria,
Looking at the status files in your cylc-run directory am I correct in thinking that you are trying to add pptransfer to an already running suite? If so this is rather more complicated. It looks like you have already run the model for 5 cycles. Can you confirm this is what you are doing and also if the first cycle is still showing in the cylc GUI or not. Thanks.
Ros.
comment:18 Changed 2 years ago by mvguarino
Hi Ros,
Yes, my simulation has been running for already quite a long time and the first cycle is not in the GUI anymore.
Am I trying to do something that is not feasible?
(guess the alternative would be to restart it as a new-run using archived restart files?)
Vittoria
comment:19 Changed 2 years ago by mvguarino
In the meantime – I don’t know what happened- postproc for nemo and cice is failing:
[WARN] [SUBPROCESS]: Command: ncdump -hs /work/n02/n02/vittoria/cylc-run/u-ba937/share/data/History_Data/CICEhist/archive_ready/cice_ba937i_1d_19491101-19491201.nc
[SUBPROCESS]: Error = 1:
ncdump: invalid option -- 's'
ncdump [-V|-c|-h|-u] [-v ...] [[-b|-f] [c|f]] [-l len] [-n name] [-d n[,n]] file
  [-V] Display version of the HDF4 library and exit
  [-c] Coordinate variable data and header information
  [-h] Header information only, no data
  [-u] Replace nonalpha-numerics in names with underscores
  [-v var1[,...]] Data for variable(s) <var1>,... only
  [-b [c|f]] Brief annotations for C or Fortran indices in data
  [-f [c|f]] Full annotations for C or Fortran indices in data
  [-l len] Line length maximum in data section (default 80)
  [-n name] Name for netCDF (default derived from file name)
  [-d n[,n]] Approximate floating-point values with less precision
  file File name of input netCDF file
Could you please advise on this error too? something is wrong with the netcdf handling, never had this before.
Thanks,
Vittoria
comment:20 Changed 2 years ago by ros
Hi Vittoria,
I don't think it's possible to insert tasks into cycles that have finished a long time ago as the cycle information will have been cleaned up. So I very much doubt it would work for your suite. Equally if you have been running this suite for many years already it's not really feasible to manually insert the pptransfer task into every cycle that has already run.
I would suggest that it would be easier to restart as a new run and include the pptransfer from the beginning then.
Regarding the postproc_nemo & postproc_cice this error indicates a mismatch in the version of netcdf being used in the model run and the postproc.
I think the problem has occurred when you added the line:
pre-script = "module load nco/4.6.8; module load anaconda; export PYTHONPATH=$PYTHONPATH:$UMDIR/lib/python2.7; module list; ulimit -s unlimited"
to the [[POSTPROC]] family. I need to make it clear in the instructions that this may or may not be needed depending on your suite setup. You already had postproc running so this change wasn't needed. Try removing this line, reloading and retriggering.
Sorry that this is becoming a bit long-winded. In the Rose/Cylc world it's unfortunately impossible to write instructions that cover all possible setups.
Regards,
Ros.
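The usage text in the ncdump error above mentions the HDF4 library, which suggests that the ncdump first on the PATH after the module loads was the HDF4 one and not the netCDF one that accepts -s. A rough sketch of how to test for that, assuming a made-up helper name looks_like_hdf4_ncdump that simply scans whatever ncdump prints for the string "HDF4":

```shell
# Hypothetical helper: read ncdump's usage or error text on stdin and
# succeed if it mentions HDF4, as the usage text in the log above does.
looks_like_hdf4_ncdump() {
    grep -q 'HDF4'
}

# Example calls (commented out so the block has no side effects):
# command -v ncdump                       # which ncdump is first on PATH
# ncdump 2>&1 | looks_like_hdf4_ncdump && echo "HDF4 ncdump is first on PATH"
```

Running the two commented lines once with and once without the extra pre-script line would show whether the module loads change which ncdump is picked up.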
comment:21 Changed 2 years ago by mvguarino
Hi Ros,
That's fine, thank you. It didn't occur to me that this could be a problem, I thought I could just add the transfer app at any stage and the new files will be moved to JASMIN (while I would transfer manually all the others).
As the simulation is half-way now, I thought by doing this I would buy myself some time.
As for postproc, I did wonder If I had to add the pre-script line…
I removed it, and it works now!
Thanks,
Vittoria
comment:22 Changed 2 years ago by ros
Hi Vittoria,
Don't stop the suite just yet, I think I have a way of at least getting transfer to run from your current cycle without having to stop and do a new run. I'm just putting together some instructions.
Cheers,
Ros.
comment:23 Changed 2 years ago by ros
I think we can insert the tasks into the running suite so it will just start doing the transfer from where the suite has currently got to - I've just tried it and it's worked. If you haven't already stopped the suite and you'd like to give this a go try doing the following:
1. In the Cylc GUI: Control -> Insert Task(s)...
2. Set TASK-NAME.CYCLE-POINT=fcm_make_pptransfer.<YYYYMMDDT0000Z>, where <YYYYMMDDT0000Z> is an active cycle point (e.g. 19491201T0000Z)
3. Leave stop-point=POINT blank
4. Check the "Do not check if a cycle point is valid or not" box
5. Insert, and wait for the task to complete. You may need to manually trigger it.
6. Repeat steps 1-5 for the task names fcm_make2_pptransfer and pptransfer
Hopefully that will work. You may need to insert the pptransfer task into all the active cycle points, once it's been inserted into the last active cycle point showing in the cylc GUI it should then go on to include it automatically in all new ones.
Regards,
Ros.
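To avoid typing mistakes in the Insert Task(s) dialog, the three TASK-NAME.CYCLE-POINT values from the steps above can be generated for a given cycle point. This is only a convenience sketch; the function name pptransfer_task_ids is made up, and the values still have to be entered in the GUI one at a time.

```shell
# Hypothetical helper: print the TASK-NAME.CYCLE-POINT value for each of
# the three tasks named in the steps above, in the order they are inserted.
# Usage: pptransfer_task_ids CYCLE_POINT   (e.g. 19491201T0000Z)
pptransfer_task_ids() {
    for pti_task in fcm_make_pptransfer fcm_make2_pptransfer pptransfer; do
        printf '%s.%s\n' "$pti_task" "$1"
    done
}

# Example call (commented out so the block has no side effects):
# pptransfer_task_ids 19491201T0000Z
```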
comment:24 Changed 2 years ago by mvguarino
Hi Ros,
Thanks! fcm_make_pptransfer and fcm_make2_pptransfer succeeded, now I am waiting for the coupled task to run and see what will happen with the PPTRANSFER task (currently waiting ).
Fingers crossed,
Vittoria
comment:25 Changed 2 years ago by mvguarino
Hi Ros,
The Transfer task is failing with what seems to be again a permission problem:
[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:nemocicepp.nl: skip missing optional source: namelist:moose_arch
[WARN] [SUBPROCESS]: Command: rsync -av --stats --rsync-path=mkdir -p /gws/nopw/j04/pmip4_vol1/users/vittoria/u-ba937/19500101T0000Z && rsync /nerc/n02/n02/vittoria/u-ba937/19500101T0000Z/ jasmin-xfer2.ceda.ac.uk:/gws/nopw/j04/pmip4_vol1/users/vittoria/u-ba937/19500101T0000Z
[SUBPROCESS]: Error = 255:
Access to this system is monitored and restricted to authorised users. If you do not have authorisation to use this system, you should not proceed beyond this point and should disconnect immediately. Unauthorised use could lead to prosecution. (See also - http://www.stfc.ac.uk/aup)
ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
[WARN] Transfer command failed: rsync -av --stats --rsync-path="mkdir -p /gws/nopw/j04/pmip4_vol1/users/vittoria/u-ba937/19500101T0000Z && rsync" /nerc/n02/n02/vittoria/u-ba937/19500101T0000Z/ jasmin-xfer2.ceda.ac.uk:/gws/nopw/j04/pmip4_vol1/users/vittoria/u-ba937/19500101T0000Z
[ERROR] transfer.py: Unknown Error - Return Code=255
[FAIL] Command Terminated
[FAIL] Terminating PostProc...
[FAIL] transfer.py # return-code=1
Received signal ERR
cylc (scheduler - 2018-11-29T17:06:36Z): CRITICAL Task job script received signal ERR at 2018-11-29T17:06:36Z
cylc (scheduler - 2018-11-29T17:06:36Z): CRITICAL failed at 2018-11-29T17:06:36Z
I have checked and I can log into jasmin-xfer2 from dtn02, but only if I use the -A option:
ssh -A jasmin-xfer2.ceda.ac.uk
Otherwise I am asked for passphrase.
Vittoria
comment:26 Changed 2 years ago by ros
Having to use the -A option (which forwards an existing agent, which is no good for cylc) implies that your ssh-agent is not running properly on dtn02.
Log in to dtn02 from puma (without using -A) and try running ssh-add to add your jasmin key to the agent. I suspect that you may get an error connecting to the agent. If so you will need to remove the ~/.ssh/environment.dtn02 file, log out and back in again to start up a new agent, and then run ssh-add. I'm hoping that will fix the problem.
Regards,
Ros.
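A quick way to tell "no agent reachable" apart from "agent running but no key loaded" is the exit status of ssh-add -l, which OpenSSH documents as 0 on success, 1 if the command fails and 2 if the agent cannot be contacted. The sketch below only translates that status into words; the function name explain_ssh_add_status is made up for illustration.

```shell
# Hypothetical helper: describe the exit status of "ssh-add -l".
# Usage: ssh-add -l >/dev/null 2>&1; explain_ssh_add_status $?
explain_ssh_add_status() {
    case "$1" in
        0) echo "agent reachable and holding at least one key" ;;
        1) echo "agent reachable but no keys listed" ;;
        2) echo "cannot contact an ssh-agent from this session" ;;
        *) echo "unexpected exit status: $1" ;;
    esac
}
```

Run on dtn02 in a session opened without -A, a status of 2 would match the "error connecting to the agent" case described above, and a status of 1 would mean only ssh-add is still needed.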
comment:27 Changed 2 years ago by mvguarino
Hi Ros,
There was indeed a problem with the ssh-agent, however restarting it and running ssh-add again didn’t solve the problem. The ssh-agent runs fine within the current session but I am asked for passphrase at each login (coming to think of it this happens to me also when I log into JASMIN from my local unix account).
The only way around it I could find is the following (for future reference in case someone will have the same problem):
I generated a new pair of keys (public and private) on dtn02 without a passphrase. I copied the new public key into the authorized_keys2* file on JASMIN and ran ssh-add on dtn02 to add the new private key.
This seemed to work, but to avoid future similar problems I added to my .bash_profile on dtn02 :
eval $(ssh-agent)
ssh-add ~/.ssh/new_private_key
so the new identity is added at each login.
Everything seems to be working now, the transfer app is moving the desired data to JASMIN!
Thanks for your help,
Vittoria
*when I added the new public key to the authorized_keys file, the latter kept on being overwritten (I don’t know why and by what process) every 10 min or so and the new key disappeared (and the access to the machine with it).
comment:28 Changed 2 years ago by ros
Hi Vittoria,
It's great that you have tried other things out, however, using a passphraseless key is breaching JASMIN security. JASMIN automatically overwrites the authorized_keys file every few minutes to be sure that it only contains the key that you have uploaded to the JASMIN portal.
I know you want to get on with transferring the data, but we will still need to work out what is going on here. Before you added the above 2 lines to your .bash_profile on dtn02 can you confirm you definitely had:
. ~/.ssh/ssh-setup
or
. ~/.ssh/setup
in your .bash_profile depending on what you called the script? The essential point being the "." at the beginning. Missing that off will cause the setup to die on exit.
A colleague has also just pointed out that you have over 80 ssh-agent processes currently running on dtn02 which will cause problems. The eval $(ssh-agent) starts up a new ssh-agent process on every single login.
You will need to kill all these processes and then please try the original ssh-setup again.
Regards,
Ros.
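The difference between a bare eval $(ssh-agent) and a setup script that reuses an agent can be sketched as follows. This is not the contents of the actual ssh-setup script, which the ticket does not show; it is a minimal illustration, assuming the agent's settings were saved to a file such as the ~/.ssh/environment.dtn02 mentioned earlier. The function name need_new_agent is made up.

```shell
# Hypothetical helper: decide whether a new ssh-agent has to be started.
# Usage: need_new_agent ENV_FILE
# ENV_FILE is assumed to hold the saved output of an earlier "ssh-agent -s".
# Succeeds (returns 0) if the file is missing, records no PID, or records a
# PID that is no longer running. It only checks that some process with that
# PID exists for this user, not that the process really is ssh-agent.
need_new_agent() {
    [ -r "$1" ] || return 0
    nna_pid=$(
        unset SSH_AGENT_PID
        . "$1" >/dev/null 2>&1
        printf '%s' "${SSH_AGENT_PID:-}"
    )
    [ -n "$nna_pid" ] || return 0
    if kill -0 "$nna_pid" 2>/dev/null; then
        return 1    # recorded agent still alive: reuse it
    fi
    return 0
}

# Sketch of use in a login file (commented out so the block has no side effects):
# env_file="$HOME/.ssh/environment.dtn02"
# if need_new_agent "$env_file"; then
#     ssh-agent -s > "$env_file"      # start exactly one new agent
#     chmod 600 "$env_file"
# fi
# . "$env_file" > /dev/null           # pick up SSH_AUTH_SOCK and SSH_AGENT_PID
```

With a guard like this each login attaches to the agent that is already running, so the process count stays at one. A bare eval $(ssh-agent) in a login file leaves one more agent behind on every login.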
comment:29 Changed 2 years ago by mvguarino
Hi Ros,
ahaha, sorry I will change back my .bash_profile, as that is causing problems.
I did have in my .profile
. ~/.ssh/ssh-setup
However, just as with the $PATH environment variable, this was not being picked up from .profile (and I didn't realize until now). Now that I have added the line above to my .bash_profile it is finally working!
Vittoria
comment:30 Changed 2 years ago by ros
Hi Vittoria,
Phew! Glad it's all working now.
I have updated the setup instructions with some of the gotchas encountered, which will hopefully make it a little easier for the next person.
Have a good weekend.
Regards,
Ros.
comment:31 Changed 2 years ago by mvguarino
Many thanks for your help,
Vittoria
comment:32 Changed 2 years ago by willie
- Resolution set to fixed
- Status changed from accepted to closed
Hi Vittoria,
This is one of those annoying ones where everything looks fine…..except there are extra spaces at the end of a line. This is ok in a lot of places except in the middle of a graph line.
The offending line is:
pptransfer {{ '=> \\' if HOUSEKEEP else '' }}
Remove the extra spaces at the end of this line and it should validate and run ok.
Cheers,
Ros.
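Trailing spaces like these are invisible in most editors, so a search is the quickest way to find them. A minimal sketch, assuming a made-up helper name find_trailing_whitespace; it flags every line ending in spaces or tabs, not only graph lines, so some hits will be harmless.

```shell
# Hypothetical helper: list lines of a file that end in spaces or tabs.
# Prints LINE_NUMBER:LINE for each match; returns non-zero if there are none.
# Usage: find_trailing_whitespace FILE
find_trailing_whitespace() {
    grep -nE '[[:space:]]+$' "$1"
}

# Example call (commented out so the block has no side effects):
# find_trailing_whitespace suite.rc
```

Any hit on a graph line that continues with a backslash, or on a Jinja2 line that generates one, is a candidate for the validation error reported in the description.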