Opened 8 months ago

Closed 7 months ago

#2683 closed help (fixed)

TRANSFER app

Reported by: mvguarino
Owned by: ros
Component: Rose/Cylc
Keywords:
Cc:
Platform: ARCHER
UM Version:

Description

Hello,

I have updated the settings of my suite u-ba937 to add the PPTRANSFER task following this: http://cms.ncas.ac.uk/wiki/Docs/PostProcessingAppArcherSetup .
When I try to run the suite I now get this error:

[FAIL] cylc validate -v --strict u-ba937 # return-code=1, stderr=
[FAIL] WARNING: deprecated items were automatically upgraded in 'suite definition':
[FAIL]  * (6.11.0) [runtime][RETRIES][retry delays] -> [runtime][RETRIES][job][execution retry delays] - value unchanged
[FAIL] ERROR, bad graph node format:
[FAIL]   coupled[-P1M] => coupled => POSTPROC:succeed-all => pptransfer => \
[FAIL] Correct format is NAME(<PARAMS>)([CYCLE-POINT-OFFSET])(:TRIGGER-TYPE)

What did I do wrong?

Thanks,

Vittoria

Change History (32)

comment:1 Changed 8 months ago by ros

  • Component changed from Coupled model to Rose/Cylc

Hi Vittoria,

This is one of those annoying ones where everything looks fine, except that there are extra spaces at the end of a line. Trailing spaces are OK in a lot of places, but not in the middle of a graph line, where they end up after the line-continuation backslash.

The offending line is:

pptransfer {{ '=> \\' if HOUSEKEEP else '' }}

Remove the extra spaces at the end of this line and it should validate and run ok.

Cheers,
Ros.
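
Trailing whitespace is invisible in most editors, so it can help to search for it directly. The following is a minimal sketch, assuming a POSIX grep and sed; the sample graph lines (including the housekeep task name) are made up for illustration and are not copied from u-ba937.

```shell
# Write a small sample graph section to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
coupled[-P1M] => coupled => POSTPROC:succeed-all => \
pptransfer {{ '=> \\' if HOUSEKEEP else '' }}
housekeep
EOF

# Append two spaces to line 2 to mimic the invisible trailing whitespace.
padded=$(mktemp)
sed '2s/$/  /' "$sample" > "$padded"

# List every line (with its line number) that ends in whitespace.
bad_lines=$(grep -n '[[:space:]]$' "$padded")
echo "$bad_lines"

rm -f "$sample" "$padded"
```

Running the same grep against the real suite.rc should list any lines that need trimming.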

comment:2 Changed 8 months ago by mvguarino

Hi Ros,

Thanks! I could never have guessed that was the problem, now it works… but I got a different error:

[FAIL] ssh -oBatchMode=yes dtn02.rdf.ac.uk bash --login -c \'ROSE_VERSION=2016.11.1\ rose\ suite-run\ -v\ -v\ --name=u-ba937\ --run=restart\ --remote=uuid=85037dba-967c-4077-8a9f-678607b999e8\' # return-code=255, stderr=
[FAIL] Permission denied (publickey,password).

Note that I can log in from PUMA to dtn02 using my ARCHER username without needing a password or passphrase.

Any idea what the problem could be?

Thank you,

Vittoria

comment:3 Changed 8 months ago by ros

Hi Vittoria,

Can you try adding the following to your ~/.ssh/config file on PUMA.

Host dtn02 dtn02.rdf.ac.uk
Hostname dtn02.rdf.ac.uk
User <your archer username>
ForwardAgent no

Regards,
Ros.

comment:4 Changed 8 months ago by mvguarino

Hi Ros,

That didn't work, I still get the same error.

Vittoria

comment:5 Changed 8 months ago by ros

Hi Vittoria,

Can you remind me, have you used the transfer app before at all with any other suites?

Regards,
Ros.

comment:6 Changed 8 months ago by mvguarino

Hi Ros,
I have never used the transfer app before. Trying now for the first time with this suite.

Vittoria

comment:7 Changed 8 months ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Vittoria,

OK. Can you please copy my directory ~ros/roses/u-al624 on PUMA and then do a rose suite-run. This suite is set up to test connections between PUMA, DTN02 and JASMIN. Let me know how that goes.

Also have you set up your ssh between dtn02 and jasmin-xfer2.ceda.ac.uk so that you're not prompted for a passphrase?

Regards,
Ros.

comment:8 Changed 8 months ago by mvguarino

Mhhhh I got the same error again:

[INFO] symlink: /home/mvguarino/cylc-run/u-al624 <= /home/mvguarino/.cylc/u-al624
[FAIL] ssh -oBatchMode=yes dtn02.rdf.ac.uk bash --login -c \'ROSE_VERSION=2016.11.1\ rose\ suite-run\ -v\ -v\ --name=u-al624\ --run=run\ --remote=uuid=26a77bc5-2d9f-4430-81fb-f2e9acb77fa9\' # return-code=255, stderr=
[FAIL] Permission denied (publickey,password).

Is there a way I can tell the system to ssh using my archer username?
If I type in ssh -oBatchMode=yes vittoria@dtn02.rdf.ac.uk I successfully connect. I tried to change the ‘host’ in suite.rc but that didn’t work.
I did set up the ssh-agent between dtn02 and JASMIN, and again I can log in successfully using my JASMIN username.

comment:9 Changed 8 months ago by ros

Hi Vittoria,

Regarding using your ARCHER username to ssh to dtn02: that was the bit I asked you to add to your ~/.ssh/config file on PUMA above. ssh will then pick up your username from this file.

Once you've checked the entries can you verify that this is working as expected by running ssh without specifying your username on PUMA:

ssh dtn02.rdf.ac.uk

Similarly on dtn02 make sure you have the following in your /nerc/n02/n02/vittoria/.ssh/config:

Host jasmin-xfer2 jasmin-xfer2.ceda.ac.uk
Hostname jasmin-xfer2.ceda.ac.uk
User <your jasmin username>
IdentityFile <path to your jasmin ssh-key>
ForwardAgent no

If you can't run the ssh commands without specifying your usernames, it implies there is something wrong with the ssh config files. If you can't see the problem, can you copy your .ssh/config files to somewhere I can see them please.

Ros.
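
One way to narrow down a config problem like this is to check whether the hostname being used is actually covered by a Host line. The sketch below is a simplified illustration, not how ssh itself parses the file: it ignores negated patterns and Match blocks, and the function name host_is_matched and the example.org hosts are hypothetical.

```shell
# host_is_matched CONFIG HOSTNAME
# Succeeds if HOSTNAME matches any pattern listed on a "Host" line in CONFIG.
# Simplification: negated patterns ("!name") and "Match" blocks are ignored.
host_is_matched() {
    config=$1
    target=$2
    # Collect every pattern that appears after the Host keyword, one per line.
    patterns=$(awk 'tolower($1) == "host" { for (i = 2; i <= NF; i++) print $i }' "$config")
    while IFS= read -r pattern; do
        [ -n "$pattern" ] || continue
        case $target in
            $pattern) return 0 ;;   # unquoted, so * and ? act as wildcards
        esac
    done <<EOF
$patterns
EOF
    return 1
}

# Hypothetical config with a deliberately mistyped second pattern.
demo_config=$(mktemp)
cat > "$demo_config" <<'EOF'
Host login*.example.org
    User alice

Host xfer xfre.example.org
    Hostname xfer.example.org
    User alice
EOF

for name in login2.example.org xfer xfer.example.org; do
    if host_is_matched "$demo_config" "$name"; then
        echo "$name: matched"
    else
        echo "$name: NOT matched by any Host line"
    fi
done
```

Where the installed OpenSSH is recent enough to support it, `ssh -G <hostname>` prints the settings ssh would actually apply, which is a more authoritative check than this sketch.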

comment:10 Changed 8 months ago by mvguarino

This is my config file on PUMA (I have also made a copy of it in /home/mvguarino, you should be able to see it)

Host login*.archer.ac.uk
    User vittoria

Host dtn02 dtno2.rdf.ac.uk
Hostname dtn02.rdf.ac.uk
User vittoria
ForwardAgent no

And yes, ssh dtn02.rdf.ac.uk does not work.

I have added what you suggested to my config file on dtn02, and now I can ssh to xfer2 without specifying my username.

Vittoria

comment:11 Changed 8 months ago by mvguarino

Found the problem just now, having copied it to this ticket: there is a typo in the second dtn02 (dtno2)…
Sorry didn't see this earlier.

However, I now get this error:

[FAIL] ssh -oBatchMode=yes dtn02.rdf.ac.uk bash --login -c \'ROSE_VERSION=2016.11.1\ rose\ suite-run\ -v\ -v\ --name=u-ba937\ --run=restart\ --remote=uuid=acdf1cca-b45d-4b12-8ac5-b624e4cd7cf9\' # return-code=127, stderr=
[FAIL] bash: rose: command not found

comment:12 Changed 8 months ago by dcase

Vittoria,

your inability to run Rose may be similar to that in ticket #2022, which suggests exporting a variable into your ~/.profile:

export PATH=$PATH:$UMDIR/software/bin

then it can pick up the Rose executable.
Bear in mind too that ARCHER's filesystems are undergoing maintenance, which will affect any transfers that you're trying to perform. The status of the computers is shown here:

https://www.archer.ac.uk/status/

Dave

comment:13 Changed 8 months ago by ros

Hi Vittoria,

Please look in my .profile on dtn02 (username ros). I think from memory the path is /general/y07/umshared/software/bin. You'll need to set the full path as $UMDIR is not set on DTN.

Ros

comment:14 Changed 8 months ago by ros

Back to my desk now so just looked in your .profile. You've got the correct PATH export there, you just need to uncomment it.

comment:15 Changed 8 months ago by mvguarino

Hi,

I had already added the path to my .profile on dtn02 following the instructions given here (http://cms.ncas.ac.uk/wiki/Docs/PostProcessingAppArcherSetup - last point). It turned out I had to add it to my .bash_profile to make it work (I then commented out the one in .profile).
The suite is running now. However, I can’t see the transfer app in the GUI, which may be a bad sign…

Vittoria
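
Rose runs its remote command through `bash --login -c ...`, as the failing commands earlier in this ticket show. A Bash login shell reads only the first of ~/.bash_profile, ~/.bash_login and ~/.profile that exists, which is one plausible explanation for settings in .profile being ignored here. The sketch below demonstrates that behaviour in a throwaway HOME directory; the FROM_* variable names are invented for the demonstration.

```shell
# Create a scratch HOME containing both startup files.
demo_home=$(mktemp -d)
echo 'export FROM_PROFILE=yes' > "$demo_home/.profile"
echo 'export FROM_BASH_PROFILE=yes' > "$demo_home/.bash_profile"

# Start a login shell the same way the failing command does, and report
# which of the two files was actually read.
startup_report=$(HOME="$demo_home" bash --login -c \
    'echo "profile=${FROM_PROFILE:-no} bash_profile=${FROM_BASH_PROFILE:-no}"' \
    2>/dev/null | tail -n 1)
echo "$startup_report"

rm -rf "$demo_home"
```

With both files present, only .bash_profile takes effect. If both are wanted, a common arrangement is for .bash_profile to source .profile explicitly.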

comment:16 Changed 8 months ago by mvguarino

and actually, not sure things are related, but postproc_nemo and postproc_cice are now 'retrying' (suite u-ba937)

comment:17 Changed 8 months ago by ros

Hi Vittoria,

Looking at the status files in your cylc-run directory, am I correct in thinking that you are trying to add pptransfer to an already running suite? If so, this is rather more complicated. It looks like you have already run the model for 5 cycles. Can you confirm this is what you are doing, and also whether the first cycle is still showing in the cylc GUI or not. Thanks.

Ros.

comment:18 Changed 8 months ago by mvguarino

Hi Ros,

Yes, my simulation has already been running for quite a long time and the first cycle is not in the GUI anymore.
Am I trying to do something that is not feasible?

(guess the alternative would be to restart it as a new-run using archived restart files?)

Vittoria

comment:19 Changed 8 months ago by mvguarino

In the meantime (I don’t know what happened) postproc for nemo and cice is failing:

[WARN]  [SUBPROCESS]: Command: ncdump -hs /work/n02/n02/vittoria/cylc-run/u-ba937/share/data/History_Data/CICEhist/archive_ready/cice_ba937i_1d_19491101-19491201.nc
[SUBPROCESS]: Error = 1:
	ncdump: invalid option -- 's'
ncdump [-V|-c|-h|-u] [-v ...] [[-b|-f] [c|f]] [-l len] [-n name] [-d n[,n]] file
  [-V]             Display version of the HDF4 library and exit
  [-c]             Coordinate variable data and header information
  [-h]             Header information only, no data
  [-u]             Replace nonalpha-numerics in names with underscores
  [-v var1[,...]]  Data for variable(s) <var1>,... only
  [-b [c|f]]       Brief annotations for C or Fortran indices in data
  [-f [c|f]]       Full annotations for C or Fortran indices in data
  [-l len]         Line length maximum in data section (default 80)
  [-n name]        Name for netCDF (default derived from file name)
  [-d n[,n]]       Approximate floating-point values with less precision
  file             File name of input netCDF file  

Could you please advise on this error too? Something is wrong with the netcdf handling; I have never had this before.

Thanks,

Vittoria

comment:20 Changed 8 months ago by ros

Hi Vittoria,

I don't think it's possible to insert tasks into cycles that finished a long time ago, as the cycle information will have been cleaned up. So I very much doubt it would work for your suite. Equally, if you have been running this suite for many years already, it's not really feasible to manually insert the pptransfer task into every cycle that has already run.

I would suggest that it would be easier to restart as a new run and include the pptransfer from the beginning then.

Regarding the postproc_nemo & postproc_cice this error indicates a mismatch in the version of netcdf being used in the model run and the postproc.

I think the problem has occurred when you added the line:

pre-script = "module load nco/4.6.8; module load anaconda; export PYTHONPATH=$PYTHONPATH:$UMDIR/lib/python2.7; module list; ulimit -s unlimited"

to the [[POSTPROC]] family. I need to make it clear in the instructions that this may or may not be needed depending on your suite setup. You already had postproc running so this change wasn't needed. Try removing this line, reloading and retriggering.

Sorry that this is becoming a bit long-winded. In the Rose/Cylc world it's unfortunately impossible to write instructions that cover all possible setups.

Regards,
Ros.
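
The usage text in the error above mentions the HDF4 library, which suggests the ncdump being picked up was not the netCDF one. Loading extra modules typically prepends directories to PATH, so a different executable with the same name can shadow the intended one. The sketch below illustrates that shadowing with stub scripts in temporary directories; the directory names are hypothetical and are not real module paths.

```shell
# Create two stub "ncdump" executables in separate directories.
stub_root=$(mktemp -d)
mkdir -p "$stub_root/hdf4/bin" "$stub_root/netcdf/bin"
printf '#!/bin/sh\necho "stub: HDF4 ncdump"\n' > "$stub_root/hdf4/bin/ncdump"
printf '#!/bin/sh\necho "stub: netCDF ncdump"\n' > "$stub_root/netcdf/bin/ncdump"
chmod +x "$stub_root/hdf4/bin/ncdump" "$stub_root/netcdf/bin/ncdump"

# Whichever directory comes first in PATH wins the lookup.
first_found=$(PATH="$stub_root/hdf4/bin:$stub_root/netcdf/bin:$PATH"; command -v ncdump)
echo "$first_found"

rm -rf "$stub_root"
```

On the real system, running `command -v ncdump` after the same module load lines as the task uses should show which executable the postproc task finds.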

comment:21 Changed 8 months ago by mvguarino

Hi Ros,

That's fine, thank you. It didn't occur to me that this could be a problem. I thought I could just add the transfer app at any stage and the new files would be moved to JASMIN (while I would transfer all the others manually).
As the simulation is half-way now, I thought by doing this I would buy myself some time.

As for postproc, I did wonder if I had to add the pre-script line…
I removed it, and it works now!

Thanks,

Vittoria

comment:22 Changed 8 months ago by ros

Hi Vittoria,

Don't stop the suite just yet. I think I have a way of at least getting transfer to run from your current cycle without having to stop and do a new run. I'm just putting together some instructions.

Cheers,
Ros.

comment:23 Changed 8 months ago by ros

I think we can insert the tasks into the running suite so it will just start doing the transfer from where the suite has currently got to - I've just tried it and it's worked. If you haven't already stopped the suite and you'd like to give this a go try doing the following:

  1. In the Cylc GUI: Control -> Insert Task(s)…
  2. Set TASK-NAME.CYCLE-POINT=fcm_make_pptransfer.<YYYYMMDDT0000Z>, where <YYYYMMDDT0000Z> is an active cycle point (e.g. 19491201T0000Z)
  3. Leave stop-point=POINT blank
  4. Check the "Do not check if a cycle point is valid or not" box
  5. Insert, and wait for the task to complete. You may need to manually trigger it.
  6. Repeat steps 1-5 for the task names fcm_make2_pptransfer and pptransfer

Hopefully that will work. You may need to insert the pptransfer task into all the active cycle points. Once it's been inserted into the last active cycle point showing in the cylc GUI, it should then be included automatically in all new ones.

Regards,
Ros.
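
To avoid typing mistakes in the Insert Task(s) dialog, the three TASK-NAME.CYCLE-POINT values can be generated in the order given in the steps above. This is only a small helper sketch; the function name task_ids is made up, and the format check assumes cycle points of the form shown in the example (19491201T0000Z).

```shell
# task_ids CYCLE_POINT
# Print TASK-NAME.CYCLE-POINT for the three tasks, in insertion order.
task_ids() {
    cycle_point=$1
    # Expect the YYYYMMDDTHHMMZ form, e.g. 19491201T0000Z.
    if ! printf '%s\n' "$cycle_point" | grep -Eq '^[0-9]{8}T[0-9]{4}Z$'; then
        echo "unexpected cycle point format: $cycle_point" >&2
        return 1
    fi
    for task in fcm_make_pptransfer fcm_make2_pptransfer pptransfer; do
        echo "$task.$cycle_point"
    done
}

task_ids 19491201T0000Z
```

Each printed line can be pasted into the TASK-NAME.CYCLE-POINT field in turn.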

comment:24 Changed 8 months ago by mvguarino

Hi Ros,

Thanks! fcm_make_pptransfer and fcm_make2_pptransfer succeeded. Now I am waiting for the coupled task to run to see what happens with the PPTRANSFER task (currently waiting).

Fingers crossed,

Vittoria

comment:25 Changed 8 months ago by mvguarino

Hi Ros,

The transfer task is failing with what again seems to be a permission problem:

[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:nemocicepp.nl: skip missing optional source: namelist:moose_arch
[WARN]  [SUBPROCESS]: Command: rsync -av --stats --rsync-path=mkdir -p /gws/nopw/j04/pmip4_vol1/users/vittoria/u-ba937/19500101T0000Z && rsync /nerc/n02/n02/vittoria/u-ba937/19500101T0000Z/ jasmin-xfer2.ceda.ac.uk:/gws/nopw/j04/pmip4_vol1/users/vittoria/u-ba937/19500101T0000Z
[SUBPROCESS]: Error = 255:
	
            Access to this system is monitored and restricted to
            authorised users.   If you do not have authorisation
            to use  this system,  you should not  proceed beyond
            this point and should disconnect immediately.

            Unauthorised use could lead to prosecution.

    (See also - http://www.stfc.ac.uk/aup)

ssh_askpass: exec(/usr/libexec/openssh/ssh-askpass): No such file or directory
Permission denied (publickey,gssapi-keyex,gssapi-with-mic).
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]

[WARN]  Transfer command failed: rsync -av --stats --rsync-path="mkdir -p /gws/nopw/j04/pmip4_vol1/users/vittoria/u-ba937/19500101T0000Z && rsync" /nerc/n02/n02/vittoria/u-ba937/19500101T0000Z/ jasmin-xfer2.ceda.ac.uk:/gws/nopw/j04/pmip4_vol1/users/vittoria/u-ba937/19500101T0000Z
[ERROR]  transfer.py: Unknown Error - Return Code=255
[FAIL]  Command Terminated
[FAIL] Terminating PostProc...
[FAIL] transfer.py # return-code=1
Received signal ERR
cylc (scheduler - 2018-11-29T17:06:36Z): CRITICAL Task job script received signal ERR at 2018-11-29T17:06:36Z
cylc (scheduler - 2018-11-29T17:06:36Z): CRITICAL failed at 2018-11-29T17:06:36Z

I have checked and I can log into jasmin-xfer2 from dtn02, but I noticed this only works if I use the -A option to log into dtn02:

ssh -A dtn02.rdf.ac.uk

Otherwise I am asked for passphrase.

Vittoria

comment:26 Changed 8 months ago by ros

Having to use the -A option (which forwards an existing agent, which is no good for cylc) implies that your ssh-agent is not running properly on dtn02.

Log in to dtn02 from PUMA (without using -A) and try running ssh-add to add your JASMIN key to the agent. I suspect that you may get an error connecting to the agent. If so, you will need to remove the ~/.ssh/environment.dtn02 file, log out and back in again to start up a new agent, and then run ssh-add. I'm hoping that will fix the problem.

Regards,
Ros.
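
When checking the agent in this way, the exit status of `ssh-add -l` distinguishes between an agent holding keys, an agent with no keys, and no reachable agent at all. The sketch below maps those statuses to messages; the function name describe_agent_status is invented here, and the live check only runs if ssh-add is installed.

```shell
# describe_agent_status STATUS
# Translate the exit status of "ssh-add -l" into a short description.
describe_agent_status() {
    case $1 in
        0) echo "agent reachable, at least one key loaded" ;;
        1) echo "agent reachable, but no keys loaded" ;;
        2) echo "cannot contact an agent" ;;
        *) echo "unexpected exit status: $1" ;;
    esac
}

# Query the agent for the current session, if ssh-add is available.
if command -v ssh-add >/dev/null 2>&1; then
    status=0
    ssh-add -l >/dev/null 2>&1 || status=$?
    describe_agent_status "$status"
fi
```

A result of "cannot contact an agent" in a fresh login (without -A) would match the situation described above.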

comment:27 Changed 8 months ago by mvguarino

Hi Ros,

There was indeed a problem with the ssh-agent; however, restarting it and running ssh-add again didn’t solve the problem. The ssh-agent runs fine within the current session but I am asked for the passphrase at each login (come to think of it, this also happens when I log into JASMIN from my local unix account).
The only way around it I could find is the following (for future reference in case someone has the same problem):
I generated a new key pair (public and private) on dtn02 without a passphrase. I copied the new public key into the authorized_keys2* file on JASMIN and ran ssh-add on dtn02 to add the new private key.
This seemed to work, but to avoid similar problems in future I added the following to my .bash_profile on dtn02:

eval $(ssh-agent)
ssh-add ~/.ssh/new_private_key

so the new identity is added at each login.

Everything seems to be working now, the transfer app is moving the desired data to JASMIN!

Thanks for your help,

Vittoria

*when I added the new public key to the authorized_keys file, the latter kept on being overwritten (I don’t know why and by what process) every 10 min or so and the new key disappeared (and the access to the machine with it).

comment:28 Changed 8 months ago by ros

Hi Vittoria,

It's great that you have tried other things out; however, using a passphraseless key is a breach of JASMIN security. JASMIN automatically overwrites the authorized_keys file every few minutes to be sure that it only contains the key that you have uploaded to the JASMIN portal.

I know you want to get on with transferring the data, but we will still need to work out what is going on here. Before you added the above 2 lines to your .bash_profile on dtn02, can you confirm you definitely had:

. ~/.ssh/ssh-setup

or

. ~/.ssh/setup

in your .bash_profile, depending on what you called the script? The essential point is the "." at the beginning. Without it the script runs in a separate process, so the agent settings it makes are lost as soon as the script exits.

A colleague has also just pointed out that you have over 80 ssh-agent processes currently running on dtn02 which will cause problems. The eval $(ssh-agent) starts up a new ssh-agent process on every single login.

You will need to kill all these processes and then please try the original ssh-setup again.

Regards,
Ros.
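
The general idea behind reusing one agent, rather than starting a new one at every login, can be sketched as follows. This is not the ssh-setup script from the setup instructions, only an illustration of the pattern; the function names and the agent-env.example file name are hypothetical, and an existing socket file is taken as a sign of a live agent, which is a weaker check than asking the agent itself with `ssh-add -l`.

```shell
# agent_env_is_usable FILE
# Source FILE if it exists, then succeed only if SSH_AUTH_SOCK points at
# an existing socket.
agent_env_is_usable() {
    [ -f "$1" ] || return 1
    . "$1" >/dev/null
    [ -n "${SSH_AUTH_SOCK:-}" ] && [ -S "$SSH_AUTH_SOCK" ]
}

# reuse_or_start_agent FILE
# Reuse the agent recorded in FILE; start a new one only if none is usable.
# FILE should be given as a full path.
reuse_or_start_agent() {
    if ! agent_env_is_usable "$1"; then
        # Save the agent's environment settings, dropping its "echo" line.
        (umask 077; ssh-agent -s | grep -v '^echo' > "$1")
        . "$1" >/dev/null
    fi
}

# Harmless demonstration: a missing environment file is reported as unusable.
if agent_env_is_usable "/nonexistent/agent-env.example"; then
    echo "existing agent can be reused"
else
    echo "no usable agent recorded"
fi
```

A login script would call something like `reuse_or_start_agent "$HOME/.ssh/agent-env.example"` once, so repeated logins share a single agent instead of accumulating new ones.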

comment:29 Changed 8 months ago by mvguarino

Hi Ros,

ahaha, sorry I will change back my .bash_profile, as that is causing problems.

I did have in my .profile

. ~/.ssh/ssh-setup

However, just as with the $PATH environment variable, that was not taking effect (and I didn't realize it until now). Now that I have added the line above to my .bash_profile it is finally working!

Vittoria

comment:30 Changed 8 months ago by ros

Hi Vittoria,

Phew! Glad it's all working now.
I have updated the setup instructions with some of the gotchas encountered, which will hopefully make it a little easier for the next person.

Have a good weekend.

Regards,
Ros.

comment:31 Changed 8 months ago by mvguarino

Many thanks for your help,

Vittoria

comment:32 Changed 7 months ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed