Useful information for running with Rose

Hints and tips

Switching versions of Rose and/or cylc

Use these variables:

export CYLC_VERSION=x.y.z
export ROSE_VERSION=YYYY.MM.DD

Note: You should use the same versions of Rose and Cylc on PUMA and ARCHER.

Viewing the suite run graph without running

When developing suites, it can be useful to check what the run graph looks like after jinja evaluation etc. To do this without running the suite:

rose suite-run -i --name=puma-aa045  # install suite in cylc db only
cylc graph puma-aa045                # view graph in browser       

To just view the dependencies on the command line:

cylc ls -t puma-aa045

Setting the default size of the rose edit window

Setting the default size of the rose edit window and the width of the rose edit left hand menu pane can be very helpful.

Edit ~/.metomi/rose.conf

Adding the following information to the file sets the default size and width of the rose config-edit (rose edit) window:

[rose-config-edit]
SIZE_WINDOW = (1100, 650)
WIDTH_TREE_PANEL = 400
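
As a rough sketch, the edit can also be scripted. The helper below (add_rose_edit_size is a made-up name, not part of Rose) appends the section only if the file has no [rose-config-edit] heading yet, so running it twice is harmless. It assumes the window size and panel width shown above are the values you want. If the section already exists with other settings, it is left untouched.

```shell
# Hypothetical helper, not part of Rose.
# $1: path to the Rose user configuration (defaults to ~/.metomi/rose.conf)
add_rose_edit_size() {
    conf="${1:-$HOME/.metomi/rose.conf}"
    mkdir -p "$(dirname "$conf")"
    touch "$conf"
    # Append the section only when it is not already present.
    if ! grep -q '^\[rose-config-edit\]' "$conf"; then
        printf '%s\n' \
            '[rose-config-edit]' \
            'SIZE_WINDOW = (1100, 650)' \
            'WIDTH_TREE_PANEL = 400' >> "$conf"
    fi
}
```

Calling add_rose_edit_size with no argument updates ~/.metomi/rose.conf.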

For details of further customisations that can be made to the rose edit window see: http://metomi.github.io/rose/doc/rose-rug-config-edit.html#customisation

Launching Rose commands

It is possible to launch many of the Rose tools from the various GUIs. For example you can run or edit suites from rosie go, run suites from rose edit, and view log files from rose suite-gcontrol whilst the suite is running.

When running rose from the command line make sure to run from the appropriate roses/ directory or append the suite name using --name=puma-aa015, e.g.

rose suite-shutdown --name=puma-aa015

Be careful, though, because rose suite-run --name=puma-aa015 works differently: it runs the suite in the current directory but names it puma-aa015.

Stop archiving of log files

By default, when a suite is run, the log files from the previous run will be tarred up. To avoid this run rose suite-run with the flag --no-log-archive.

Diff'ing suites

There is no formal mechanism for this as yet. However, the rose config-dump tool rewrites all of the app files in a suite into a common format, which then allows diff to be run on the command line between suite files. For more information see: http://metomi.github.io/rose/doc/rose-command.html#rose-config-dump
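
One way to combine the two steps is sketched below. diff_suites is a made-up helper name, and the sketch assumes that rose config-dump accepts -C DIR to work on a directory other than the current one (check rose help config-dump for your version). Because config-dump rewrites files in place, the helper works on temporary copies and leaves both suites untouched.

```shell
# Hypothetical helper, not part of Rose: compare two suite directories
# after normalising copies of them with "rose config-dump".
diff_suites() {
    # $1, $2: paths to the two suite directories
    work=$(mktemp -d) || return 2
    if ! cp -r "$1" "$work/a" || ! cp -r "$2" "$work/b"; then
        rm -rf "$work"
        return 2
    fi
    # Assumption: -C DIR points config-dump at the copied suite.
    rose config-dump -C "$work/a" >/dev/null
    rose config-dump -C "$work/b" >/dev/null
    # Skip Subversion metadata; the exit status is that of diff
    # (0 = no differences, 1 = differences found).
    diff -r -x .svn "$work/a" "$work/b"
    status=$?
    rm -rf "$work"
    return "$status"
}
```

For example: diff_suites ~/roses/puma-aa045 ~/roses/puma-aa046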

Adding UM user diagnostics

This works in a different way to the old UMUI and no longer uses user-STASHmaster files.

Instead the STASHmaster file is held in the UM trunk. To make changes, place your modified version in the file/ subdirectory of the app, e.g.:

~/roses/puma-aa045/app/um/file/STASHmaster

Passing arguments to fcm_make

Rose deals with fcm_make as a special app, see: http://metomi.github.io/rose/doc/rose-rug-task-run.html#rose-task-run.built-in-app.fcm_make

To pass arguments, such as -vvv for full verbose output:

  • Set the environment variable ROSE_TASK_OPTIONS=-vvv
  • Or add args=-vvv at the top of the fcm_make rose-app.conf file.

Copying suites between repositories

Rosie allows you to easily copy suites between repositories.

Run rosie copy specifying the repository for the new suite with the --prefix flag. For example to copy a suite (puma-aa125) from the puma repository to the MOSRS u repository you would run:

rosie copy --prefix=u puma-aa125

Setting up rose host-select archer

Some suites use the command rose host-select to choose a machine to submit the suite to. This can be used to select the least-loaded server, but for ARCHER we use it to cope with times when some of the ARCHER login nodes are down.

To get rose host-select archer to work there is some setup required. (Note: You can just replace the rose host-select line in your suite.rc or site/archer.rc file with the name of the host, but you won't get the benefits.)

i) If your PUMA and ARCHER usernames are the same skip to step ii). Otherwise you will need to configure your SSH settings so that it knows your ARCHER username. Open the file ~/.ssh/config and add the following lines, replacing <archer-username> with your username:

Host login*.archer.ac.uk
   User <archer-username> 

To check this is working correctly, try to log in to ARCHER without giving your username:

ssh login.archer.ac.uk

ii) Next run the following script which logs into each of the hosts to add them to your ~/.ssh/known_hosts file. (Otherwise rose host-select will not be able to connect).

~um/um-training/setup-archer-hosts 

Note that if the script can't connect to one of the hosts, for example because it is down, rose host-select won't be able to access it. This shouldn't matter too much if it is just one host, but you can add the host at a later time by re-running the script.

To check this has worked correctly, run the command: rose host-select archer and it should return an active host.

Mail notifications

You can add cylc event handlers to your suite to email you when tasks run or fail. See the cylc documentation for more information: https://cylc.github.io/cylc/html/multi/cug-htmlse12.html#12.15

You will need to set your email address in your cylc configuration file. Open or create a new file ~/.cylc/global.rc and add the following lines, using your own email address in place of dummy-email:

[task events] 
  mail to = dummy-email

Then add event notifications to your suite's suite.rc file, for example:

        [[[events]]]
            mail events = succeeded, failed

These can go under [runtime] -> [[root]] or a specific task definition. For a full set of notifications see the documentation pointed to above.

Important: Rose notifications will not work on PUMA and are no longer recommended for use. Rose notifications have the form rose suite-hook, and any instances should be removed from the suite.

Merging in changes from another suite

You may have taken a copy of a suite, but there have been subsequent changes that you wish to include. FCM won't allow you to merge in changes from another suite, but you can do it with a direct svn command. You will need to know the full svn URL for the suite containing the changes and the revision number (use -c) or range (use -r), for example:

svn merge -c 23406 https://code.metoffice.gov.uk/svn/roses-u/a/a/7/7/4/trunk
svn merge -r 21186:23406  https://code.metoffice.gov.uk/svn/roses-u/a/a/7/7/4/trunk

If there are any clashes, you will need to resolve them.

Check which suites you have running

For a command-line listing of your running suites:

rose suite-scan 

To see a graphical summary status of all of your suites, use:

cylc gscan & 

You can then click on each of the suites to open the usual cylc suite control GUI.

Troubleshooting common errors

Rosie go asks for "username for u"

By default rosie is set up to load suites from the local puma repository and the Met Office Science Repository Service (MOSRS). If your MOSRS password isn't cached, Rosie will prompt for it at startup. Clicking 'cancel' then produces an error:

Traceback (most recent call last):
  File "/home/fcm/rose-2015.04.1/lib/python/rosie/browser/main.py", line 994, in handle_update_treemodel_local_status
    self.display_box.update_treemodel_local_status(local_suites,
AttributeError: 'MainWindow' object has no attribute 'display_box'
get_known_keys: {}

There are two potential solutions:

  1. Re-cache your MOSRS password
  2. Tell Rosie to only load puma suites:

rosie go --prefix=puma

Users that don't have a MOSRS account may wish to set this up as an alias.
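
A minimal sketch of such an alias, for a Bourne-style shell. The alias name rosiego is made up; choose any name that does not clash with an existing command.

```shell
# Hypothetical alias name; add this line to ~/.bashrc or ~/.profile.
alias rosiego='rosie go --prefix=puma'
```

Typing rosiego in a new interactive shell then starts rosie go with only the puma repository loaded.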

Unable to submit jobs (MONSooN)

The suite will fail straight away and the following error appears in the log/suite/err file:

Host key verification failed.
2015-01-21T14:56:23Z ERROR - [fcm_make.1] -Failed to construct job submission command
2015-01-21T14:56:23Z WARNING - Command '['ssh', '-oBatchMode=yes', '-oConnectTimeout=10', 'exvmsrose
.monsoon-metoffice.co.uk', 'mkdir -p "$HOME/cylc-run/nemovar_build" "$HOME/cylc-run/nemovar_build/lo
g/job"']' returned non-zero exit status 255
2015-01-21T14:56:23Z ERROR - [fcm_make.1] -submission failed 

This happens because ssh between the Cylc VM and the Rose VM does not work non-interactively: the host keys are not yet in your known_hosts file.

To solve this, log in to the Cylc VM and then back to the Rose VM, specifying the full host names, so that they are added to the known_hosts file.

  1. Check whether exvmscylc or exvmsrose appear in the known_hosts file already. If so, delete these entries, especially if you accessed the VMs before their rebuild:
    cd ~/.ssh
    mv known_hosts known_hosts.OLD
    sed '/^exvmsrose/d;/exvmscylc/d' known_hosts.OLD > known_hosts
    
  2. Now from exvmsrose, ssh into exvmscylc using the full host name:
    ssh exvmscylc.monsoon-metoffice.co.uk
    
    This should provide output something like this:
    The authenticity of host 'exvmscylc.monsoon-metoffice.co.uk (10.168.64.4)' can't be established.
    RSA key fingerprint is 98:c8:5e:b9:b3:d2:2f:c4:9c:89:78:08:d6:78:70:3a.
    Are you sure you want to continue connecting (yes/no)? 
    
    Type yes.
  3. Now from exvmscylc, log in to exvmsrose using the full host name:
    ssh exvmsrose.monsoon-metoffice.co.uk
    
    And again type yes at the prompt.
  4. Type exit to get back to the Cylc VM, then ssh into exvmsrose again; this should succeed without any interactive prompts.
  5. Now type exit twice to get back to the original Rose terminal, and try re-submitting the rose suite.

Unable to submit jobs; can't find cylc (MONSooN)

rose suite-run on exvmsrose fails because it is unable to find cylc:

exvmsrose$ rose suite-run
...
[FAIL] WARNING:
[FAIL] This computer is provided for the processing of official information.
[FAIL] Unauthorised access described in Met Office SyOps may constitute a criminal offence.
[FAIL] All activity on the system is liable to monitoring.
[FAIL] bash: line 11: cylc: command not found 

Ensure you have set up paths to FCM, Rose, Cylc correctly: See https://code.metoffice.gov.uk/trac/home/wiki/AuthenticationCaching#Monsoon. In particular ensure that PATH=$PATH:~fcm/bin is set at the top of the appropriate file and that the [[ $- != *i* ]] && return section is at the end.

No gcylc window

When submitting a job, no gcylc window appears.

Sometimes the GUI is slow to load. If it does not appear at all, however, check that X11 forwarding is set up from your initial location and from the lander.

To do so, ssh with the -Y option or, alternatively, append the following lines to your ~/.ssh/config file:

Host *
ForwardX11 yes

Problems shutting down suites

Types of shutdown

By default, when you try to shut down a suite, cylc will wait for any currently running tasks to finish before stopping, which may not be what you want. You can also tell cylc to kill any active processes, or to ignore running processes and force the suite to shut down anyway. The latter is what you will need to do if the suite has got stuck:

rose suite-shutdown -- --now

To access these options in the cylc GUI, go to "Control" → "Stop Suite". See also rose help suite-shutdown for further details.

Forcing shutdown

Sometimes, after trying to shut down a suite, it will still appear to be running.

First make sure you have used the correct shutdown command and aren't waiting for any unfinished tasks (see above). It can take cylc a little while to shut down everything properly, so be patient and give it a few minutes.

If it still appears to be running (for example you get an error when you try to re-start the suite), you may have to do the following:

  • Manually kill the active processes:
    Get a list of processes associated with the suite. For example, for suite u-ak194 I would run:
    puma u-ak194$ ps -lfu annette  | grep u-ak194
    0 S annette   2735  5230  0  80   0 -  1322 pipe_w 11:53 pts/157  00:00:00 grep u-ak194
    1 S annette  18713     1  0  80   0 - 59140 -      08:41 ?        00:00:08 python /home/fcm/cylc-6.11.4/bin/cylc-restart u-ak194
    1 S annette  18714 18713  0  80   0 - 28132 futex_ 08:41 ?        00:00:00 python /home/fcm/cylc-6.11.4/bin/cylc-restart u-ak194
    1 S annette  18715 18713  0  80   0 - 28132 futex_ 08:41 ?        00:00:00 python /home/fcm/cylc-6.11.4/bin/cylc-restart u-ak194
    1 S annette  18717 18713  0  80   0 - 28132 futex_ 08:41 ?        00:00:00 python /home/fcm/cylc-6.11.4/bin/cylc-restart u-ak194
    1 S annette  18718 18713  0  80   0 - 28132 pipe_w 08:41 ?        00:00:00 python /home/fcm/cylc-6.11.4/bin/cylc-restart u-ak194
    
    This gives a list of processes. The number in the 4th column is the process ID. Use this to kill each of the processes, e.g.:
    kill -9 18713
    
  • Delete the port file:
    This lives under ~/.cylc/ports/. For example:
    rm ~/.cylc/ports/u-ak194
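
Picking the process IDs out of the ps listing can be scripted. The following is only a sketch: suite_pids is a made-up helper name, and it assumes the ps -lf layout shown above, with the process ID in the 4th column and the suite name as the last word of the cylc-run or cylc-restart command line.

```shell
# Hypothetical helper: read "ps -lfu USER" output on stdin and print the
# process ID (4th column) of each cylc-run/cylc-restart process whose
# last argument is the given suite name. The grep process itself is
# skipped because its command line does not contain cylc-run or
# cylc-restart.
suite_pids() {
    # $1: suite name, e.g. u-ak194
    awk -v suite="$1" '$NF == suite && /cylc-(run|restart)/ { print $4 }'
}

# Example use (review the list before killing anything):
#   ps -lfu "$USER" | suite_pids u-ak194
#   kill -9 $(ps -lfu "$USER" | suite_pids u-ak194)
```

Printing the list first and checking it by eye is the safer habit, since kill -9 gives the processes no chance to clean up.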
    

Monsoon

On Monsoon, you may need to log in to the cylc VM to force a suite shutdown.

You may occasionally see that a rose suite looks like it is running, i.e. rose suite-scan gives something like:

puma-aa046 gmslis@exvmscylc:7767 

Or trying to re-run the suite with rose suite-run gives an error:

[FAIL] Suite "puma-aa046" may still be running.
[FAIL] Host "exvmscylc" has process:
[FAIL]     9468 python /home/fcm/cylc-6.1.2/bin/cylc-run puma-aa046
[FAIL]     9469 python /home/fcm/cylc-6.1.2/bin/cylc-run puma-aa046
[FAIL] Try "rose suite-shutdown --name=puma-aa046" first? 

However, when trying to shut down the suite, rose suite-stop reports that the suite isn't running:

Really shutdown puma-aa046 at exvmscylc? [y/n] y
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
'ERROR, remote port file not found' 

This is due to orphaned tasks on the Cylc VM, which can occur when exvmscylc and exvmsrose cannot communicate non-interactively.

To solve this, log in to exvmscylc and run cylc scan, which should show the running tasks. To stop these, type:

cylc shutdown --now

This may report something like "Command queued", but re-running cylc scan will show that the tasks are now finished.

Device or resource busy when running suite

Unable to run suite.

exmsrose puma-aa045$ rose suite-run
[INFO] create: log.20150121T164500Z
[INFO] delete: log
[INFO] symlink: log.20150121T164500Z <= log
[INFO] log.20150121T163546Z.tar.gz <= log.20150121T163546Z
[FAIL] [Errno 16] Device or resource busy: 'log.20150121T163546Z/job/1/fcm_make/01/.nfs0000000000451b5d00000065'

You have one of the output files open somewhere, which means rose can't archive the old output. Close the file.

Warning when opening gcylc

A warning appears when the Rose/cylc run-time task manager, called gcylc, opens:

ParseError: File not found: /home/annette/.cylc/gcylc.rc
WARNING: user config parsing failed (continuing)

This is harmless, but to avoid the warning create an empty file in your home space:

touch ~/.cylc/gcylc.rc

.vimrc error with fcm commit

When trying to commit changes to a rose suite the following error occurs:

exmsrose puma-aa045$ fcm commit
[info] vi: starting commit message editor...
Error detected while processing /home/aospre/.vimrc:
line    5:
E518: Unknown option: foldlevelstart=99
Press ENTER or type command to continue
[FAIL] log message is empty

This error occurs with the Cylc syntax highlighting for Vim. Changing the default FCM editor to be vim rather than vi stops this error.

In your .profile add the following line:

export SVN_EDITOR=vim

Jinja error from rose suite-run

After editing the suite, a cryptic Jinja error message appears from rose suite-run:

[FAIL] cylc validate -v --strict puma-aa069 # return-code=1, stderr=
[FAIL] Jinja2 Error:
[FAIL]   File "<unknown>", line 58, in template
[FAIL] TemplateSyntaxError: expected token 'end of print statement', got '='

This is caused by an error in the suite.rc file, either in the Jinja syntax or in the Rose variables it uses.

To debug, go to ~/cylc-run/<suite-name>, open the suite.rc file and navigate to the line number causing the error.

If the suite.rc file uses includes, then to generate the parsed file run:

cylc view -i <suite-name>

After identifying the error, fix it in the original suite.rc or rose-suite.conf file in the roses directory. Editing the file in the cylc-run directory will have no effect!
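
To jump straight to the reported line, a small helper such as the one below can print it with a few lines of context. This is a sketch only: show_lines is a made-up name, and it simply reads whichever file you point it at (for example the installed suite.rc under ~/cylc-run).

```shell
# Hypothetical helper: print line N of a file with C lines of context
# on either side, marking line N with ">".
show_lines() {
    # $1: file, $2: line number, $3: lines of context (default 3)
    awk -v n="$2" -v c="${3:-3}" '
        NR >= n - c && NR <= n + c {
            printf "%s%5d: %s\n", (NR == n ? ">" : " "), NR, $0
        }' "$1"
}

# Example: inspect line 58 from the error message above.
#   show_lines ~/cylc-run/puma-aa069/suite.rc 58
```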

Can't view output in Rose bush

For MONSooN:

Note that running rose suite-log doesn't work.

To access Rose bush from exvmsrose run:

firefox http://localhost/rose-bush

For suites submitted to ARCHER from PUMA:

Running rose suite-log should work properly.

Alternatively in a browser navigate to http://puma.nerc.ac.uk/rose-bush

Other problems running rose-bush:

Sometimes clicking on the log file in Rose bush gives a "403 Forbidden" error with a Python traceback:

Traceback (most recent call last):
  File "/usr/local/python/lib/python2.6/site-packages/cherrypy/_cprequest.py", line 606, in respond
    cherrypy.response.body = self.handler()
  File "/usr/local/python/lib/python2.6/site-packages/cherrypy/_cpdispatch.py", line 25, in __call__
    return self.callable(*self.args, **self.kwargs)
  File "/home/fcm/rose/lib/python/rose/bush.py", line 301, in view
    f_name = self._get_user_suite_dir(user, suite, path)
  File "/home/fcm/rose/lib/python/rose/bush.py", line 472, in _get_user_suite_dir
    *paths))
  File "/home/fcm/rose/lib/python/rose/bush.py", line 446, in _check_dir_access
    raise cherrypy.HTTPError(403)
HTTPError: (403, None)

This is because the Rose bush server does not have permissions to read the log file. Navigate to your log files on the command line then manually add read permissions:

cd ~/cylc-run/<suite-id>/log
chmod -R a+r *

Then refresh the browser and the file should appear.
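
To see which files are affected before changing anything, something like the following sketch can help. unreadable_logs is a made-up helper name; it only checks the "other" read bit on regular files, which is the bit that chmod a+r sets for users outside your group.

```shell
# Hypothetical helper: list regular files under a directory that are
# not readable by "other" users (permission bit 004 unset).
unreadable_logs() {
    # $1: directory to search, e.g. ~/cylc-run/<suite-id>/log
    find "$1" -type f ! -perm -004 -print
}
```

An empty result means every log file under that directory already has world-read permission.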

So that future log files have the correct permissions, add the line -W umask = 0022 to your suite.rc under the [[HPC]] [[[directives]]] namespace:

    [[XC30]]
...
        [[[directives]]]
...
            -W umask = 0022

Sometimes log files may also not show up if the connection between PUMA and ARCHER has been lost. In this case log on to ARCHER and make the job logs in the suite output directory readable:

chmod -R a+r ~/cylc-run/<suite-id>/log/job

Then browse the log files manually. You should also check your ssh agent on PUMA is still active.

rose host-select archer error

If your PUMA and ARCHER usernames are different you may see the following when submitting a suite:

RosePopenError: bash -ec H=$(rose\host-select\archer);\echo\$H # return-code=1, stderr=
[WARN] login5.archer.ac.uk: (ssh failed)
[WARN] login7.archer.ac.uk: (ssh failed)
…..
[FAIL] No hosts selected

This is because rose is running a command called rose host-select to choose a machine to submit the suite to, and this command needs to know your ARCHER username. To set this follow these instructions: http://cms.ncas.ac.uk/wiki/RoseCylc/Hints#Settinguprosehost-selectarcher

Unable to access STASHmaster from branch on ARCHER

Some suites may reference files held in the repository for use at runtime. The most common example of this is the STASHmaster file. To make a change to the STASHmaster file requires editing the file in a branch and setting the path to this in the suite. However the method described in the instructions below will not work on ARCHER:

https://code.metoffice.gov.uk/doc/um/latest/um-training/stashmaster.html

You will get an error like:

[FAIL] file:STASHmaster=source=fcm:um.xm_tr/rose-meta/um-atmos/HEAD/etc/stash/STASHmaster@31236: bad or missing value
Received signal ERR

This is because the job tries to access the repository from the ARCHER queues, which will not work. Note that this does work on the XCS machines, so a suite you are porting from there may contain something like this.

The solution is to make the suite extract the file on PUMA and then copy over to ARCHER with the other suite files.

You will have a line in app/um/rose-app.conf such as:

[file:STASHmaster]
source=fcm:um.xm_tr/rose-meta/um-atmos/HEAD/etc/stash/STASHmaster@31236

This should be removed, and the following added to the rose-suite.conf file:

[file:app/um/file/STASHmaster] 
source=fcm:um.xm_tr/rose-meta/um-atmos/HEAD/etc/stash/STASHmaster@31236

Note the source= line is identical but the target [file:] line needs to reflect the intended location in the suite directory structure.

This will extract the file on PUMA and install it to the app/um directory on ARCHER, which will have exactly the same effect as extracting on ARCHER directly would have done.

This method will work for any similar files.

Last modified on 04/07/17 11:46:43