Opened 5 months ago

Last modified 11 days ago

#2140 pending help

Request collaboration to install latest UM (vn10.3-vn10.8, fcm, rose and cylc) at UoLeeds

Reported by: markr Owned by: um_support
Priority: normal Component: UM Model
Keywords: install fcm, rose, cylc Cc:
Platform: Other UM Version: 10.7


Hello CMS,
I would like to know if you can advise on a sensible way to install the latest form of the UM on ARC3. Initially we had considered working with Polaris, but I have been advised that the future of that system is uncertain. ARC3 at least is a Leeds-specific HPC and is expected to be available over the next 5 years.

I spoke with a senior member of ARC at Leeds and they would like us to provide them with a list of foundation software and overview of the steps required to get the service running.

Currently the demand is only from 2 researchers but when it is available I can see it becoming more popular.

Many thanks,

Attachments (1)

GCOM_uoleeds_arc5_test.png (89.4 KB) - added by markr 2 months ago.
GCylc image of progress of GCOM rose stem --group=arc3_intel_test


Change History (40)

comment:1 Changed 4 months ago by grenville


I think we should have a meeting to discuss this. I'm sure it's doable with a small foundation software requirement, but doing this by email won't be efficient.

Rose stem makes later versions simpler to install - that certainly was the case for UM 10.7.

How are you set for some time in the week of April 24th?


comment:2 Changed 4 months ago by markr

Hi Grenville,
okay, that week has 3 days possible: 24, 26, 28. Or the afternoon of 25th for a short meeting.


comment:3 Changed 4 months ago by markr

Hello CMS,
after our telecon I had a look at where the queueing system is defined and see that, in the Rose jobs I use, the "suite.rc" sets a value for [job submission] method = pbs, and there is a set of Python handlers in the cylc home directory: /home/fcm/cylc-6.11.4/lib/cylc/batch_sys_handlers/

So I presume it is a matter of converting the suite.rc PBS contexts to SGE contexts.
I will provide a sample SGE configuration for the ARC3 system.

comment:4 Changed 4 months ago by ros

Hi Mark,

I was going to send my example polaris suite which has this all in, but got diverted.

You need to set, for example:

  batch system = sge
  -l h_rt = 00:01:00

There was a small bug in the cylc code for SGE at cylc-6.x which I have fixed on PUMA.
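Pulling those settings together: in a cylc-6 suite.rc they sit under the task's [[[job]]] and [[[directives]]] sections. A minimal sketch (the task name and runtime value are placeholders, not from this ticket):

```ini
[runtime]
    [[my_task]]
        [[[job]]]
            batch system = sge
        [[[directives]]]
            -l h_rt = 00:01:00
```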


comment:5 Changed 4 months ago by markr

Hello CMS,
a little more on the account setup at Leeds, from Martin Callaghan:

Hi Mark,

The shared accounts are actually project accounts which will be owned by you, 
and the 'ear' identifier is school specific. Within reason, you can have 
anything you like after the 'ear' bit.

If you use an existing project account, we can get this set up on ARC3 very quickly. 
If you want a new one, then it's a (paper) form to fill in and get it countersigned 
by Richard Rigby. I have a small supply of these paper forms.


So I think you will have to continue using *earhum*

If you have a Polaris UM suite then it would be nice to compare it to an Archer/Monsoon equivalent as, for my work, I will be converting suites from the UKCA team (Mohit Dalvi).

With Juliane's project I would likely have to convert Met Office internal suites for ARC3 use (i.e. PBS to SGE, and data paths for ARC3).


Last edited 4 months ago by markr

comment:6 Changed 3 months ago by ros

  • Status changed from new to pending

comment:7 Changed 3 months ago by markr

Progress to date:

  1. ARC have enabled the ssh access to from (using IP address)
  1. I have transferred fcm, rose and cylc from the ARCHER umshared software area but found some broken links, e.g.

lrwxrwxrwx 1 earmgr EAR 38 Dec 12 13:55 keyword.cfg -> ../../../fcm_admin/etc/fcm/keyword.cfg

Do I need a fcm_admin folder?

  1. some folders on ARCHER are very large and I do not want to copy 7TB of files indiscriminately.
  1. the .cylc/global.rc on arc3 appears to "work". I did it first on arc3 and then realised I should do it on puma.

cylc get-site-config

stops at [batch systems?]

I am still not sure where to set the batch submission method to "sge".

The work continues…

comment:8 Changed 3 months ago by markr

Have now tried to run the "jasmin test suite"; see ~markr/roses/arc3_leeds_check

it fails like this:
markr@puma arc3_leeds_check $ rose suite-run
[INFO] create: /home/markr/cylc-run/arc3_leeds_check
[INFO] create: log.20170516T112551Z
[INFO] symlink: log.20170516T112551Z <= log
[INFO] create: log/suite
[INFO] create: log/rose-conf
[INFO] symlink: rose-conf/20170516T122551-run.conf <= log/rose-suite-run.conf
[INFO] symlink: rose-conf/20170516T122551-run.version <= log/rose-suite-run.version
[INFO] create: share
[INFO] create: share/cycle
[INFO] create: work
[INFO] export CYLC_VERSION=6.11.4
[INFO] export ROSE_ORIG_HOST=puma
[INFO] export ROSE_VERSION=2016.11.1
[INFO] install: suite.rc~
[INFO] source: /home/markr/roses/arc3_leeds_check/suite.rc~
[INFO] install: suite.rc
[INFO] 0 suite(s) unregistered.
[INFO] REGISTER arc3_leeds_check: /home/markr/cylc-run/arc3_leeds_check
[INFO] symlink: /home/markr/cylc-run/arc3_leeds_check <= /home/markr/.cylc/arc3_leeds_check
[FAIL] ssh -oBatchMode=yes earmgr@… bash --login -c \'ROSE_VERSION=2016.11.1\ rose\ suite-run\ -v\ -v\ --name=arc3_leeds_check\ --run=run\ --remote=uuid=e93dff9b-53df-485f-9eab-d5ac4ccf1d4a\' # return-code=255, stderr=
[FAIL] Host key verification failed.

comment:9 Changed 3 months ago by ros

Hi Mark,

  1. You will only really need the fcm keyword.cfg file if you are allowing code checkouts directly on the ARC3 system. However, I would recommend creating the fcm_admin/etc/fcm folder as on ARCHER, but with a blank keyword.cfg file; then, as and when needed, you can populate it with any required repository keywords.
  1. The batch submission method is set in an individual suite's or rose-stem suite's suite.rc file under
  batch system = sge

Hope that helps
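Ros's first suggestion takes only a couple of commands; a sketch, assuming the shared software lives under $UMDIR on ARC3 (the exact location is an assumption):

```shell
# Create the directory layout fcm expects, with an empty keyword.cfg
# that can be populated with repository keywords later.
mkdir -p "$UMDIR/fcm_admin/etc/fcm"
touch "$UMDIR/fcm_admin/etc/fcm/keyword.cfg"
```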

comment:10 Changed 3 months ago by markr

NOTE ssh -Y earmgr@… works passwordless

Then I fixed the suite.rc to set markr as the owner, and it now fails as:
[INFO] install: suite.rc
[FAIL] ssh -oBatchMode=yes bash --login -c \'ROSE_VERSION=2016.11.1\ rose\ suite-run\ -v\ -v\ --name=arc3_leeds_check\ --run=run\ --remote=uuid=b268bc47-3c26-48b7-9b95-5a32e43fea6b\' # return-code=255, stderr=
[FAIL] Warning: Permanently added the ECDSA host key for IP address '' to the list of known hosts.
[FAIL] Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,hostbased).
markr@puma arc3_leeds_check

comment:11 Changed 3 months ago by markr

The job worked as background (hoorah!) after I changed the owner back to earmgr (the arc3 identity).
Then I activated SGE as the batch method.

The submission failed, so I looked at submitting the job directly, as no error or stdout logfiles had been created:

[earmgr@login2.arc3 01]$ ls
job  job.status
[earmgr@login2.arc3 01]$ qsub job
Unable to run job: must specify a value for h_rt (job runtime)

So I need to revise the suite.rc, or the appropriate Rose component, to set up the minimum settings for SGE to work with this basic case.
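One way to guarantee every SGE task gets a runtime is to set the directive on the root family, which all tasks inherit; a sketch (the h_rt value is just a placeholder, not a recommendation):

```ini
[runtime]
    [[root]]
        [[[job]]]
            batch system = sge
        [[[directives]]]
            -l h_rt = 00:10:00
```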

comment:12 Changed 3 months ago by markr

Hi Annette, (I just realised I should copy this message to the ticket)

  1. On puma I have used your jasmin test suite. See ~markr/roses/arc3_leeds_check.

I have a rose.conf in ~/.metomi with the recommended settings.

markr@puma arc3_leeds_check $ cat /home/markr/.metomi/rose.conf


the suite runs okay but the files seem to be created under $HOME on my arc3 account

  1. On ARC3 I set the following in my .bashrc

UMDIR=${HOME}/umshared (this will change to ~/earumsh when I get my installation working)



MCYLC=${DATADIR}/cylc-run (so I can get there quickly)

  1. The result of the run is under $HOME/cylc-run/arc3_leeds_check and there is an extra directory in "work" that is


  1. To see my account you would have to apply for ARC3 account from here:

I did send ARC help a message several weeks ago that I would like help from your team and mentioned the polaris accounts.

Perhaps you can name me as principal investigator as Technical Head of CEMAC. Otherwise it is to support Juliane Schwendike and her WCSSP work using the Unified model.

Also you would be a "NEW USER: ARC2 and ARC3 account" (their home dirs are common).


On 23/05/17 16:58, Annette Osprey wrote:

Hi Mark,

For now those lines should go in your $HOME/.metomi/rose.conf file on PUMA, but once you are happy with the configuration we can put it in the central configuration file on PUMA.

No need to worry about ross, as I don't think you need any configuration on the HPC side. In fact ross is an old way of managing the installations and the Met O have newer scripts but we haven't had a chance to update yet.

If this definitely isn't working, we can take a look, but I am not sure if I can log into your system. I did have a Polaris account some time ago…?


On 23/05/17 15:50, Mark Richardson wrote:

Hi Ros, annette.

Not yet had any joy with trying to put "work" onto the unlimited volatile disks of arc3.

Ros, I tried creating the rose.conf but got confused about where this file should be. These did not have an effect.


Is it in ${HOME}/.metomi, ${CYLC_HOME}/etc, the rose suite's rose-suite.conf, or another place entirely?

I read a lot of the CMS online info and wondered if I need to work with the "ross" directories as well?

The ideal configuration will work with path set to read ~earumsh/software (equivalent to umshared on archer, I think) and then work with the case in


but keep logs and small files in


I tried to follow the working directory in the cylc python source and thought it was CYLC_SUITE_WORK_DIR or something from job_conf[].



Dr. Mark Richardson
Technical Head of CEMAC (
Room 10.115 School of Earth and Environment
University of Leeds

comment:13 Changed 3 months ago by annette

Hi Mark,

Just to double check are you definitely submitting a clean run, i.e. deleting the ${HOME}/cylc-run/arc3_leeds_check directory on ARC, or running with rose suite-run --new? If the cylc-run directories are already in the wrong place it won't recreate them.

Just re-reading your email thread with Ros… which option did you decide to go with?

  1. Put the whole cylc-run/<suite-id>/ directory on the fast disk (/nobackup) or
  2. Just put the cylc-run/<suite-id>/share/ and cylc-run/<suite-id>/work/ directories on the fast disk.

Option 1. is what we do on Archer and option 2. is what we do on Monsoon so I should have thought we could get it to work.

In the suite.rc file you have the line:

work sub-directory = $DATADIR

which I guess is what is creating that directory cylc-run/<suite-id>/work/nobackup/, but I am wondering if this is what you meant to do? The work sub-directory directive is just a way of creating a shared work space for multiple tasks to use (otherwise each task runs from its own <taskname> directory).
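For reference, option 2 is typically configured with root-dir settings in the user rose.conf on PUMA; a sketch, assuming an arc3 host pattern and $DATADIR as the fast-disk root (the exact host-glob syntax should be checked against the Rose documentation):

```ini
[rose-suite-run]
root-dir{share}=arc3*=$DATADIR
root-dir{work}=arc3*=$DATADIR
```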


comment:14 Changed 3 months ago by markr

Hi Annette, Ros,
progress so far: I can now run and get "work" and "share" in the /nobackup Lustre file system.

I am sorry if the evidence on puma is confusing. The suite is a bit volatile while I tried some other things.

I wanted a variation of option 2 i.e. the case should run on the lustre file system

/nobackup/earmgr/cylc-run/caseID/work/ etc..

The build should be on the non-Lustre disk (for speed of compilation, as optimisation involves creating and deleting many small files - a bad thing for Lustre).

${HOME}/cylc-run/caseID/share/fcm_make etc

However, it looks like the prefixes for those directories have the common HOME hard-wired in


So I will go with option 2 for now.

Odd that work and share still appear in the cylc-run home.

Next I have to modify a UM suite to run on the arc3 system.

Thank you for the guidance so far…

comment:15 Changed 3 months ago by annette

Hi Mark,

Yes share/ and work/ will still be under the cylc-run directory. However there is a way to get the build to use /home (or whatever) for compilation, even though the share/ and work/ directories are on /nobackup. We do this on Monsoon, and have tested on Archer but it didn't really help performance for us. I can't remember the details off-hand but I think Ros knows (she is out today), or I will dig out the info for you.


comment:16 Changed 3 months ago by markr

Hi all,
I followed a lot of the guidance on GCOM and now I find the rose stem for that fails:

 markr@puma 01 $ more job.err
[FAIL] config-file=/home/markr/cylc-run/vn6.2_arc3_leeds/work/1/fcm_make_arc3_mpp/fcm-make.cfg:2
[FAIL] config-file= - puma:/home/markr/DevWork/GCOM/vn6.2_arc3_leeds/fcm-make/gcom.cfg
[FAIL] puma:/home/markr/DevWork/GCOM/vn6.2_arc3_leeds/fcm-make/gcom.cfg: cannot load config file
[FAIL] puma:/home/markr/DevWork/GCOM/vn6.2_arc3_leeds/fcm-make/gcom.cfg: cannot be read
[FAIL] Host key verification failed.

You can see the evidence in ~/cylc-run/vn6.2_arc3_leeds

I realise this is the 2-stage extract-mirror-build, but I have not really changed much of the site/uoleeds/suite.rc other than renaming "archer" to "arc3" and, in places, uoleeds_arc3.
The latter because I hijacked the uoe_emps_intel_mpp machine file.


comment:17 Changed 3 months ago by markr

Can you help me understand why my basic rose job arc3_leeds is failing to submit, with this message in SGE:

error reason          1:      05/26/2017 14:43:16 [256785:104944]: error: can't open output file "/home/ufaserv1_c/earmgr/cylc-run/arc3_leeds_check/log.20170526T130559Z/job/1/initialise/02/cylc-run/arc3_leeds_check/log/job/1/initialise/02/job.out": No such file or directory

it looks like the log path is being concatenated twice.

comment:18 Changed 3 months ago by annette


You confirmed above that you wanted just the share and work sub-directories on your fast disk ($DATADIR). And this is what you have specified in your rose.conf file.

Given this you do not need the following line in your suite.rc file:

initial scripting = "export HOME=$DATADIR"

This line is only needed when you are putting the whole cylc-run directory on $DATADIR, which is not what you are doing here.

Remove the line and try again.


comment:19 Changed 3 months ago by markr

Hi Annette,
I no longer have the DATADIR override. Only for work and share.
This error (quoted below) is about the make of GCOM from a branch I made: vn6.2_arc3_leeds.
Must I manually delete things to trigger a fresh build attempt?

The previous error (not finding gcom.cfg) has not yet been solved - but I need to try it again.

markr@puma rose-stem $ rose stem --group=arc3_intel_build
[INFO] Source tree /home/markr/DevWork/GCOM/vn6.2_arc3_leeds added as branch
[INFO] Will run suite from /home/markr/DevWork/GCOM/vn6.2_arc3_leeds/rose-stem
[FAIL] Suite "vn6.2_arc3_leeds" has running processes on:
[FAIL] Try "rose suite-shutdown --name=vn6.2_arc3_leeds" first?
markr@puma rose-stem $ rose suite-shutdown --name=vn6.2_arc3_leeds
Really shutdown vn6.2_arc3_leeds at [y or n (default)] y
security reasons
[FAIL] cylc shutdown vn6.2_arc3_leeds --force # return-code=1
markr@puma rose-stem $

comment:20 Changed 3 months ago by annette

Hi Mark,

So does your basic test suite work OK now?

To force shutdown of a suite you sometimes have to kill rogue processes - try following the instructions here:

In reference to your gcom.cfg error in comment:16, can you try logging into puma from the command-line with a simple:

ssh puma

i) check this works, and ii) it may prompt you to add puma to your known hosts, which it can't do non-interactively.


comment:21 Changed 2 months ago by markr

The u-am554 case gets some way (all?) through fcm_make2 but then fails.
There is no clear reason, as the build log seems to complete. However, the build dir is empty on ARC.
It looks like the preprocessed source is in place.

comment:22 Changed 2 months ago by markr

I cannot find any fail log or message other than the colour of the suite in gcylc.
On arc3:


Similarly on puma:


comment:23 Changed 2 months ago by grenville


I note that u-am554 uses ncas-xc30-cce for its config (platform_config_dir) - this is specific to ARCHER. This may not be the cause of the lack of output (probably not), but you won't manage a build with this config. You'll need to add one specific to ARC3.


comment:24 Changed 2 months ago by markr

I am now reviewing the notes from Annette and see I have skipped steps 3 and 4 i.e. build GCOM and then configure the UM for uoleeds site with a UM branch.
I have a GCOM branch and was part way through that.
I am now back to building GCOM.

Must walk before I can run. Step-by-step.


comment:25 Changed 2 months ago by markr

Now I am back at this failure, which distracted me into investigating whether the rose-cylc task on the remote was working.
However, I believe at this stage the tasks are running on the suite host (local to puma).

[FAIL] config-file=/home/markr/cylc-run/vn6.2_arc3_leeds/work/1/fcm_make_arc3_intel_serial/fcm-make.cfg:2
[FAIL] config-file= - puma:/home/markr/DevWork/GCOM/vn6.2_arc3_leeds/fcm-make/gcom.cfg
[FAIL] puma:/home/markr/DevWork/GCOM/vn6.2_arc3_leeds/fcm-make/gcom.cfg: cannot load config file
[FAIL] puma:/home/markr/DevWork/GCOM/vn6.2_arc3_leeds/fcm-make/gcom.cfg: cannot be read
[FAIL] Host key verification failed.

[FAIL] fcm make -f /home/markr/cylc-run/vn6.2_arc3_leeds/work/1/fcm_make_arc3_intel_serial/fcm-make.cfg -C /home/markr/cylc-run/vn6.2_arc3_leeds/share/uoleeds_arc3_ifort_serial -j 4 mirror.prop{}=2 # return-code=255
Received signal ERR
cylc (scheduler - 2017-06-06T10:35:21+01): CRITICAL Task job script received signal ERR at 2017-06-06T10:35:21+01
cylc (scheduler - 2017-06-06T10:35:21+01): CRITICAL failed at 2017-06-06T10:35:21+01

What task needs to see that cfg file and where is it running?

comment:26 follow-up: Changed 2 months ago by annette


Did you ssh into puma from puma as I suggested in comment:20? Please confirm whether this gets you any further.


Changed 2 months ago by markr

GCylc image of progress of GCOM rose stem --group=arc3_intel_test

comment:27 Changed 2 months ago by markr

Some progress, but now the GCOM tests fail to find gcom.exe.
Also there are no "build" dirs in the expected directories.

comment:28 in reply to: ↑ 26 Changed 2 months ago by markr

Replying to annette:


Did you ssh into puma from puma as I suggested in comment:20? Please confirm whether this gets you any further.


Okay, done that now. I had to answer yes.
Also moving on to the GCOM build.
Not confident of the site/uoleeds_arc3/suite.rc changes.

comment:29 Changed 2 months ago by annette


From your logs, I don't think the builds have actually done anything, they have all completed suspiciously quickly. And you say there are no build directories in the expected places.

I will look at your changes and see if I can spot anything.


comment:30 Changed 2 months ago by annette

Hi Mark,

Looking at some of the files in your cylc-run, I don't think the mirror is working correctly.

Has the code been copied over to arc3? i.e. on arc3, do you have a directory like:


And if you look in there can you see the gcom code?

My hypothesis is that you don't have this…

And I think that you need to add these lines to your fcm-make cfg files (uoleeds_arc3_ifort_openmpi.cfg etc): = ${ROSE_TASK_MIRROR_TARGET}
mirror.prop{config-file.steps} = $REMOTE_ACTION

Sites might not have these if they are not doing a mirror (because they are submitting suites from the same system so don't need to copy the code).

Also I would just test out one thing at a time, otherwise it can be hard to see what is going on. So maybe just the intel build:

rose stem --group=arc3_intel_build

Then check that it has actually copied the code over (look in the preprocess directory above) and built something (look in build/lib), before running the tests.
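Those two checks can be scripted; a sketch for ARC3, with the share path taken from later comments in this ticket and the library name (libgcom.a) assumed rather than confirmed:

```shell
# Verify the mirror copied the source over and the build produced a library.
SHARE=/nobackup/$USER/cylc-run/vn6.2_arc3_leeds/share/uoleeds_arc3_ifort_openmpi
test -d "$SHARE/extract/gcom"        && echo "extract OK"
test -s "$SHARE/build/lib/libgcom.a" && echo "build OK"
```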


comment:31 Changed 2 months ago by markr

Hi Annette,
I appreciate the difficulty of helping "blind". There is no preprocess directory. BTW I copied uoe_emps_ifort_openmp.cfg for ARC3. I will have to look a bit closer there too.

This is on arc3:

[earmgr@login1.arc3 share]$ ls -ltr /nobackup/earmgr/cylc-run/vn6.2_arc3_leeds/share/uoleeds_arc3_ifort_openmpi/
total 12
drwxr-xr-x 3 earmgr EAR 4096 Jun  6 11:07 extract
-rw-r--r-- 1 earmgr EAR   22 Jun  6 11:39 fcm-make2.cfg
-rw-r--r-- 1 earmgr EAR 1014 Jun  6 11:39 fcm-make2.cfg.orig
lrwxrwxrwx 1 earmgr EAR   32 Jun  6 11:39 fcm-make2-on-success.cfg -> .fcm-make2/config-on-success.cfg
lrwxrwxrwx 1 earmgr EAR   14 Jun  6 11:39 fcm-make2.log -> .fcm-make2/log
lrwxrwxrwx 1 earmgr EAR   31 Jun  6 11:39 fcm-make2-as-parsed.cfg -> .fcm-make2/config-as-parsed.cfg

comment:32 Changed 2 months ago by annette

Hi Mark,

Looking in rose-stem for uoe, emps is a 1-step build. Please try adding in those mirror lines and retry.


comment:33 Changed 2 months ago by markr

More progress: now it is failing to build, with an mpicc linking error.
See the openmpi build log.

Got to go to a meeting now.
More later.

Can I exercise the build within the directories on arc3 and see if I can get the right environment to diagnose the failure?

Some sort of command like "cylc task run build"?

comment:34 Changed 2 months ago by annette

Hi Mark,

You should be able to bypass rose/cylc entirely… Go into the uoleeds_arc3_ifort_openmpi directory, then run:

fcm make -f fcm-make2.cfg

Set any build options in fcm-make2.cfg, and you can see the build commands in fcm-make2.log.

If you don't want to bypass rose/cylc, then go into the log directory and drill down until you get to the job run script for that task. You can submit this manually to the queue or run on the command-line.


comment:35 Changed 2 months ago by markr

The make -f fcm-make2.cfg did not work because I just realised I forgot to use "fcm".

Meanwhile I await this:
I realise the link line has -lmpl (I think an MVAPICH library) so I removed it from the machines files in:

Now running the rose stem again for the GCOM branch.
Will try the fcm make command after this rose stem exits.

comment:36 Changed 2 months ago by markr

Still getting this for the C code gc_abort.c :

[FAIL] /apps/developers/libraries/openmpi/2.0.2/3/intel-17.0.1/bin/mpicc -E -I./include /nobackup/earmgr/cylc-run/vn6.2_arc3_leeds/share/uoleeds_arc3_ifort_openmpi/extract/gcom/gc/gc__abort.c # rc=127
[FAIL] /apps/developers/libraries/openmpi/2.0.2/3/intel-17.0.1/bin/mpicc: error while loading shared libraries: cannot open shared object file: No such file or directory
[FAIL] process    0.0 ! gcom/gc/gc__abort.c  <- gcom/gc/gc__abort.c
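A quick way to see which shared object mpicc cannot resolve is ldd; a sketch, where the mpicc path comes from the error message above but the module names are assumptions that may not match ARC3's module tree exactly:

```shell
# Load the same toolchain the build uses (module names are an assumption),
# then list any shared libraries the compiler wrapper cannot resolve.
module load intel/17.0.1 openmpi/2.0.2
ldd /apps/developers/libraries/openmpi/2.0.2/3/intel-17.0.1/bin/mpicc | grep "not found"
```

Any line printed by the grep names a library that must be supplied via LD_LIBRARY_PATH or a module load in the job environment.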

comment:37 Changed 2 months ago by markr

The queues on ARC3 just got busier as they have turned off ARC1.
So gcylc shows a failed submission where actually the job is still waiting in the queue.
Perhaps there is something about SGE that I have yet to configure.

Meanwhile the build failed to find a library, which is probably related to LD_LIBRARY_PATH.


comment:38 Changed 8 weeks ago by markr

Hi Ros,
I see that you have now got access to arc3 through the remote access gateway.
I find

is a useful site for the ARC systems. There are 24 cores per node, and some are large-memory nodes.
Also there are some Nvidia K80 nodes, which might be fun if we had a GPU version of the UM.

We created a user account for shared access (earumsh) and I have put what I was working with in that directory.
Let me know if it looks okay.
If you supply an ssh key then you could do work there too.

Let me know how you want to proceed: whether to do it all independently or to coordinate with me.
NOTE: by default $HOME is closed even to groups. I have opened up group read access to both earumsh and earmgr.


comment:39 Changed 11 days ago by markr

Now I am about to start working with a branch to configure the UM - I will use vn10.8.
I notice that it now uses GCOM 6.3. I think I will have to go back and repeat the GCOM work in a vn6.3 branch.

gcom branch: vn6.2_arc3_leeds

um branch: vn10.8_uoleeds_arc3_intel_cfg

Note: See TracTickets for help on using tickets.