Opened 7 weeks ago

Last modified 6 days ago

#2140 pending help

Request collaboration to install latest UM (vn10.3-vn10.8, fcm, rose and cylc) at UoLeeds

Reported by: markr Owned by: um_support
Priority: normal Component: UM Model
Keywords: install fcm, rose, cylc Cc:
Platform: Other UM Version: 10.7

Description

Hello CMS,
I would like to know if you can advise on a sensible way to install the latest form of the UM on ARC3. Initially we had considered working with Polaris, but I have been advised that the future of that system is uncertain. ARC3, at least, is a Leeds-specific HPC and is expected to be available over the next 5 years.

I spoke with a senior member of ARC at Leeds and they would like us to provide them with a list of foundation software and overview of the steps required to get the service running.

Currently the demand is only from 2 researchers but when it is available I can see it becoming more popular.

Many thanks,
Mark

Change History (11)

comment:1 Changed 7 weeks ago by grenville

Mark

I think we should have a meeting to discuss this. I'm sure it's doable with a small foundation software requirement, but doing this by email won't be efficient.

Rose stem makes later versions simpler to install - that certainly was the case for UM 10.7.

How are you set for some time in the week of April 24th?

Grenville

comment:2 Changed 7 weeks ago by markr

Hi Grenville,
okay, that week has 3 days possible: 24, 26, 28. Or the afternoon of 25th for a short meeting.

Mark

comment:3 Changed 3 weeks ago by markr

Hello CMS,
after our telecon I had a look at where the queueing system is defined. In the Rose jobs that I use, the "suite.rc" sets [job submission] method = pbs, and there is a corresponding Python script, pbs.py, in the cylc installation directory: /home/fcm/cylc-6.11.4/lib/cylc/batch_sys_handlers/

So I presume it is a matter of converting the suite.rc PBS settings to their SGE equivalents.
I will provide a sample SGE configuration for the ARC3 system.
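The conversion described above can be sketched as a suite.rc fragment (cylc 6.x syntax). The task name and runtime limit here are illustrative assumptions, not values from a real ARC3 suite:

```
# Hypothetical suite.rc fragment showing a PBS-to-SGE conversion.
[runtime]
    [[atmos]]                        # illustrative task name
        [[[job]]]
            # was: batch system = pbs
            batch system = sge
        [[[directives]]]
            # PBS used: -l walltime = 00:10:00
            # SGE expects h_rt instead:
            -l h_rt = 00:10:00
```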

comment:4 Changed 3 weeks ago by ros

Hi Mark,

I was going to send my example polaris suite which has this all in, but got diverted.

You need to set, for example:

[[[job]]]
  batch system = sge
[[[directives]]]
  -l h_rt = 00:01:00

There was a small bug in the cylc code for SGE at cylc-6.x which I have fixed on PUMA.

Cheers,
Ros.

comment:5 Changed 3 weeks ago by markr

Hello CMS,
a little more on the account setup at Leeds, from Martin Callaghan:

Hi Mark,

The shared accounts are actually project accounts which will be owned by you, 
and the 'ear' identifier is school specific. Within reason, you can have 
anything you like after the 'ear' bit.

If you use an existing project account, we can get this set up on ARC3 very quickly. 
If you want a new one, then it's a (paper) form to fill in and get it countersigned 
by Richard Rigby. I have a small supply of these paper forms.


Regards
Martin

So I think you will have to continue using *earhum*

If you have a Polaris UM suite it would be nice to compare it to an ARCHER/Monsoon equivalent, as for my work I will be converting suites from the UKCA team (Mohit Dalvi).

For Juliane's project I would likely have to convert Met Office internal suites for ARC3 use (i.e. PBS to SGE, plus data paths for ARC3).

Regards,
Mark

Last edited 3 weeks ago by markr

comment:6 Changed 12 days ago by ros

  • Status changed from new to pending

comment:7 Changed 7 days ago by markr

Progress to date:

  1. ARC have enabled ssh access to arc3.leeds.ac.uk from puma.nerc.ac.uk (using the IP address).
  2. I have transferred fcm, rose and cylc from the ARCHER umshared software, but found some broken links, e.g.:

lrwxrwxrwx 1 earmgr EAR 38 Dec 12 13:55 keyword.cfg -> ../../../fcm_admin/etc/fcm/keyword.cfg

Do I need an fcm_admin folder?

  3. Some folders on ARCHER are very large and I do not want to copy 7TB of files indiscriminately.
  4. The .cylc/global.rc on arc3 appears to "work". I did it first on arc3, then realised I should do it on puma.

cylc get-site-config

stops at [batch systems?]

I am still not sure where to set the batch submission method to "sge".

The work continues…

comment:8 Changed 7 days ago by markr

Have now tried to run the "jasmin test suite", see ~markr/roses/arc3_leeds_check.

It fails like this:
markr@puma arc3_leeds_check $ rose suite-run
[INFO] create: /home/markr/cylc-run/arc3_leeds_check
[INFO] create: log.20170516T112551Z
[INFO] symlink: log.20170516T112551Z <= log
[INFO] create: log/suite
[INFO] create: log/rose-conf
[INFO] symlink: rose-conf/20170516T122551-run.conf <= log/rose-suite-run.conf
[INFO] symlink: rose-conf/20170516T122551-run.version <= log/rose-suite-run.version
[INFO] create: share
[INFO] create: share/cycle
[INFO] create: work
[INFO] export CYLC_VERSION=6.11.4
[INFO] export ROSE_ORIG_HOST=puma
[INFO] export ROSE_VERSION=2016.11.1
[INFO] install: suite.rc~
[INFO] source: /home/markr/roses/arc3_leeds_check/suite.rc~
[INFO] install: suite.rc
[INFO] 0 suite(s) unregistered.
[INFO] REGISTER arc3_leeds_check: /home/markr/cylc-run/arc3_leeds_check
[INFO] symlink: /home/markr/cylc-run/arc3_leeds_check <= /home/markr/.cylc/arc3_leeds_check
[FAIL] ssh -oBatchMode=yes earmgr@… bash --login -c \'ROSE_VERSION=2016.11.1\ rose\ suite-run\ -v\ -v\ --name=arc3_leeds_check\ --run=run\ --remote=uuid=e93dff9b-53df-485f-9eab-d5ac4ccf1d4a\' # return-code=255, stderr=
[FAIL] Host key verification failed.

comment:9 Changed 7 days ago by ros

Hi Mark,

  1. You will only really need the fcm keyword.cfg file if you are allowing code checkouts directly on the ARC3 system. However, I would recommend creating the fcm_admin/etc/fcm folder as on ARCHER, but with a blank keyword.cfg file; then, as and when needed, you can populate it with any required repository keywords.
  2. The batch submission method is set in an individual suite's or rose-stem suite's suite.rc file under:
[[[job]]]
  batch system = sge
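For illustration, a keyword.cfg populated as suggested above might eventually contain entries like the following. The project name and repository URL are hypothetical, not real PUMA/ARC3 settings:

```
# Hypothetical fcm keyword.cfg entry mapping a project keyword
# to its repository location (FCM 2 syntax).
location{primary}[um] = svn://puma.nerc.ac.uk/UM_svn/UM
```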

Hope that helps
Cheers,
Ros.

comment:10 Changed 7 days ago by markr

NOTE ssh -Y earmgr@… works passwordless

Then I fixed the suite.rc to use markr as the owner, and it now fails as:
[INFO] install: suite.rc
[FAIL] ssh -oBatchMode=yes arc3.leeds.ac.uk bash --login -c \'ROSE_VERSION=2016.11.1\ rose\ suite-run\ -v\ -v\ --name=arc3_leeds_check\ --run=run\ --remote=uuid=b268bc47-3c26-48b7-9b95-5a32e43fea6b\' # return-code=255, stderr=
[FAIL] Warning: Permanently added the ECDSA host key for IP address '129.11.26.153' to the list of known hosts.
[FAIL] Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password,hostbased).
markr@puma arc3_leeds_check

comment:11 Changed 6 days ago by markr

The job worked in background mode (hoorah!) after I changed the owner back to earmgr (the arc3 identity).
Then I activated SGE as the batch method.

The submission failed, so I looked at submitting the job directly, as no error or stdout logfiles had been created:

[earmgr@login2.arc3 01]$ ls
job  job.status
[earmgr@login2.arc3 01]$ qsub job
Unable to run job: must specify a value for h_rt (job runtime)
Exiting.

So I need to revise the suite.rc, or the appropriate Rose component, to set up the minimum settings for SGE to work with this basic case.
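The qsub error above suggests the missing minimum setting is an h_rt directive. A minimal sketch of the relevant suite.rc section (the task name and runtime value are assumptions) would be:

```
[runtime]
    [[arc3_test]]                    # hypothetical task name
        [[[job]]]
            batch system = sge
        [[[directives]]]
            -l h_rt = 00:05:00       # SGE rejects jobs without a runtime limit
```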
