Opened 3 years ago

Closed 3 years ago

#1919 closed help (fixed)

where do I input my Archer username in a rose suite?

Reported by: mrusso Owned by: annette
Component: Rose Keywords:
Cc: Platform: ARCHER
UM Version: 10.4

Description

Hello,
after successfully running the rose tutorial suite I have now copied a rose suite which is closest to what I'm going to be running. I've been browsing through this new suite (u-ae431) to familiarise myself with it and I find that the location, structure and format of key information is very different from the tutorial job (which is a bit confusing). In particular, I cannot find anywhere in the drop-down menus that allows me to input my ARCHER username (as I did in the tutorial job). When I try to run the job it fails (unsurprisingly), maybe because my ARCHER username is not there or maybe for other reasons?! See error message below.
Any help would be greatly appreciated! Many thanks,
Maria

RosePopenError: bash -ec H=$(rose\host-select\archer);\echo\$H # return-code=1, stderr=
[WARN] login5.archer.ac.uk: (ssh failed)
[WARN] login7.archer.ac.uk: (ssh failed)
…..
[FAIL] No hosts selected

Change History (16)

comment:1 Changed 3 years ago by annette

Hi Maria,

Unfortunately at the moment there is no standard suite design so they can look quite different, which I know is confusing.

Looking at your suite there is a way of getting your ARCHER username in. Open the file rose-suite.conf and add the following line:

HPC_USER='mariar'

(If you reopen the GUI you will see the variable appears under suite.conf → jinja2. You can add a new variable in the GUI as well but I think it's easier just to edit the file.)
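For anyone who prefers to script the edit, here is a minimal sketch. It assumes the file already contains a [jinja2:suite.rc] section (the section Rose uses for Jinja2 variables) and that the working copy lives under ~/roses/u-ae431, which is the usual location but an assumption here. set_jinja2_var is a hypothetical helper name, not a Rose command.

```shell
# Hypothetical helper (not part of Rose): add NAME='VALUE' to the
# [jinja2:suite.rc] section of a rose-suite.conf file, unless NAME is already set.
set_jinja2_var() {
    conf_file="$1"
    var_name="$2"
    var_value="$3"

    # Leave the file alone if the variable is already defined.
    if grep -q "^${var_name}=" "$conf_file"; then
        return 0
    fi

    tmp_file="$(mktemp)"
    # Copy the file, inserting the new line straight after the section header.
    awk -v line="${var_name}='${var_value}'" '
        { print }
        $0 == "[jinja2:suite.rc]" { print line }
    ' "$conf_file" > "$tmp_file" && mv "$tmp_file" "$conf_file"
}

# Example call (the path is an assumption):
# set_jinja2_var ~/roses/u-ae431/rose-suite.conf HPC_USER mariar
```

If the section header is missing, the sketch adds nothing, so check the file by hand afterwards.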

Annette

comment:2 Changed 3 years ago by annette

  • Owner changed from um_support to annette
  • Status changed from new to assigned

comment:3 Changed 3 years ago by mrusso

Hello, sorry I've been away. Just tried this and I can now see my username in the GUI but I still get the same error message as before so something else must be causing the failure to find a host.

Any idea why that would be or where else I could look?
Many thanks,
Maria

comment:4 Changed 3 years ago by annette

Hi Maria,

Are you still using suite u-ae431? Looking at your files I can't see where you've added the HPC_USER variable.

Annette

comment:5 Changed 3 years ago by mrusso

Ah sorry, I had made the change to u-ae837 (which is a similar job).
I've now done the same in u-ae431 and I get exactly the same error.
PS: is there a way to 'diff' the suites in the gui or is the easiest thing to do xxdiff on the suite directories?
Thanks,
Maria

comment:6 Changed 3 years ago by annette

  • Component changed from UM Model to Rose

Hi Maria,

The problem is that the suite uses the command rose host-select archer to find the machine to submit the job to, whereas the tutorial suite didn't. This command won't work by default if your PUMA and ARCHER usernames are different, because ssh tries to log in to ARCHER with your PUMA username. To fix this you need to set your ARCHER username in your ssh config file.

On PUMA open the file ~/.ssh/config and add the following lines to the top of the file:

Host login*.archer.ac.uk
    User mariar 
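
The same edit can be sketched as a small shell function, which also avoids stray characters from copy and paste. add_archer_user is a hypothetical helper name, and the sketch assumes nothing else in the file already sets a user for these hosts.

```shell
# Hypothetical helper: put a "Host login*.archer.ac.uk" block at the top of an
# ssh config file, unless such a block is already present.
add_archer_user() {
    config_file="$1"
    archer_user="$2"

    # Do nothing if the Host block already exists.
    if grep -q '^Host login\*\.archer\.ac\.uk' "$config_file" 2>/dev/null; then
        return 0
    fi

    tmp_file="$(mktemp)"
    # Write the new block first so that it ends up at the top of the file.
    printf 'Host login*.archer.ac.uk\n    User %s\n\n' "$archer_user" > "$tmp_file"
    if [ -f "$config_file" ]; then
        cat "$config_file" >> "$tmp_file"
    fi
    mv "$tmp_file" "$config_file"
    # ssh expects the config file to be writable only by its owner.
    chmod 600 "$config_file"
}

# Example call:
# add_archer_user ~/.ssh/config mariar
```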

Then try running the command rose host-select archer on the command line. This should now connect to login.archer.ac.uk but probably fail for the other hosts (don't worry about this). You should then be able to submit your suite.

If not let me know.

Annette

comment:7 Changed 3 years ago by mrusso

Hi Annette,
I've tried what you suggested but I get the same error on the command line (see below):

mrusso@puma:~> rose host-select archer
[WARN] login1.archer.ac.uk: (ssh failed)
[WARN] login.archer.ac.uk: (ssh failed)
[WARN] login2.archer.ac.uk: (ssh failed)
[WARN] login8.archer.ac.uk: (ssh failed)
[WARN] login4.archer.ac.uk: (ssh failed)
[WARN] login5.archer.ac.uk: (ssh failed)
[WARN] login3.archer.ac.uk: (ssh failed)
[WARN] login7.archer.ac.uk: (ssh failed)
[WARN] login6.archer.ac.uk: (ssh failed)
[FAIL] No hosts selected.

I should add that the ~/.ssh/config file was empty when I first opened it (not sure if it was meant to have something else in it).

Many thanks for your continued help!
Maria

comment:8 Changed 3 years ago by annette

Hi Maria,

Can you log into ARCHER directly without being prompted for a password or passphrase?

ssh mariar@login.archer.ac.uk

If you are prompted for a passphrase you will need to restart your agent by running:

ssh-add

If this gives an error follow the instructions here:
http://cms.ncas.ac.uk/wiki/FAQ_T4_F5

Once you can log in without a password/passphrase, can you log in without your username?

ssh login.archer.ac.uk
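
Both checks can also be run without any prompts, which is closer to how rose host-select uses ssh. This is only a sketch: BatchMode and ConnectTimeout are standard OpenSSH options, while can_login, report_login and the SSH_CMD override are hypothetical names introduced here.

```shell
# Return 0 if HOST accepts a login without any password or passphrase prompt.
# BatchMode=yes makes ssh fail instead of prompting. SSH_CMD is a hypothetical
# override (default: ssh) so the check can be dry-run with another command.
can_login() {
    "${SSH_CMD:-ssh}" -o BatchMode=yes -o ConnectTimeout=10 "$1" true
}

# Print a one-line verdict for HOST.
report_login() {
    if can_login "$1"; then
        echo "OK: $1"
    else
        echo "FAILED: $1"
    fi
}

# Example calls:
# report_login mariar@login.archer.ac.uk
# report_login login.archer.ac.uk
```

If the first call reports OK and the second FAILED, the username in ~/.ssh/config is not being picked up. Note that in batch mode ssh also fails for a host whose key has not been accepted yet.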

Annette

comment:9 Changed 3 years ago by mrusso

Hi Annette!
my ssh agent was configured correctly, however I was reading your comments on email (rather than on the web) and it shows extra markup characters at the top and bottom of each line so I had added those in my config file!!!
I've now got the right lines in the config file and the suite is running past the point where it was crashing before… so fingers crossed it will run OK!
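
A quick way to spot this kind of pasted debris is sketched below. The exact stray characters are not shown in this thread, so the check is generic: every ssh_config keyword starts with a letter, so any other non-comment line is suspicious. find_stray_lines is a hypothetical helper, not part of OpenSSH.

```shell
# Print "LINE_NUMBER: TEXT" for every line of an ssh config file whose first
# word is neither a comment nor something starting with a letter.
find_stray_lines() {
    awk 'NF && $1 !~ /^#/ && $1 !~ /^[A-Za-z]/ { print NR ": " $0 }' "$1"
}

# Example call:
# find_stray_lines ~/.ssh/config
```

No output means no obviously stray lines were found. It does not prove the file is otherwise valid.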

Many many thanks again,
Maria

comment:10 Changed 3 years ago by annette

Hi Maria,

Great, glad it's working now.

Can you try something out for me please? This should finish the setup for rose host-select archer.

On PUMA, run:

~um/um-training/setup-archer-hosts

This logs into each of the ARCHER login nodes so that these are added to your .ssh/known_hosts file. This is required for rose host-select archer to access login1, login2 etc, which is useful if some of the ARCHER login nodes are down as it should always find an active host.
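
To illustrate the idea only (this is not the contents of setup-archer-hosts, which are not shown here), a loop like the following connects to each login node in turn. The host names are the ones rose host-select reported earlier in this ticket. archer_hosts, connect_all and SSH_CMD are hypothetical names.

```shell
# List the ARCHER login hosts seen in this ticket: login and login1..login8.
archer_hosts() {
    echo "login.archer.ac.uk"
    i=1
    while [ "$i" -le 8 ]; do
        echo "login${i}.archer.ac.uk"
        i=$((i + 1))
    done
}

# Try each host once. On first contact ssh asks whether to accept the host key,
# and accepting it is what adds the host to ~/.ssh/known_hosts.
# SSH_CMD is a hypothetical override (default: ssh) for dry runs.
connect_all() {
    for host in $(archer_hosts); do
        if "${SSH_CMD:-ssh}" -o ConnectTimeout=10 "$host" true; then
            echo "reachable: $host"
        else
            echo "unreachable: $host"
        fi
    done
}

# Example call:
# connect_all
```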

Annette

comment:11 Changed 3 years ago by mrusso

Hi Annette,
it seems to work, this is what I get:
mrusso@puma:~> ~um/um-training/setup-archer-hosts
Connecting to ARCHER hosts…
Connected to login1.archer.ac.uk
Connected to login2.archer.ac.uk
Connected to login3.archer.ac.uk
Connected to login4.archer.ac.uk
Connected to login5.archer.ac.uk
Connected to login6.archer.ac.uk
Connected to login7.archer.ac.uk
Failed to connect to login8.archer.ac.uk
Connected to login.archer.ac.uk

…I assume login8 node must be down.

How do I incorporate that so that it is used when I run a rose suite?
Thanks,
Maria

comment:12 Changed 3 years ago by annette

Maria,

Thanks for testing that for me. It's just a one-off step you need to run outside your suites. So you should now be set up to access any of the login nodes in your suites (except login8). You could try running the script again another time to add login8.

Annette

comment:13 Changed 3 years ago by mrusso

Hi Annette,
my job now fails at the compilation stage. The job.err file (see below) seems to suggest it cannot access a compiler?! I wonder if that's something you've seen before!
Many thanks again,
Maria

/etc/bash.bashrc.local: line 72: PROMPT_COMMAND: readonly variable
/etc/bash.bashrc.local: line 74: HISTCONTROL: readonly variable
/etc/bash.bashrc.local: line 76: HISTSIZE: readonly variable
/etc/bash.bashrc.local: line 72: PROMPT_COMMAND: readonly variable
/etc/bash.bashrc.local: line 74: HISTCONTROL: readonly variable
/etc/bash.bashrc.local: line 76: HISTSIZE: readonly variable
[FAIL] ftn -oo/ukca_main1_mod.o -c -I./include -s default64 -e m -J ./include -I/work/n02/n02/hum/gcom/cce8.4.1/gcom5.4/archer_xc30_cce_mpp/build/include -O2 -Ovector1 -hfp0 -hflex_mp=strict -h omp /work/n02/n02/mariar/cylc-run/u-ae431/share/fcm_make_um/preprocess-atmos/src/um/src/atmosphere/UKCA/ukca_main1-ukca_main1.F90 # rc=1
[FAIL] ftn-2136 crayftn: ERROR in command line
[FAIL] Unable to obtain a Cray Compiling Environment License.
[FAIL] compile 87.2 ! ukca_main1_mod.o ← um/src/atmosphere/UKCA/ukca_main1-ukca_main1.F90
[FAIL] ftn -oo/ni_conv_ctl.o -c -I./include -s default64 -e m -J ./include -I/work/n02/n02/hum/gcom/cce8.4.1/gcom5.4/archer_xc30_cce_mpp/build/include -O2 -Ovector1 -hfp0 -hflex_mp=strict -h omp /work/n02/n02/mariar/cylc-run/u-ae431/share/fcm_make_um/preprocess-atmos/src/um/src/atmosphere/convection/ni_conv_ctl.F90 # rc=1
[FAIL] ftn-2136 crayftn: ERROR in command line
[FAIL] Unable to obtain a Cray Compiling Environment License.
[FAIL] compile 60.2 ! ni_conv_ctl.o ← um/src/atmosphere/convection/ni_conv_ctl.F90
[FAIL] ! UKCA_MAIN1_MOD.mod : depends on failed target: ukca_main1_mod.o
[FAIL] ! ni_conv_ctl.o : update task failed
[FAIL] ! ukca_main1_mod.o : update task failed

[FAIL] fcm make -C /work/n02/n02/mariar/cylc-run/u-ae431/share/fcm_make_um -n 2 -j 6 # return-code=2
Received signal ERR
cylc (scheduler - 2016-07-26T09:14:19Z): CRITICAL Task job script received signal ERR at 2016-07-26T09:14:19Z
cylc (scheduler - 2016-07-26T09:14:19Z): CRITICAL fcm_make2_um.19880901T0000Z failed at 2016-07-26T09:14:19Z

comment:14 Changed 3 years ago by annette

Hi Maria,

This looks like it might just be a temporary ARCHER issue. Try re-submitting the build. As your suite is still active, you can do this from the GUI by re-triggering the task.

If you don't have the GUI still up, you can relaunch it by running:

rose suite-gcontrol --name=u-ae431

Then right-click on the fcm_make2_um task and select the option that is something like "trigger run now".

If you see this again, we should report to ARCHER.

Annette

comment:15 Changed 3 years ago by mrusso

thanks! It has compiled ok now.
Maria

comment:16 Changed 3 years ago by annette

  • Resolution set to fixed
  • Status changed from assigned to closed