Opened 2 months ago

Closed 6 weeks ago

#3256 closed help (fixed)

No hosts selected - MONSOON2

Reported by: mvguarino
Owned by: ros
Component: Coupled model
Keywords: ANTS, Port from Met Office
Cc:
Platform: Monsoon2
UM Version:

Description

Hello,

I am trying to run my first simulation on MONSOON2 (suite u-bt694, username mvguar). When I launch the suite I get the following error:

[FAIL] bash -ec H=$(rose\ host-select\ xcs);\ echo\ $H # return-code=1, stderr=
[FAIL] [WARN] xcslr1: (timed out)
[FAIL] [WARN] xcslr0: (timed out)
[FAIL] [FAIL] No hosts selected.

Under Machine Options I selected 'xcs'.

The parent suite was run on the internal MO network.

Thanks!

Vittoria

Change History (17)

comment:1 Changed 2 months ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Vittoria,

This suite is only set up to run on the internal MO machines. xcslr0 and xcslr1 are the research zone (internal) machines.

It's not usually difficult to change. I'll let you know what needs to be changed. I will also check if this suite has already been run on Monsoon.
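
Before relaunching, it can be useful to check which login nodes are actually reachable from where the suite is started (running `rose host-select xcs` by hand, as in the error above, is another option). The sketch below is illustrative only: `check_hosts` is a hypothetical helper, the 5-second timeout is an arbitrary choice, and it assumes passwordless ssh is already configured.

```shell
# check_hosts HOST...: report which hosts accept a non-interactive ssh login.
# Hypothetical helper; the timeout value is an assumption.
check_hosts() {
    local host
    for host in "$@"; do
        # BatchMode stops ssh from prompting; ConnectTimeout bounds the wait.
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "$host: ok"
        else
            echo "$host: unreachable"
        fi
    done
}
```

For example, `check_hosts xcslr0 xcslr1` would be expected to report both hosts as unreachable from outside the internal MO network.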

Cheers
Ros

comment:2 Changed 2 months ago by mvguarino

Hi Ros,

Okay, thank you!
I am running a copy of the CMIP6 Hist run but with a different source code, as this will be a sensitivity test.

Perhaps in the meantime I can seek advice about another small thing:
Although my default project account is nexcs-n02, the jmmp project account will be charged for these runs. I have been added to the project and changed the project account accordingly in rose-edit. However, how do I make sure that data (outputs and/or cylc directory) will be directed to /jmmp/mvguar?

Vittoria

comment:3 Changed 2 months ago by ros

Hi Vittoria,

To direct the output to the correct project workspace, add the following lines to the top of the rose-suite.conf file (or replace the existing ones):

root-dir{share}=*=/projects/jmmp/$USER
root-dir{work}=*=/projects/jmmp/$USER
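
To confirm the settings are present after editing, something like the following could be run from the suite directory. This is a minimal sketch: `show_root_dirs` is a hypothetical helper, and it only reports what the file contains, not where Rose will actually write.

```shell
# show_root_dirs FILE: print the root-dir{share} and root-dir{work} lines
# found in a rose-suite.conf file. Hypothetical helper.
show_root_dirs() {
    grep -E '^root-dir\{(share|work)\}=' "$1"
}
```

Running `show_root_dirs rose-suite.conf` should print exactly two lines; more than two suggests an older setting was left in place.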

Cheers,
Ros.

comment:4 Changed 2 months ago by ros

Hi Vittoria,

Copy the following two files into the site directory of your suite:

  • /home/d04/rhatcher/roses/u-bt694-mvguar/site/monsoon.rc
  • /home/d04/rhatcher/roses/u-bt694-mvguar/site/monsoon_restart.rc

These are basically copies of the corresponding meto_cray files with a few modifications for Monsoon.

In the rose-suite.conf file set:

  • ACCOUNT_USR='jmmp'
  • SITE='monsoon'

In the suite.rc file add monsoon to the list of sites for a single FCMUM:

  • {% set SINGLE_FCMUM = ['meto_cray','monsoon'] %}

That should at least get all the executables built ok.
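
The two rose-suite.conf changes above could also be scripted. This is only a sketch: `set_conf_var` is a hypothetical helper, it assumes each variable already appears as a `KEY=value` line in the file, and it assumes simple values that contain no `|` or `&` characters.

```shell
# set_conf_var FILE KEY VALUE: rewrite an existing KEY=... line as KEY='VALUE'.
# Returns 1 and leaves the file untouched if KEY is not present.
set_conf_var() {
    local file="$1" key="$2" value="$3" tmp status
    if ! grep -q "^${key}=" "$file"; then
        echo "warning: ${key} not found in ${file}; edit it by hand" >&2
        return 1
    fi
    tmp=$(mktemp) || return 1
    # Write to a temporary file first, then copy back over the original.
    sed "s|^${key}=.*|${key}='${value}'|" "$file" > "$tmp" && cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return $status
}
```

For example, `set_conf_var rose-suite.conf ACCOUNT_USR jmmp` followed by `set_conf_var rose-suite.conf SITE monsoon`.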

Cheers,
Ros.

Last edited 2 months ago by ros

comment:5 Changed 2 months ago by mvguarino

Hi Ros,

Yes, thanks! That all worked. I have managed to launch the suite now, but the retrieve_ozone task is failing with this error:

select command-id=926360479 failed: (SSC_TASK_REJECTION) one or more tasks are rejected.
  moose:/crum/u-bf801/ap4.pp -> /home/d05/mvguar/cylc-run/u-bt694/share/ozone_redistribution/2NDARYa.p41949.pp: (TSSC_QUERY_MATCHES_NO_RESULTS) no file atoms are matched by query text file.
select: failed (2)

It seems I don't have the right query file. This is probably something I have to ask the MO about, but let me know if you know how to deal with the issue instead.

Vittoria

comment:6 Changed 2 months ago by mvguarino

I have figured it out: the suite was pointing to the wrong MASS dataset.
I changed u-bf801 (first part of the Historical run) to u-bg466 (second part of the Historical run, from 1875 to 2014), which is what I needed.

However, now redistribute_ozone is failing. I think this is again due to a wrong path, as I am running on MONSOON:

/bin/sh: /data/users/support/ants/_dev/environments/v0.9.0/bin/python: No such file or directory
[FAIL] /data/users/support/ants/_dev/environments/v0.9.0/bin/python $CYLC_SUITE_RUN_DIR/bin/redistribute_ozone.py -t $TROPOPAUSE_INPUT -r $OROGRAPHY_INPUT -d $DENSITY_INPUT -z $OZONE_INPUT -o $OZONE_OUTPUT -y $YEAR <<'__STDIN__'
[FAIL] 
[FAIL] '__STDIN__' # return-code=127
2020-05-06T09:12:23Z CRITICAL - failed/EXIT
/usr/lib64/python2.6/site-packages/requests/packages/urllib3/connection.py:337: SubjectAltNameWarning: Certificate for xcslc1 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning

What should I replace /data/users/support/ants/_dev/environments/v0.9.0/bin/python with?
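
Return code 127 together with "No such file or directory" means the shell could not find the interpreter at that path. A minimal sketch for testing a candidate path before editing the suite (`check_interpreter` is a hypothetical helper, not part of the suite):

```shell
# check_interpreter PATH: succeed only if PATH is an executable regular file.
# Hypothetical helper for checking a Python path before using it in a task.
check_interpreter() {
    if [ -f "$1" ] && [ -x "$1" ]; then
        echo "found: $1"
    else
        echo "missing or not executable: $1"
        return 1
    fi
}
```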

Thanks,

Vittoria

comment:7 Changed 2 months ago by ros

Hi Vittoria,

Glad you managed to sort the MASS problem out.

Sorry, but I've had no luck so far figuring out where that directory is on Monsoon. I suggest you send an email to the Monsoon team (monsoon @ metoffice.gov.uk).

Cheers,
Ros.

comment:8 Changed 2 months ago by mvguarino

I will,

thank you.

Vittoria

comment:9 Changed 2 months ago by ros

Hi Vittoria,

We've found the version used by the nesting suites on Monsoon: /home/d03/hadsw/miniconda2/envs/ants_v0.8.0/bin/python

Cheers,
Ros.

comment:10 Changed 2 months ago by mvguarino

Hi,

Thank you. I have tried it, but it complains about an ImportError:

/home/d03/hadsw/miniconda2/envs/ants_v0.8.0/lib/python2.7/site-packages/iris/fileformats/grib/__init__.py:59: IrisDeprecation: The module iris.fileformats.grib is deprecated since v1.10. Please install the package 'iris_grib' package instead.
  "The module iris.fileformats.grib is deprecated since v1.10. "
Traceback (most recent call last):
  File "/home/d05/mvguar/cylc-run/u-bt694/bin/redistribute_ozone.py", line 23, in <module>
    import ants
  File "/home/d03/hadsw/miniconda2/envs/ants_v0.8.0/lib/python2.7/site-packages/ants/__init__.py", line 5, in <module>
    import analysis
  File "/home/d03/hadsw/miniconda2/envs/ants_v0.8.0/lib/python2.7/site-packages/ants/analysis/__init__.py", line 40, in <module>
    import _merge
  File "/home/d03/hadsw/miniconda2/envs/ants_v0.8.0/lib/python2.7/site-packages/ants/analysis/_merge.py", line 16, in <module>
    from shapely.vectorized import contains
  File "/home/d03/hadsw/miniconda2/envs/ants_v0.8.0/lib/python2.7/site-packages/shapely/vectorized/__init__.py", line 3, in <module>
    from ._vectorized import (contains, touches)
ImportError: /home/d03/hadsw/miniconda2/envs/ants_v0.8.0/lib/python2.7/site-packages/shapely/vectorized/_vectorized.so: undefined symbol: GEOSPreparedTouches_r
[FAIL] /home/d03/hadsw/miniconda2/envs/ants_v0.8.0/bin/python  $CYLC_SUITE_RUN_DIR/bin/redistribute_ozone.py -t $TROPOPAUSE_INPUT -r $OROGRAPHY_INPUT -d $DENSITY_INPUT -z $OZONE_INPUT -o $OZONE_OUTPUT -y $YEAR <<'__STDIN__'
[FAIL] 
[FAIL] '__STDIN__' # return-code=1
2020-05-06T11:39:17Z CRITICAL - failed/EXIT

Vittoria

comment:11 Changed 2 months ago by mvguarino

Hello,

I have emailed Monsoon about this issue but they are being awfully quiet.
Meanwhile I have tried to use another version of ants (v0.11), which I found here: /home/d02/hadzc/miniconda2/envs/ants_v0p11

Assuming that using a different version is not a problem (I would need v0.9.0), this seems to work except for Mule:

File "/home/d02/hadzc/miniconda2/envs/ants_v0p11/lib/python2.7/site-packages/ants/fileformats/ancil/__init__.py", line 35, in <module>
    import mule
ImportError: No module named mule
[FAIL] /home/d02/hadzc/miniconda2/envs/ants_v0p11/bin/python  $CYLC_SUITE_RUN_DIR/bin/redistribute_ozone.py -t $TROPOPAUSE_INPUT -r $OROGRAPHY_INPUT -d $DENSITY_INPUT -z $OZONE_INPUT -o $OZONE_OUTPUT -y $YEAR <<'__STDIN__'
[FAIL] 

Any chance this could be solved in a manner similar to http://cms.ncas.ac.uk/ticket/2824 ?

Thanks,

Vittoria

Last edited 2 months ago by mvguarino

comment:12 Changed 2 months ago by ros

Hi Vittoria,

It's worth trying to load the um_tools environment to see what happens.

Cheers,
Ros.

comment:13 Changed 7 weeks ago by mvguarino

Hi Ros,

An update on the issue I have been having:

I have been in contact with the Met Office Helpdesk; the solution was to install Miniconda, create the ants_0.13 environment and manually install a version of mule.
I thought I would report the required steps here, in case someone encounters the same problem in the future.

1) Download and install the latest version of Miniconda and ANTS, see: https://code.metoffice.gov.uk/doc/ancil/ants/latest/install.html

2) Check out mule:
svn co https://code.metoffice.gov.uk/svn/um/mule/trunk ./mule-trunk

3) Create and activate the right environment:
conda create -n ants_0.13 --override-channels ants -c mo_csitda/label/antsv0.13
conda activate ants_0.13

4) Install mule:

cd ./mule-trunk/mule/
python setup.py install
cd ../um_utils/
python setup.py install
conda install numba llvmlite

5) Add the mo_pack Python package to the newly created ants_0.13 environment; otherwise redistribute_ozone will fail again, as it won't be able to deal with pp files:

conda install -c conda-forge --name ants_0.13 mo_pack
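
Once the environment is built, a quick import test can catch a missing package before the task is resubmitted. This is a sketch, assuming the environment is active so that `python` resolves to it; `check_imports` is a hypothetical helper, and the module names are the ones mentioned in this ticket.

```shell
# check_imports PYTHON MODULE...: try importing each module with the given
# interpreter and report the result. Returns 1 if any import fails.
check_imports() {
    local python="$1" mod status=0
    shift
    for mod in "$@"; do
        if "$python" -c "import $mod" 2>/dev/null; then
            echo "$mod: ok"
        else
            echo "$mod: FAILED"
            status=1
        fi
    done
    return $status
}
```

For example, `check_imports python ants mule mo_pack`. Dropping the `2>/dev/null` shows the full traceback for any module that fails.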

comment:14 Changed 7 weeks ago by mvguarino

While that fixed the environment problem, the task now keeps failing with a connection timed out error similar to the one I originally raised the ticket for:

ssh: connect to host xcslr0 port 22: Connection timed out
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(641) [sender=3.0.4]
2020-05-15T14:30:04Z CRITICAL - failed/EXIT

From job.out it seems to be happening at this stage:

[INFO] copying /home/d05/mvguar/cylc-run/u-bt694/share/ozone_redistribution/mmro3_monthly_CMIP6_1950_N96_bt694-ancil_2anc to xcslr0:~/cylc-run/u-bt694/share/data/etc/ozone/mmro3_monthly_CMIP6_1950_N96_bt694-ancil_2anc

Should it try to copy something to xcslr0? Is there some other setting that must be adapted to run on monsoon that you can think of?

Thank you,

Vittoria

comment:15 Changed 7 weeks ago by ros

Hi Vittoria,

Thanks for posting your solution to the ANTS problem.

xcslr0 is the internal MO machine, so somewhere in your suite it is still being referred to. It looks like it is hard-wired in the site/monsoon.rc file. Try changing this to xcslc0 (or xcs-c).
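
One way to locate any remaining hard-wired references is a recursive search of the suite directory. A minimal sketch: `find_host_refs` is a hypothetical helper, and which directories are worth searching is an assumption.

```shell
# find_host_refs PATTERN PATH...: list file, line number and text of every
# line under the given paths that contains PATTERN. Hypothetical helper.
find_host_refs() {
    local pattern="$1"
    shift
    grep -rnH -- "$pattern" "$@"
}
```

For example, running `find_host_refs xcslr .` from the suite directory lists every file that still mentions an xcslr host.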

Regards,
Ros.

comment:16 Changed 7 weeks ago by mvguarino

xcslc0 worked,

thank you

comment:17 Changed 6 weeks ago by ros

  • Keywords ANTS, Port from Met Office added
  • Resolution set to fixed
  • Status changed from accepted to closed