Opened 3 weeks ago

Closed 4 days ago

#3503 closed help (worksforme)

Running ANTS suite on MONSOON

Reported by: mtodt Owned by: um_support
Component: Monsoon Keywords: ANTS MONSOON
Cc: Platform: Monsoon2
UM Version: <select version>

Description

Hi

I'm trying to run an ANTS suite on MONSOON (u-cd393), which is a copy of a previously working suite, but I get the following message in job.err:

+ scp mtodt@xcslc0:/home/d00/mtodt/cylc-run/u-cd393/log/rose-suite-run.version /home/d00/mtodt/cylc-run/u-cd393/log/job/1/ancilMask/01
Warning: Permanently added 'xcslc0,10.168.14.10' (ECDSA) to the list of known hosts.
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,password,keyboard-interactive,hostbased).
[FAIL] AncilScr_RoseMask # return-code=1
2021-03-29T14:29:07Z CRITICAL - failed/EXIT

When I enter the scp command manually, I get no such message and it works. I am already running the suite on xcslc0, though, so should I run it somewhere else? Or do I need to change something in my .ssh directory? Many thanks in advance!

Best
Markus

Change History (37)

comment:1 Changed 3 weeks ago by pmcguire

Hi Markus:
I looked at ~pamcg/cylc-run/u-bv358/suite.rc.processed, and it says there that the ROSE_ORIG_HOST is xcslc0. So it looks like this is the right host to use.

I am still looking over your problem… If I find anything else, I will let you know.
Patrick

comment:2 Changed 3 weeks ago by mtodt

Hi Patrick

Thanks for having a look at this so quickly!

Best
Markus

comment:3 Changed 3 weeks ago by pmcguire

Hi Markus:
When I try to rerun the original suite ~pamcg/roses/u-bv358 as ~pamcg/roses/u-bv358try2 from xcslc0, the task that was failing for you (ancilMask) succeeds just fine for me. It's running the ancilVegFrac app now.
Patrick

comment:4 Changed 3 weeks ago by mtodt

Hmm, I haven't changed anything, so it might be down to my SSH settings then. Do you have an SSH key for xcslc0 on MONSOON/xcslc0? Is it included in your ~/.ssh/config?

Best
Markus

comment:5 Changed 3 weeks ago by pmcguire

Hi Markus:
In this job.err file: ~pamcg/cylc-run/u-bv358try2/log/job/1/ancilMask/01/job.err, you can see that I don't have an scp error message after this line:
[[ ! -f /home/d01/pamcg/cylc-run/u-bv358try2/log/job/1/ancilMask/01/rose-suite-run.version ]].

I don't have an ssh key on xcslc0 for xcslc0. But it does happen to be in my .ssh/known_hosts file.
I don't have a ~/.ssh/config file.
Patrick

comment:6 Changed 3 weeks ago by mtodt

Hi Patrick

Thanks for the link to your job.err file! The only difference I can see is that you scp from xcslc1 instead of xcslc0. Did you run your job there? If so, I might do the same.

Best
Markus

comment:7 Changed 3 weeks ago by pmcguire

Hi Markus:
No, I ran from xcslc0 just like you did.
Patrick

comment:8 Changed 3 weeks ago by mtodt

Hmm, that is strange then, since our suites are the same.

I've been trying to run the suite from xcslc1 but it failed to build/submit properly because of disk quota:

[FAIL] cylc run u-cd393 # return-code=1, stderr=
[FAIL] 2021-03-29T16:02:17Z ERROR - [Errno 122] Disk quota exceeded
[FAIL] 	Traceback (most recent call last):
[FAIL] 	  File "/common/fcm/cylc-7.8.7/lib/cylc/scheduler.py", line 250, in start
[FAIL] 	    self.configure()
[FAIL] 	  File "/common/fcm/cylc-7.8.7/lib/cylc/scheduler.py", line 393, in configure
[FAIL] 	    self.load_suiterc()
[FAIL] 	  File "/common/fcm/cylc-7.8.7/lib/cylc/scheduler.py", line 1014, in load_suiterc
[FAIL] 	    share_dir=self.suite_share_dir,
[FAIL] 	  File "/common/fcm/cylc-7.8.7/lib/cylc/config.py", line 224, in __init__
[FAIL] 	    self.pcfg = RawSuiteConfig(fpath, output_fname, template_vars)
[FAIL] 	  File "/common/fcm/cylc-7.8.7/lib/cylc/cfgspec/suite.py", line 436, in __init__
[FAIL] 	    self.loadcfg(fpath, "suite definition")
[FAIL] 	  File "/common/fcm/cylc-7.8.7/lib/parsec/config.py", line 73, in loadcfg
[FAIL] 	    sparse = parse(rcfile, self.output_fname, self.tvars)
[FAIL] 	  File "/common/fcm/cylc-7.8.7/lib/parsec/fileparse.py", line 339, in parse
[FAIL] 	    handle.write('\n'.join(flines) + '\n')
[FAIL] 	IOError: [Errno 122] Disk quota exceeded
[FAIL] 2021-03-29T16:02:17Z CRITICAL - Suite shutting down - [Errno 122] Disk quota exceeded

And now I just get

[FAIL] Suite "u-cd393" appears to be running:
[FAIL] Contact info from: "/home/d00/mtodt/cylc-run/u-cd393/.service/contact"
[FAIL] Try "cylc stop 'u-cd393'" first?

but /home/d00/mtodt/cylc-run/u-cd393/.service/contact is empty, and no jobs show up when I do ps -flu mtodt | grep u-cd393. And when I run rose sgc, the window opens but there are no tasks.

Do you know what else I can do?

Best
Markus

Last edited 3 weeks ago by mtodt

comment:9 Changed 3 weeks ago by ros

Hi Markus,

I think this is definitely a disk issue. Looking at your log files from when you got the permission denied, the suite log had a disk I/O problem, which could cause the ssh failure.

I can't see which disk area is currently reporting an exceeded quota.

Try removing the .service/contact file.

We need to figure out which disk has run out of space.

Cheers,
Ros.
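
[A minimal sketch of Ros's suggestion, for reference: remove the stale Cylc contact file so the suite can be restarted, but only after confirming no suite daemon is still running. The paths mirror this ticket, but a temporary directory stands in for ~/cylc-run so the sketch is safe to run anywhere, and the pgrep pattern is an assumption about how the suite daemon would appear in the process list.]

```shell
# Clear a stale Cylc contact file so the suite can be restarted.
SUITE=u-cd393
SERVICE_DIR=$(mktemp -d)/cylc-run/$SUITE/.service
mkdir -p "$SERVICE_DIR"
touch "$SERVICE_DIR/contact"   # stand-in for the leftover contact file

# pgrep exits non-zero when no matching process exists, i.e. the file is stale.
if ! pgrep -fu "${USER:-$(id -un)}" "cylc.*$SUITE" > /dev/null 2>&1; then
    rm -f "$SERVICE_DIR/contact"
    echo "removed stale contact file for $SUITE"
fi
```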

comment:10 Changed 3 weeks ago by pmcguire

Hi Markus:
You probably ran it first on xcslc0, and it failed and went to a stopped state. If you want to try to run it on xcslc1, then you need to first stop it on xcslc0.

But I don't think you need to run it on xcslc1, since it was working for me on xcslc0.
Patrick

comment:11 Changed 3 weeks ago by mtodt

Hi Ros

Thanks a lot for the quick update!

Best
Markus

comment:12 Changed 3 weeks ago by mtodt

Hi Patrick

Thanks, but it happens on both xcslc0 and xcslc1, unfortunately.

Cheers
Markus

comment:13 Changed 3 weeks ago by ros

Hi Markus,

/projects/nexcs-n02 is full

rhatcher@xcs-c$ quota.py -g nexcs-n02 lustre_multi
Disk quotas for group nexcs-n02 (gid 40075):
Filesystem           TB    Quota       %  |      Files      Quota       %
--------------  -------  -------  ------  |  ---------  ---------  ------
/.lustre_multi   230.01   230.00  100.00  |   31639408          0    0.00

Regards,
Ros.

comment:14 Changed 3 weeks ago by pmcguire

Hi Markus:
BTW, have you installed the anaconda version of python with ANTS in it?
in ~pamcg/roses/u-bv358try2/site/MONSOON_ext/variables.rc, there is this setting, which should maybe point to my installation of it, since you don't have that directory:
{%- set ANCIL_ENVIRONMENT_PATH = '~/anaconda3/envs/ants/bin/' %}

You might try changing that in your suite to:
{%- set ANCIL_ENVIRONMENT_PATH = '~pamcg/anaconda3/envs/ants/bin/' %}

If that doesn't work, you can try installing the anaconda version of Python that includes ANTS yourself.

Alternatively, you might wait until I get this suite working on JASMIN, where it uses the singularity container version of ANTS. I don't know how long that will take, though.

Also, how is your $SCRATCH variable set?
In that same file, it points to $SCRATCH, which may or may not be full.
Patrick
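
[Patrick's last point can be checked directly. A minimal sketch, using only standard df/awk and nothing ticket-specific; it falls back to /tmp when $SCRATCH is unset so it runs anywhere.]

```shell
# Report how full the $SCRATCH filesystem is.
SCRATCH_DIR=${SCRATCH:-/tmp}

# df -P gives POSIX single-line output; field 5 of line 2 is the use percentage.
usage=$(df -P "$SCRATCH_DIR" | awk 'NR==2 {gsub("%", "", $5); print $5}')
echo "$SCRATCH_DIR is ${usage}% full"
if [ "$usage" -ge 95 ]; then
    echo "WARNING: $SCRATCH_DIR is nearly full"
fi
```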

comment:15 Changed 3 weeks ago by mtodt

Hi Patrick

Thanks a lot for that advice! I will change the path accordingly.

I had checked my own directories at first, including $SCRATCH, and deleted some files. So, I don't think that's an issue.

Best
Markus

comment:16 Changed 3 weeks ago by ros

Hi Markus,

Can you clear some stuff out of your /projects/nexcs-n02 space and try re-running your job before you start making changes? I thought you said this suite was copied from one you already had running successfully?

Cheers,
Ros.

comment:17 Changed 3 weeks ago by mtodt

Hi Ros

Patrick ran the original suite successfully on MONSOON last summer. I copied it and submitted it without any changes to it.

I've deleted files in my /projects/nexcs-n02/. I still get the

[FAIL] Suite "u-cd393" appears to be running:
[FAIL] Contact info from: "/home/d00/mtodt/cylc-run/u-cd393/.service/contact"
[FAIL] Try "cylc stop 'u-cd393'" first?

error message when trying to run, though, and there's no process listed when I do ps -flu mtodt | grep u-cd393.

Best
Markus

comment:18 Changed 3 weeks ago by ros

Hi Markus,

You need to remove the contact file referenced above before you'll be able to re-run the suite.

Cheers,
Ros.

comment:19 Changed 3 weeks ago by mtodt

Hi Ros

Thanks a lot! That worked, and now I'm back to the original error message:

+ scp mtodt@xcslc1:/home/d00/mtodt/cylc-run/u-cd393/log/rose-suite-run.version /home/d00/mtodt/cylc-run/u-cd393/log/job/1/ancilMask/01
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,password,keyboard-interactive,hostbased).
[FAIL] AncilScr_RoseMask # return-code=1

For whatever reason (I'm running on xcslc0), the scp command now uses xcslc1. I suppose that answers my earlier question about a difference between xcslc0 and xcslc1.

Since Patrick can run the suite mine is copied from, and his job.err file shows that his suite is able to use scp pamcg@xcslc1:..., I assume that it might be an issue with my SSH settings or something like that.

I also changed ANCIL_ENVIRONMENT_PATH as Patrick outlined earlier, but that didn't have an effect.

Best
Markus

comment:20 Changed 3 weeks ago by ros

Hi Markus,

Make sure you can ssh to both xcslc0 and xcslc1 from both nodes with no prompts for input. I (like Patrick) don't have any entries in my ssh config to enable connection between the login nodes.

If that does nothing, it might be worth changing the remote host to localhost, since the machine you are submitting from and to is the same, and asking a suite to ssh from a machine to itself can cause problems. I think you've encountered that before with some of your suites, and it certainly is a problem we see on Monsoon, though not under quite these conditions.

In [[CAP_ENV_SETUP]] and [[BACKGROUND]], set the remote host to localhost:

[[[remote]]]
     host = localhost

It may not help, but it's worth a shot.

Cheers,
Ros.
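
[The first check Ros describes can be scripted as follows; the hostnames are the two login nodes from this ticket, and BatchMode=yes is the standard ssh option for failing fast instead of prompting for a password.]

```shell
# Verify that non-interactive ssh to each login node works without prompts.
results=$(for host in xcslc0 xcslc1; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
        echo "$host: ok"
    else
        echo "$host: non-interactive ssh FAILED"
    fi
done)
echo "$results"
```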

comment:21 Changed 3 weeks ago by mtodt

Hi Ros

Thanks a lot for the advice! I've changed the remote host to localhost, but the error indeed remained the same.

Regarding ssh-ing, I have to enter my MOSRS password when using ssh, but according to the setup instructions that should be the case, shouldn't it? When I manually enter the command that apparently causes the error message, I don't have to enter anything. So, I don't quite understand why it causes the failure.

I haven't installed conda or the ANTS environment myself; instead, I've set ANCIL_ENVIRONMENT_PATH to Patrick's directory as he suggested above. Might that be an issue? Should I rather install conda and ANTS with it as outlined here?

Many thanks again for your help!
Markus

comment:22 Changed 3 weeks ago by ros

Hi Markus,

It should only prompt on interactive ssh, so that should be OK. Cylc is managing to submit the suite, so I think it's fine.

You could try editing the script run by the ancilMask task (I can't remember the name off the top of my head, and Monsoon is down for the day now) to remove the xcs reference and just do a normal cp instead.

[If you really want to debug & fix the scp problem then I can only suggest adding a -vvv to the command to get debug info.]

Regards,
Ros.

comment:23 Changed 3 weeks ago by pmcguire

Hi Markus:
Is the scp working yet?

Otherwise, did it help to change the path from the non-existent anaconda3 path in your directory to the anaconda3 path in my directory?
Patrick

comment:24 Changed 3 weeks ago by mtodt

Hi Ros

Thanks a lot! I'll try the debug option first once MONSOON is back.

Best
Markus

comment:25 Changed 3 weeks ago by mtodt

Hi Patrick

Changing the anaconda path didn't have an effect (yet), but I'll keep it in nonetheless.

Best
Markus

comment:26 Changed 9 days ago by mtodt

Hi Ros, Patrick

I'm still working on figuring out how to get the ANTS suite going. I haven't yet found the reason why the scp command doesn't work for me. Since it's used at the end of the task for a log file, I thought I might just do it manually and move on to the next task. That indeed works, but then subsequent tasks ancilOrog and ancilVegfrac fail with the same error. After that, though, the vegetation-related tasks ancil_lct, ancil_lct_postproc_c4, ancil_lai, and ancil_canopy_heights succeed when I trigger them manually. I'll move on to the next tasks now.

Best
Markus

Last edited 8 days ago by mtodt

comment:27 Changed 8 days ago by mtodt

Hi Ros, Patrick

There are some soil-related tasks after the vegetation tasks, and the first one fails with the following error:

AttributeError: module 'ants.utils.cube' has no attribute 'fix_mask'
[FAIL] python_env \${CONTRIB_PATH}/SoilParameters/ancil_soils.py ${source} \
[FAIL] --lct-ancillary ${vegfrac} --soils-lookup ${soils_lookup} \
[FAIL] -o ${output} --ants-config ${config} <<'__STDIN__'
[FAIL] 
[FAIL] '__STDIN__' # return-code=1

I checked cylc-run/u-cd393/src/ants/lib/ants/utils/cube.py, and it indeed doesn’t contain that function. I was wondering whether this could be due to different versions of ANTS or something like that? I haven't changed anything since copying Patrick's suite, though, so that would have to be an automatic setting.

Best
Markus

comment:28 Changed 8 days ago by pmcguire

Hi Markus:
Yes, I got the same error. See:

~/cylc-run/u-bv358try2/log/job/1/ancil_soils_hydr/01/job.err

You can see that the fix_mask function seems to have been introduced in ANTS in version 0.15. It wasn't in previous versions.
See, for example:
https://code.metoffice.gov.uk/doc/ancil/ants/0.15/_modules/ants/utils/cube.html

We're using version 0.13 for this suite.

Maybe that's enough to get started on figuring it out?

Patrick

comment:29 Changed 8 days ago by mtodt

Hi Patrick

Thanks a lot! Alright, I'll try moving to version 0.15 then.

Best
Markus

comment:30 Changed 8 days ago by pmcguire

Hi Markus:
That might not be so easy. I think ANTS is built into the anaconda installation of Python that is currently in my home directory. I guess we had version 0.13 installed there, though I am not sure.

Maybe it's easier to wait till I have the JASMIN version of the ANTS suite working. Or maybe there are other easier ways too.
Patrick

comment:31 Changed 8 days ago by mtodt

Hi Patrick

Thanks for the advice! I'll first try to run a copy of Martin Best's recent suite u-bx038 then, and I'll cross my fingers that it won't run into the same issue. I suppose it makes sense to set up my own conda environment, though.

Best
Markus

comment:32 Changed 5 days ago by mtodt

Hi Patrick, Ros

I've created a copy of Martin Best's u-bx038 and added the MONSOON-specific settings and files that Patrick had added in u-bv358. The resulting suite is u-cd739, which I've tested for vegetation and soil ancillaries only for now (ANCIL_CREATE_VEGFRAC = true). All tasks succeed except for ancilVegfrac, which features the same scp error message as before:

+ scp mtodt@xcslc1:/home/d00/mtodt/cylc-run/u-cd739/log/rose-suite-run.version /home/d00/mtodt/cylc-run/u-cd739/log/job/1/ancilVegfrac/01
Permission denied, please try again.
Permission denied, please try again.
Permission denied (publickey,password,keyboard-interactive,hostbased).

I'll go back to investigating this issue, but generally I'd say it's good news, since that issue seemed to be specific to me.

Best
Markus

comment:33 Changed 5 days ago by grenville

Hi Markus

Have you tried getting rid of the scp command and simply replacing it with cp (with the appropriate file and path names)?
Grenville

comment:34 Changed 5 days ago by mtodt

Hi Grenville

Not yet, that's what I'm trying to do now. But I can't find the right script in my suite directory. Would I have to create my own branch and change it in there?

Best
Markus

comment:35 Changed 5 days ago by pmcguire

Hi Markus:
It looks like your suite fails during the call to this script:

/projects/um1/ancil/vn9.1/cray/ancil/build/v1.1/bin/AncilScr_RoseVegfrac 

which calls this script:

/projects/um1/ancil/vn9.1/cray/ancil/build/v1.1/bin/AncilScr_RoseFinalise 

which has the scp in it that is causing the permission denied.

Maybe, you can do as Grenville suggests, and make your own copies of these scripts and point to them in your ancilVegfrac Rose/Cylc app, and change the scp to cp?
Patrick
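
[One way to sketch this fix: since the source and destination hosts are the same machine, 'scp user@host:SRC DST' can become 'cp SRC DST'. Because the central script under /projects/um1 shouldn't be edited, a hypothetical one-line stand-in in a temp directory is used here to demonstrate the substitution.]

```shell
# Make a working copy of the script and replace the remote scp with a local cp.
MY_BIN=$(mktemp -d)

# Stand-in for a copy of AncilScr_RoseFinalise containing the failing scp line.
cat > "$MY_BIN/AncilScr_RoseFinalise" <<'EOF'
scp mtodt@xcslc1:/home/d00/mtodt/cylc-run/u-cd393/log/rose-suite-run.version /home/d00/mtodt/cylc-run/u-cd393/log/job/1/ancilVegfrac/01
EOF

# Drop the 'scp user@host:' prefix, keeping the local paths intact.
sed -i 's/scp [^ :]*:/cp /' "$MY_BIN/AncilScr_RoseFinalise"
cat "$MY_BIN/AncilScr_RoseFinalise"
```

The suite's ancilVegfrac app would then need to point at the modified local copies rather than the central scripts.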

comment:36 Changed 5 days ago by mtodt

Hi Patrick

Thanks for pointing out the scripts! Seeing the "Finalise" script makes sense, as multiple tasks in my previous suite failed with that error message. I'll try creating my own versions then.

Best
Markus

comment:37 Changed 4 days ago by mtodt

  • Resolution set to worksforme
  • Status changed from new to closed

Hi Patrick, Grenville, Ros

The suite is working now after I created and modified a local copy of the scripts. I suppose I should still look into why the scp command fails for me but not for Patrick, but I'll close the ticket now since I have a working solution.

Thanks a lot for your help!
Markus
