Opened 9 months ago

Closed 8 months ago

#2758 closed help (fixed)

Rose gui not working on xcs

Reported by: ChrisWells
Owned by: um_support
Component: Rose/Cylc
Keywords:
Cc:
Platform: Monsoon2
UM Version:

Description

Hi,

As exvmsrose will be retired on 12 February, I tried using the Rose GUI on xcs instead. But when I try to copy a suite I get the attached error. Copying still works on exvmsrose.

Do you know what this error means?

Cheers,
Chris

Attachments (1)

Rose.png (305.8 KB) - added by ChrisWells 9 months ago.


Change History (40)

Changed 9 months ago by ChrisWells

comment:1 Changed 9 months ago by dcase

Chris,
the error isn't attached, but it's almost certainly because XCS is being patched at the moment.

The message from the Yammer group is:

"Announcement: Patching XCS starting 04:00 Wednesday 6th - 11:00 Thursday 7th February 2019

We will be patching XCS and its associated Lustre file-systems with the latest updates available from Cray. This will result in an extended outage to both Monsoon and NEXCS.

Starting at 0400 local all user work will be drained. At 0800 local access to the systems will be made unavailable and patching will begin.

The scale of the patching is extensive. We will return the machine to service as soon as possible, but the system may be unavailable until the next morning. Keep an eye on this Yammer group for updates. XCE and XCF will be available as usual."

I think it should be working again now though, so please try again.

Dave

comment:2 Changed 9 months ago by ChrisWells

Hi Dave,

I think I attached it just now, rose.png? This error was only this morning, after this post on Yammer:

"just to let you know that the XCS machine is now back and accepting work again. If you notice any problems, please report them."

I'm still getting the error if I try now.

Cheers,
Chris

comment:3 Changed 9 months ago by dcase

Ah yes. Sorry.

These instructions will help: you want the bit about caching the MOSRS password:
https://collab.metoffice.gov.uk/twiki/bin/view/Support/RetirementOfRoseCylcVMs

comment:4 Changed 9 months ago by ChrisWells

Hi Dave,

No worries, thanks for the link - I've done that and it works now.

Cheers,
Chris

comment:5 Changed 9 months ago by dcase

  • Resolution set to fixed
  • Status changed from new to closed

comment:6 Changed 9 months ago by ChrisWells

Hi Dave,

Sorry, I'm still having issues. I tried to run a suite (u-bf458) and it just opens the gcylc window with "stopped with 'submit-failed'" with nothing running.

Also 2 suites I had running already (u-bf240, u-bf244) give me this error when I try and gcylc them (from xcs or exvmsrose):

Warning: cylc version mismatch!

Suite running with u'7.7.2'.
gcylc at '7.8.1'.

and both suites have paused their running. When I have the gcylc windows open, the message

/usr/lib64/python2.6/site-packages/requests/packages/urllib3/connection.py:337: SubjectAltNameWarning: Certificate for exvmscylc.monsoon-metoffice.co.uk has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning

continually prints, only in xcs (not in exvmsrose).

Sorry, do you know what I might have done wrong with this?

Cheers,
Chris

comment:7 Changed 9 months ago by ChrisWells

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:8 Changed 9 months ago by dcase

If you type cylc --version into your terminal, it will probably tell you that you are on 7.8.1. You can type:

export CYLC_VERSION=7.7.2

and then check cylc --version to see that you have rolled it back to the one that you were using before.

If this is the version that you want, you can put this export in your .profile (or whichever startup file your shell reads).
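
For anyone following along, here is a minimal sketch of adding that export to a startup file without duplicating it on repeated runs. The function name pin_cylc_version and the default file ~/.profile are illustrative assumptions; only the export line itself comes from this ticket.

```shell
# Hypothetical helper: append "export CYLC_VERSION=<version>" to a shell
# startup file unless that exact line is already present.
pin_cylc_version() {
    version=$1
    startup_file=${2:-"$HOME/.profile"}   # assumed default; adjust to your setup
    line="export CYLC_VERSION=$version"
    # -q quiet, -x whole-line match, -F fixed string (no regex)
    if ! grep -qxF -- "$line" "$startup_file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$startup_file"
    fi
}
```

Something like pin_cylc_version 7.7.2 followed by a fresh login and cylc --version should then show whether the setting is being picked up.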

comment:9 Changed 9 months ago by ChrisWells

Thanks, I ran that on both exvmsrose and xcs:

[chwel@exvmsrose:~]$ cylc --version
7.8.1
[chwel@exvmsrose:~]$ export CYLC_VERSION=7.7.2
[chwel@exvmsrose:~]$ cylc --version
7.7.2
[chwel@exvmsrose:~]$ qstat
-bash: qstat: command not found
[chwel@exvmsrose:~]$ exit
logout
Connection to exvmsrose closed.
chwel@xcslc0:~> cylc --version
7.8.1
chwel@xcslc0:~> export CYLC_VERSION=7.7.2
chwel@xcslc0:~> cylc --version
7.7.2

I also added that line to my .profile. I'm not sure which version I want; I just want those suites to continue and to be able to submit new ones. The running suites are still waiting, and trying to run u-bf458 now gives a "SuiteStillRunning" error, even though it's not running (qstat on xcs still returns nothing).

And when I ssh into exvmsrose or xcs, cylc --version gives 7.8.1.

Sorry, I don't really understand what's happened here - do you know how I can fix this?

Cheers,
Chris

comment:10 Changed 9 months ago by dcase

This is all getting fiddly, but as a work-around:

We can stick with 7.7.2 for now, so do everything below with this version on all computers. I don't know why ssh-ing wouldn't pick up the export in your .profile, but you may have other files (.bash_profile, .bashrc etc) and I can't remember the order in which they are sourced. You can type it into the terminal for the moment if you are working interactively and adding to these doesn't work.
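
On the sourcing order: assuming the login shell is bash, a login shell reads /etc/profile and then only the first readable file out of ~/.bash_profile, ~/.bash_login and ~/.profile, so an export in .profile is skipped whenever one of the other two exists. A small sketch to check which file would be used (the function name is made up for illustration):

```shell
# Print the per-user startup file a bash login shell would read:
# the first readable one of ~/.bash_profile, ~/.bash_login, ~/.profile.
# The optional argument (a home directory) is only there to ease testing.
bash_login_startup_file() {
    home_dir=${1:-$HOME}
    for name in .bash_profile .bash_login .profile; do
        if [ -r "$home_dir/$name" ]; then
            printf '%s\n' "$home_dir/$name"
            return 0
        fi
    done
    return 1   # none of the three files is present
}
```

If this prints something other than ~/.profile, the export would need to go in that file instead (or that file would need to source ~/.profile).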

Can you go to the computer on which you started the jobs, exvmsrose, and stop them? If this computer will die anyway, you'd be better off doing this.
If so, can you go to xcs-c and restart the jobs from there?

If this doesn't work, I can consult some (more knowledgeable) colleagues after lunch.

comment:11 Changed 9 months ago by ChrisWells

Sorry, I'm not sure how I would restart the jobs when I've stopped them? Is there just a command line option for that?

Cheers,
Chris

comment:12 Changed 9 months ago by dcase

Yes. Go to the appropriate roses directory for your suite on the new computer, with cylc 7.7.2, and then type:

rose suite-run --restart

Hopefully that'll go.

comment:13 Changed 9 months ago by ChrisWells

Sorry, I can't seem to stop the runs. With export CYLC_VERSION=7.7.2 set, I tried

cylc stop u-bf240
cylc stop u-bf240 --kill
cylc stop u-bf240 --now --now

(from https://metomi.github.io/rose/doc/html/cheat-sheet.html)

But none of these seem to work - gcylc still shows them going, and if I try and restart in xcs I'm told they are still running.

Sorry for the hassle,

Cheers,
Chris

comment:14 Changed 9 months ago by dcase

2 quick things:

  • if you can get the GUI, can you not bring the suite up and terminate it through that?
  • could you try rearranging the command to be cylc stop --now --now u-bf240 ?

The second point may make no difference, but it may be worth a try if the GUI can't be used.

comment:15 Changed 9 months ago by ChrisWells

Pressing Stop in the GUI doesn't seem to have an effect, and that rearranged command doesn't seem to either.

It might be relevant that when I run the stop command in xcs I get:

/usr/lib64/python2.6/site-packages/requests/packages/urllib3/connection.py:337: SubjectAltNameWarning: Certificate for exvmscylc.monsoon-metoffice.co.uk has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning

But when I run it in exvmsrose I get no return (as expected?). But neither of these seem to actually stop the run.

Cheers,
Chris

comment:16 Changed 9 months ago by dcase

If contact has been lost, you can kill things manually. There are instructions here:

https://collab.metoffice.gov.uk/twiki/bin/view/Support/TroubleshootingStuckJobsInRose

On exvmsrose, ps -flu chwel shows a few jobs still running, so follow the process in the instructions for your suites on this computer, and then see if you can get things up and running again on the new one (xcs-c).

comment:17 Changed 9 months ago by ChrisWells

I've done that for the processes that had PPID = 1 and were named python, so now all that's left is

5 S chwel    11502 11497  0  80   0 - 30635 poll_s 16:05 ?        00:00:00 sshd: chwel@pts/110
0 S chwel    11527 11502  0  80   0 - 31022 n_tty_ 16:05 pts/110  00:00:00 -bash
5 S chwel    13460 13455  0  80   0 - 30635 poll_s 16:14 ?        00:00:00 sshd: chwel@pts/3
0 S chwel    13469 13460  0  80   0 - 31025 wait   16:14 pts/3    00:00:00 -bash
1 S chwel    13514     1  0  80   0 - 28124 poll_s 16:14 ?        00:00:00 gpg-agent --daemon --allow-preset-passphrase --batch --max-cache-ttl 43200 --write
0 R chwel    13713 13469  0  80   0 - 30669 -      16:15 pts/3    00:00:00 ps -flu chwel

But in gcylc the runs are still there.
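
As a rough illustration of the filter described above (parent PID 1 and a command named python), here is a sketch that only prints matching PIDs and kills nothing. The function name is hypothetical, and this is not a substitute for the linked troubleshooting instructions.

```shell
# Read "PID PPID COMMAND" lines on stdin and print the PIDs of processes
# whose parent is init (PPID 1) and whose command name starts with "python".
orphaned_python_pids() {
    awk '$2 == 1 && $3 ~ /^python/ { print $1 }'
}

# Example use (prints only; nothing is killed):
#   ps -u "$USER" -o pid= -o ppid= -o comm= | orphaned_python_pids
```

Reviewing the printed PIDs by hand before killing anything seems the safer route, since other daemons (gpg-agent in the listing above) also have PPID 1.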

comment:18 Changed 9 months ago by dcase

In the instructions it tells you how to remove the port and service files too. I can see that /cylc-run/u-bf240/.service/contact still exists, for example (this was the one you mention above). Check all of these steps for all of your hanging suites.
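
A quick way to see which suites still have a contact file, sketched under the assumption that run directories live under ~/cylc-run and follow the <suite>/.service/contact layout mentioned above. The function name is made up, and it only reports; removing the files should follow the linked instructions.

```shell
# List suites under a cylc-run directory that still have a .service/contact
# file. This only reports; it does not delete anything.
suites_with_contact_file() {
    run_dir=${1:-"$HOME/cylc-run"}   # assumed location of the run directories
    for contact in "$run_dir"/*/.service/contact; do
        [ -e "$contact" ] || continue   # the glob matched nothing
        suite_dir=${contact%/.service/contact}
        printf '%s\n' "${suite_dir##*/}"
    done
}
```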

comment:19 Changed 9 months ago by dcase

Also, sorry, the daemon is still running on exvmscylc. Please kill the two processes here, and go on to delete the files as in the instructions. This should provide suitably Carthaginian destruction.

comment:20 Changed 9 months ago by ChrisWells

Thanks for that - sorry for not following the instructions fully.

The job has successfully stopped! And I've restarted the 2 that I had, but I get this error on the postproc it's trying to do:

Exiting - Directory does not exist: /home/d00/chwel/cylc-run/u-bf244/work/20290901T0000Z/atmos_main

The directory doesn't exist - presumably it needs the atmos_main files to archive, but this folder isn't there anymore - is there something I can do to get round that?

Cheers,
Chris

comment:21 Changed 9 months ago by dcase

If you have a recent dump you may be able to take a step back and run from there and recalculate missing data.
Otherwise I'll look at things at some point tomorrow.

comment:22 Changed 9 months ago by ros

Hi Chris,

If you haven't already tried to run from the latest dumps please hold off from doing so. It's just dawned on me what's happened and it should be fixable without rerunning the last cycle. We'll get back to you with instructions shortly.

Cheers,
Ros.

comment:23 Changed 9 months ago by dcase

I don't know if the answer came to Ros in a prophetic dream, but apparently the configuration has changed slightly with regard to the /work directory. You can fix this by adding root-dir{work}=*=/projects/slpec/$USER to the top of roses/SUITEID/rose-suite.conf .

Then:

  • stop the suite
  • restart it
  • trigger the postproc
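
The configuration edit in the first step could be scripted roughly as below. This is a sketch only: the function name is hypothetical, and it simply puts the setting quoted above on the first line of whichever rose-suite.conf you point it at, skipping the edit if that exact line is already there.

```shell
# Hypothetical helper: put the root-dir{work} setting on the first line of a
# rose-suite.conf file, unless the file already contains that exact line.
add_work_root_dir() {
    conf=$1
    # Single quotes keep $USER literal, exactly as written in the comment above.
    setting='root-dir{work}=*=/projects/slpec/$USER'
    if grep -qxF -- "$setting" "$conf"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    { printf '%s\n' "$setting"; cat "$conf"; } > "$tmp" &&
        cat "$tmp" > "$conf"    # rewrite in place, keeping the file's permissions
    status=$?
    rm -f "$tmp"
    return $status
}
```

For example, add_work_root_dir ~/roses/u-bf244/rose-suite.conf, assuming the working copy sits under ~/roses. The stop, restart and trigger steps are still done by hand afterwards.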

comment:24 Changed 9 months ago by ChrisWells

Thanks! I think I've done that right; postproc is in "running" now on both, so that seems to have worked! Thanks Dave and Ros for that.

In future, if I wanted to submit a new run, would I just have to make sure I'm using the right version of cylc? Or do I just open a new window?

Cheers,
Chris

comment:25 Changed 9 months ago by ChrisWells

Hi,

The first postproc block which failed has worked, but when it moved onto the 2nd postproc block, the same error occurred - the atmos_main directory doesn't exist for the next month.

Will a similar trick work?

Cheers,
Chris

comment:26 Changed 9 months ago by ros

Hi Chris,

Damn. Now I've overlooked the fact that the next cycle (20291001T0000Z) atmos_main task has of course now run with the new link in place and put the atmos_main directory in the other place! :-(

I haven't tried this but I think it should work.

Please copy the atmos_main directory into place by running:

cp -r /working/d00/chwel/cylc-run/u-bf244/work/20291001T0000Z/atmos_main /projects/slpec/chwel/cylc-run/u-bf244/work/20291001T0000Z

No need to stop and restart the suite, just retrigger the failed postproc task.

The good news is the atmos_main directory for the next cycle is in the correct place so the postproc for the November cycle will be fine.

Sorry about this.
Cheers,
Ros.

comment:27 Changed 9 months ago by ChrisWells

Hi Ros,

That seems to have done it! Thanks for that. No worries, just one quick question before this can be closed: I have another suite ready to run - can I just run it from xcs and all will be fine? And should my version of cylc just be the current one?

Cheers,
Chris

comment:28 Changed 9 months ago by ChrisWells

Hi,

Sorry, I'm still having one of the issues I raised in an earlier comment: "I tried to run a suite (u-bf458) and it just opens the gcylc window with "stopped with 'submit-failed'" with nothing running."

This still occurs, and I can see in ~/cylc-run/u-bf458/log.20190208T114616Z/job/20200601T0000Z/fcm_make_pp/01/job-activity.log I have

[jobs-submit cmd] (init exvmsrose)
[jobs-submit ret_code] 1
[jobs-submit err] REMOTE INIT FAILED

And also ~/cylc-run/u-bf458/log.20190208T114616Z/suite/log.20190208T114634Z is full of errors.
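
To find every job whose submission failed in this way, one option is to search the job-activity.log files for a non-zero ret_code line like the one above. A sketch, with a made-up function name and the assumption that the logs all sit somewhere under the directory passed in:

```shell
# List job-activity.log files under a log directory that record a non-zero
# "[jobs-submit ret_code]" line, sorted for stable output.
failed_submissions() {
    log_dir=$1
    find "$log_dir" -name job-activity.log \
        -exec grep -lE '^\[jobs-submit ret_code\] [1-9][0-9]*$' {} + | sort
}
```

For example, failed_submissions ~/cylc-run/u-bf458 would print the path of each matching log, which can then be read for the accompanying [jobs-submit err] line.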

Do you know what might be causing this?

Cheers,
Chris

comment:29 Changed 9 months ago by dcase

Chris, I just tried to read your log files, but /projects/slpec (where your cylc directory is linked to) is unresponsive. I've asked the Met Office about this, and will let you know when I have positive news.

Dave

comment:30 Changed 9 months ago by dcase

Chris,

there are problems with the Met Office computer, and they're rebooting at the moment. I hope you don't mind, but I'm going to work on something else this afternoon. You've said above that you can see the Yammer, so you can follow their progress. Also, you have backup dumps on MASS, do you not? Perhaps you could restart from an appropriate place when the computer trouble blows over?

I'll check back on Monday, when hopefully things are running more reliably.

Dave

comment:31 Changed 9 months ago by ChrisWells

Hi Dave,

No worries! Just to clarify, the 2 runs I had running before are now running fine after Ros' fix. The run that's not working is a new run, which hasn't run at all yet. So I will just submit that when the Met Office computer issues are sorted, and I'll post on here if it doesn't work then.

Many thanks for your help,
Chris

comment:32 Changed 8 months ago by ChrisWells

Hi,

These suites ran for a while, but were stopped this morning (on housekeeping). I resubmitted and they started working for a bit, but are now stuck on postproc, submit-failed, and the next block is on atmos-main, also submit-failed.

When I ask for the error, I get (for both suites)

ERROR: file not found: /home/d00/chwel/cylc-run/u-bf240/log/job/20300401T0000Z/postproc/07/job.err

There is a file

/home/d00/chwel/cylc-run/u-bf240/log/job/20300401T0000Z/postproc/07/job-activity.log

with

[jobs-submit cmd] cylc jobs-submit --utc-mode -- /home/d00/chwel/cylc-run/u-bf240/log/job 20300401T0000Z/postproc/07
[jobs-submit ret_code] 1
[jobs-submit out] 2019-02-11T15:00:07Z|20300401T0000Z/postproc/07|1|None
2019-02-11T15:00:07Z [STDERR] mkstemp: No such file or directory
2019-02-11T15:00:07Z [STDERR] qsub: could not create/open tmp file /working/d00/chwel/jtmp/tmp.cWylxKFXlS/pbsscrptFpW1IS for script

in, which I guess is the problem - do you know how I can get these suites running again, and to continue?

Cheers,
Chris

comment:33 Changed 8 months ago by ros

Hi Chris,

This is the qsub bug that was messaged about last week. The only workaround is to resubmit.

Full details on this bug can be found here: https://collab.metoffice.gov.uk/twiki/bin/view/Support/Monsoon2BugInQsub

comment:34 Changed 8 months ago by ChrisWells

Hi Ros,

Thanks - when it says resubmit, does that mean to stop and then run

rose suite-run --restart

? I did that, and it seems to be stuck on

[INFO] export CYLC_VERSION=7.7.2
[INFO] export ROSE_ORIG_HOST=xcslc0
[INFO] export ROSE_SITE=
[INFO] export ROSE_VERSION=2019.01.0
[INFO] delete: log/rose-suite-run.conf
[INFO] symlink: rose-conf/20190211T152252-restart.conf <= log/rose-suite-run.conf
[INFO] delete: log/rose-suite-run.version
[INFO] symlink: rose-conf/20190211T152252-restart.version <= log/rose-suite-run.version

Also, I can't log in to postproc, which I need to do to use NCO commands - do you know how I can get into postproc?

Cheers,
Chris

comment:35 Changed 8 months ago by ros

Hi Chris

The suite is stuck because it is still trying to contact exvmsrose, which is currently experiencing problems along with postproc. The Met Office HPC team are aware and working on it.

To fix your suite (which you will need to do ahead of the VMs being taken down), make the following change to the site/MONSooN.rc:

Replace host = 'exvmsrose' with host = {{ROSE_ORIG_HOST}} in the [[EXTRACT_RESOURCE]] section.

I would also advise checking your other suites for this too.
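
One way to check the other suites, sketched on the assumption that the working copies live under ~/roses and each keeps the file at site/MONSooN.rc. The function name is hypothetical, and it only lists files that still mention exvmsrose; the edit itself is left to be done by hand.

```shell
# List site/MONSooN.rc files under a roses directory that still mention
# exvmsrose. Reports only; nothing is modified.
suites_using_exvmsrose() {
    roses_dir=${1:-"$HOME/roses"}   # assumed location of the working copies
    for rc in "$roses_dir"/*/site/MONSooN.rc; do
        [ -f "$rc" ] || continue    # the glob matched nothing
        if grep -q 'exvmsrose' "$rc"; then
            printf '%s\n' "$rc"
        fi
    done
}
```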

Cheers,
Ros.

comment:36 Changed 8 months ago by ChrisWells

Hi Ros,

Cheers for the info. I've done that to those 2, and another suite I want to run - do you know when I should try and resubmit them?

Cheers,
Chris

comment:37 Changed 8 months ago by ros

Hi Chris,

You can resubmit them now - that change I detailed means that the suite is now not dependent on the Rose/Cylc VMs.

Cheers,
Ros.

comment:38 Changed 8 months ago by ChrisWells

Hi Ros,

Great, cheers! They seem to be running now.

Cheers,
Chris

comment:39 Changed 8 months ago by ros

  • Component changed from Monsoon to Rose/Cylc
  • Platform set to Monsoon2
  • Resolution set to fixed
  • Status changed from reopened to closed

Hi Chris,

Glad to hear the suites are running now.

I'm going to close this ticket as it's beginning to get a bit long and unwieldy and the original problems have been resolved. If you have any further problems please open a new ticket.

Thanks.
Cheers,
Ros.
