#2370 closed help (fixed)

all suites stopped

Reported by: s.varma13 Owned by: um_support
Component: UM Model Keywords:
Cc: Platform: Monsoon2
UM Version: 10.8

Description

Hi, all my suites have stopped (u-au329, u-au333, u-au341 and u-au342). Could you please let me know how I should restart them?

Attachments (1)

Doc.docx (1.3 MB) - added by s.varma13 17 months ago.
Screenshot of rose sgc for one suite

Change History (17)

comment:1 Changed 17 months ago by willie

Hi Sunil,

Monsoon is down at the moment - see http://collab.metoffice.gov.uk/twiki/bin/view/Support/LatestNews. There is advice on Yammer about stuck suites, reproduced here

Stuck Jobs in Rose

If the usual "cylc shutdown" or "rose suite-shutdown" doesn't work you may need to do the following:

ssh to exvmscylc (not exvmsrose!) and run

ps -fu ${USER}

kill <PID>

Here <PID> is the PID of the python process for your suite whose PPID is 1.

Then remove the port file:

rm $HOME/.cylc/ports/<suiteid>

If no processes remain, also delete your ~/cylc-run/<suiteid>/.service/contact file. (Normally, these files are removed automatically by cylc, but they may linger after an abnormal shutdown.)

rm ~/cylc-run/<suiteid>/.service/contact
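
As a rough sketch of the process-hunting step above, the filter below picks out PIDs whose parent is 1 and whose command line mentions both python and the suite name. The function names are hypothetical, and it is an assumption that the suite name appears in the command line of the suite's python process, so check the listing by eye before killing anything.

```shell
# Sketch only: helper names are hypothetical, not part of cylc or rose.

# Read "PID PPID COMMAND..." lines on stdin and print the PID of each line
# whose parent PID is 1 and whose command mentions python and the suite name.
filter_suite_pids() {
    awk -v suite="$1" '$2 == 1 && /python/ && index($0, suite) > 0 { print $1 }'
}

# List candidate PIDs for one suite among your own processes.
# Assumption: the suite name appears in the process command line.
list_suite_pids() {
    ps -u "${USER:-$(id -un)}" -o pid=,ppid=,args= | filter_suite_pids "$1"
}
```

Run on exvmscylc, list_suite_pids <suiteid> prints nothing when no suite process is left, in which case only the file clean-up remains.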

Regards
Willie

comment:2 Changed 17 months ago by s.varma13

Hi Willie

Thanks a lot for this. Any idea when Monsoon will come back online? I tried the Yammer link for further updates but could not log in.

Are the instructions above meant to kill the job completely or to get it moving again? If the former, and I want to run it from the point where it stopped, do I change the start dump file location in astart to the last one archived to moose, set the model basis time to the same date, switch off build and reconfigure, and run rose suite-run? And if the latter, do I just restart with rose suite-run --restart?

Best wishes

Sunil

comment:3 Changed 17 months ago by ros

Hi Sunil,

Monsoon came back online earlier. There are some aspects which are still offline - as detailed on the above news posting.

Just do what you usually do to restart a stopped suite (i.e. rose suite-run --restart). It will pick up from where it left off - no need to change anything.
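
A minimal sketch of that restart, assuming (as elsewhere in this ticket) that the suite's working copy lives under ~/roses/<suiteid>; the wrapper name restart_suite is hypothetical:

```shell
# Sketch only: restart_suite is a hypothetical wrapper, not a rose command.
# Assumes the suite's working copy is in ~/roses/<suiteid>.
restart_suite() {
    suite="$1"
    if [ ! -d "$HOME/roses/$suite" ]; then
        echo "no suite directory: $HOME/roses/$suite" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is unchanged.
    # --restart resumes from where the suite stopped; nothing is edited.
    ( cd "$HOME/roses/$suite" && rose suite-run --restart )
}
```

For several suites this could be looped, e.g. for s in u-au329 u-au333 u-au341 u-au342; do restart_suite "$s"; done.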

Cheers,
Ros.

comment:4 Changed 17 months ago by ros

P.S.
Regarding the Yammer group - the link is an internal one. You should have been sent an invitation to Yammer last year to access the external collaboration group. If you have not received an invitation, let us know and we will ask the Monsoon folks to send you another invitation.

comment:5 Changed 17 months ago by s.varma13

Thanks a lot Ros - I did not receive a Yammer link. Could you please ask the Monsoon folks to send me another invite?

Cheers

Sunil

comment:6 Changed 17 months ago by ros

Hi Sunil,

Yes, I've sent Monsoon a request.

Cheers,
Ros.

comment:7 Changed 17 months ago by s.varma13

Hi Ros

Many thanks.

One thing: when I tried to restart one of the runs, the following failure occurred because the suite is still running:

[FAIL] Suite "u-au329" has running processes on: exvmscylc.monsoon-metoffice.co.uk
[FAIL] Try "cylc stop 'u-au329'" first?

Just checking: should I do the above? Do I just type "cylc stop u-au329" on the command line at [suvar@exvmsrose:~/roses/u-au329]$ and then rose suite-run --restart?

Many thanks.

Sunil

comment:8 Changed 17 months ago by ros

Hi Sunil.

Yes. You will probably find that the command fails, in which case follow the instructions that Willie sent.

Cheers,
Ros

comment:9 Changed 17 months ago by s.varma13

Hi Ros/Willie

So I first try "cylc shutdown" or "rose suite-shutdown" at [suvar@exvmsrose:~/roses/u-au329]$ and then do rose suite-run --restart, or do I have to change the model basis time and astart start dump location to the same date and then rose suite-run?

And if "cylc shutdown" or "rose suite-shutdown" does not work, then

ssh to exvmscylc and run

ps -fu ${USER}; do I type this as is, or replace USER with suvar, so ps -fu suvar?

kill <PID>

This is the PID for the PPID=1 python process for your suite. Where do I find this?

Then remove the port file:

rm ~/.cylc/ports/u-au329

If no processes remain, also delete your ~/cylc-run/SUITE/.service/contact files as well. (Normally, these files are removed automatically by cylc, but they may be lingering due to an abnormal shut down.)

rm ~/cylc-run/u-au329/.service/contact

Now do I do rose suite-run --restart, or change the model basis time and astart start dump location and then rose suite-run?

Thank you.

Sunil

comment:10 Changed 17 months ago by ros

Hi Sunil,

You do not need to change the model basis time to restart a suite. This is the whole point of Rose/Cylc. rose suite-run --restart will automatically pick up from where the suite has previously stopped.

The PID is found by running the ps -flu ${USER} command as detailed. Please try running the commands as per the instructions and it should all fall into place.

Cheers,
Ros.

comment:11 Changed 17 months ago by s.varma13

Hi Ros, thank you very much.

comment:12 Changed 17 months ago by s.varma13

Hi Ros/Willie

So I ran ps -flu ${USER} and there is no PPID=1 in the list, so I am not sure which PID to kill. Could you let me know which one I should kill from the list? Many thanks. Sunil

[suvar@exvmscylc:~]$ ps -flu ${USER}
F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD
5 S suvar 583 578 0 80 0 - 30633 poll_s 09:37 ? 00:00:00 sshd:
0 S suvar 596 583 0 80 0 - 30742 wait 09:37 pts/1 00:00:00 -bash
0 R suvar 726 596 0 80 0 - 30669 - 09:39 pts/1 00:00:00 ps -f

comment:13 Changed 17 months ago by ros

Hi Sunil,

You don't have any suite processes running, so move on to the next instruction. Remove the port file, if there is one, and the contact file.
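
The clean-up step can be sketched as a small helper; clean_suite_files is a hypothetical name, and the two paths are the ones given in the instructions earlier in this ticket. Using rm -f keeps the step quiet when there is nothing to remove.

```shell
# Sketch only: clean_suite_files is a hypothetical helper name.
# Removes the leftover port and contact files for one suite.
clean_suite_files() {
    suite="$1"
    if [ -z "$suite" ]; then
        echo "usage: clean_suite_files <suiteid>" >&2
        return 1
    fi
    # rm -f does not complain when a file (or its directory) is absent.
    rm -f "$HOME/.cylc/ports/$suite" \
          "$HOME/cylc-run/$suite/.service/contact"
}
```

Only run this once you are sure no process for the suite is still alive, since the contact file is how a running suite is located.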

Cheers,
Ros.

comment:14 Changed 17 months ago by s.varma13

Hi Ros,

I tried to remove the port file:

rm $HOME/.cylc/ports/u-au329

but I do not have a .cylc directory. I assume that is again because I have no processes remaining. Just checking that this is right and that I should now remove ~/cylc-run/<suiteid>/.service/contact?

Many thanks. Sunil

[suvar@exvmsrose:~]$ rm $HOME/.cylc/ports/u-au329
rm: cannot remove `/home/d04/suvar/.cylc/ports/u-au329': No such file or directory
[suvar@exvmsrose:~]$ pwd
/home/d04/suvar
[suvar@exvmsrose:~]$ ls -ltr -a
total 196
-rwxr-xr-x. 1 suvar mo_users 1446 Mar 20 2017 .xinitrc.template
-rw-r--r--. 1 suvar mo_users 1940 Mar 20 2017 .xim.template
-rw-r--r--. 1 suvar mo_users 6043 Mar 20 2017 .muttrc
-rw-r--r--. 1 suvar mo_users 861 Mar 20 2017 .inputrc
-rw-r--r--. 1 suvar mo_users 18251 Mar 20 2017 .gnu-emacs
drwxr-xr-x. 2 suvar mo_users 4096 Mar 20 2017 .fonts
-rw-r--r--. 1 suvar mo_users 1637 Mar 20 2017 .emacs
-rw-r--r--. 1 suvar mo_users 1255 Jul 18 2017 .profile
drwx------. 3 suvar mo_users 4096 Jul 18 2017 .gnupg
drwxr-xr-x. 2 suvar mo_users 4096 Jul 18 2017 bin
drwx------. 3 suvar mo_users 4096 Sep 25 15:45 .config
drwxr-xr-x. 3 suvar mo_users 4096 Sep 27 15:23 .subversion
drwxr-xr-x. 2 suvar mo_users 4096 Sep 27 15:33 .metomi
-rw-r--r--. 1 suvar mo_users 1457 Sep 27 15:43 .bashrc
drwx------. 2 suvar mo_users 4096 Oct 18 13:47 .ssh
drwxr-xr-x. 3 suvar mo_users 4096 Oct 18 13:48 meta
drwxr-xr-x. 4 suvar mo_users 4096 Oct 23 17:33 .mozilla
drwxr-xr-x. 2 suvar mo_users 4096 Oct 23 17:33 Desktop
drwxr-xr-x. 2 suvar mo_users 4096 Oct 23 17:33 .fontconfig
-rw-------. 1 suvar mo_users 77 Oct 23 19:57 .lesshst
drwxr-xr-x. 3 suvar mo_users 4096 Oct 24 13:57 .gnome2
drwx------. 4 suvar mo_users 4096 Oct 24 13:57 .cache
drwxr-xr-x. 2 suvar mo_users 4096 Oct 24 17:29 .moosedir
drwxr-xr-x. 8 suvar mo_users 4096 Oct 30 12:25 suvar
drwxr-xr-x. 2 suvar mo_users 4096 Nov 3 12:52 .vim
drwxr-xr-x. 3 suvar mo_users 4096 Nov 9 23:34 umui_runs
drwxr-xr-x. 2 suvar mo_users 12288 Nov 9 23:34 output
-rw-r--r--. 1 suvar mo_users 142 Nov 26 16:38 .xconvrc
-rw-------. 1 suvar mo_users 4530 Jan 11 11:54 nohup.out
drwxr-xr-x. 84 root root 4096 Jan 17 14:33 ..
drwxr-xr-x. 46 suvar mo_users 4096 Jan 20 18:29 cylc-run
drwxr-xr-x. 32 suvar mo_users 4096 Jan 24 15:22 roses
-rw-------. 1 suvar mo_users 9688 Jan 24 18:05 .viminfo
-rw-------. 1 suvar mo_users 3311 Jan 26 09:50 .Xauthority
drwxr-xr-x. 22 suvar mo_users 4096 Jan 26 09:50 .
-rw-------. 1 suvar mo_users 16271 Jan 26 09:52 .bash_history

comment:15 Changed 17 months ago by s.varma13

Suites have all restarted - many thanks.

Sunil

comment:16 Changed 17 months ago by willie

  • Resolution set to fixed
  • Status changed from new to closed