Opened 3 years ago

Closed 3 years ago

#2052 closed help (fixed)

suite won't shutdown or run

Reported by: shakka Owned by: ros
Component: Rose Keywords:
Cc: Platform: MONSooN
UM Version: 10.4

Description

Hello,

I am trying to re-run suite u-ah710 at a higher resolution than I have used previously. However, when I try to run it, it says there are still active processes on gcylc (even though the job has finished and archived all the data from the previous run). When I try to shut the suite down (i.e. $ rose suite-shutdown --name=u-ah710), it fails with exit code 1.

How can I solve this? I would like to re-run the model as soon as possible.

Thanks,

Ella

Change History (9)

comment:1 Changed 3 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Ella,

It sounds like the suite has shut down uncleanly. On exvmsrose try running:

ps -flu <username> | grep u-ah710

where <username> is your monsoon username.

Kill all the processes for this suite. You will probably need to do the same on exvmscylc as well, as I can see cylc-run is still running for your suite there. Once you've killed the suite processes you should be able to start it up again.

Cheers,
Ros.

comment:2 Changed 3 years ago by shakka

Hi Ros,

Thanks for this. Sorry to be slow, but how do I kill these processes? I don't want to inadvertently delete anything I shouldn't.

Thanks,

Ella

comment:3 Changed 3 years ago by ros

Hi Ella,

You kill them with the unix kill command. E.g.

exvmscylc$ ps -flu elgil | grep u-ah710
1 S elgil     4999     1  0  80   0 - 209205 futex_ Jan06 ?       00:00:48 python /data/local/fcm/cylc-6.11.2/bin/cylc-run u-ah710

exvmscylc$ kill 4999

You may need to use kill -9 <processid> if the process is stubborn. The process ID is the number in the 4th column, next to the username (4999 in the above case).

Cheers,
Ros.

Last edited 3 years ago by ros
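The PID lookup Ros describes can be sketched as a small shell function. This is a minimal sketch, assuming the ps -fl column layout shown in the example above (PID in field 4); the name suite_pids_from_ps is hypothetical and not part of Rose or Cylc. It also skips the grep line, because ps -flu elgil | grep u-ah710 usually lists the grep command itself, and that PID should not be killed.

```shell
#!/bin/sh
# Hypothetical helper (not part of Rose/Cylc): read `ps -flu <username>` output
# on stdin and print the PID of every process whose line mentions the suite name.
# In `ps -fl` output the columns are F S UID PID PPID ..., so the PID is field 4.
suite_pids_from_ps() {
  # index() is a literal substring test, so characters in the suite name are not
  # treated as a regular expression. Lines mentioning grep or awk are skipped:
  # those are the filter commands themselves, which also contain the suite name.
  awk -v s="$1" 'index($0, s) > 0 && $0 !~ /grep|awk/ { print $4 }'
}

# Example, on exvmscylc:
#   ps -flu elgil | suite_pids_from_ps u-ah710
```

Each printed PID can then be passed to kill as shown above.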

comment:4 Changed 3 years ago by shakka

Hi Ros,

I've killed all the processes but the suite still won't shut down. Can you suggest anything further?

Ella

comment:5 Changed 3 years ago by ros

Hi Ella,

Killing all the processes is the brute-force way of shutting down the suite when rose suite-shutdown doesn't work. Will the suite run now? If not, can you send me the command you are running and the error messages you are getting?

Cheers,
Ros.

comment:6 Changed 3 years ago by shakka

Hi Ros,

When I try to run the suite it says: 'SuiteStillRunningError: suite "u-ah710" still has processes running on exvmscylc.monsoon-metoffice.co.uk. Try "rose suite-shutdown --name=u-ah710" first?'

If I try to do that I get:

Really shutdown u-ah710 at exvmscylc.monsoon-metoffice.co.uk? [y or n (default)] y
WARNING: ignoring bad site config /data/local/fcm/cylc-6.11.2/conf/global.rc:
Illegal item: [task events]register job logs retry delays
security reasons
[FAIL] cylc shutdown u-ah710 --force --host=exvmscylc.monsoon-metoffice.co.uk # return-code=1

comment:7 Changed 3 years ago by ros

Hi Ella,

You still have process 4999 running on exvmscylc. Please run:

kill -9 4999 and then check that it has successfully been killed by running the ps -flu elgil command again.

Cheers,
Ros.
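The kill-then-check sequence from this ticket can also be scripted. Below is a minimal sketch, assuming a POSIX-style shell; the function name stop_pid and its 5-second default grace period are hypothetical choices, not Rose or Cylc behaviour. It tries a plain kill first (SIGTERM, which gives the process a chance to clean up) and escalates to kill -9 (SIGKILL, which cannot be caught) only if the process is still alive. It then verifies with kill -0, which sends no signal and only tests whether the PID still exists.

```shell
#!/bin/sh
# Hypothetical helper (not part of Rose/Cylc): stop one process, trying a plain
# `kill` (SIGTERM) first and escalating to `kill -9` (SIGKILL) only if needed.
# Returns 0 if the PID is gone afterwards, non-zero if it survives.
stop_pid() {
  local pid="$1" wait_secs="${2:-5}" i=0   # grace period is an assumed default

  # `kill -0` sends no signal; it only reports whether we can signal the PID.
  kill -0 "$pid" 2>/dev/null || return 0   # no such process (or not ours): nothing to do

  kill "$pid" 2>/dev/null
  while [ "$i" -lt "$wait_secs" ]; do
    kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
    sleep 1
    i=$((i + 1))
  done

  echo "PID $pid ignored SIGTERM; sending SIGKILL" >&2
  kill -9 "$pid" 2>/dev/null
  sleep 1
  # Final check, the scripted equivalent of re-running `ps -flu <username>`.
  ! kill -0 "$pid" 2>/dev/null
}

# Example, on exvmscylc:
#   stop_pid 4999
```

Re-running ps -flu <username> | grep u-ah710 on both exvmsrose and exvmscylc afterwards is still the simplest way to confirm that nothing belonging to the suite is left.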

comment:8 Changed 3 years ago by shakka

Hi Ros,

Thanks so much for your help, the suite is running correctly now. I hadn't realised I'd missed one on exvmscylc.

Best,

Ella

comment:9 Changed 3 years ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed

No worries. Glad it's working now.

I shall close this query now.

Cheers,
Ros.
