Opened 6 months ago

Closed 6 months ago

Last modified 6 months ago

#3029 closed help (fixed)

Suites keep stopping

Reported by: ChrisWells Owned by: ros
Component: Rose/Cylc Keywords:
Cc: Platform: Monsoon2
UM Version:

Description

Hi,

I have some suites that have been running for a long time (u-bl918, u-bm502, u-bm503, u-bm504, u-bm505, u-bm798) and they keep stopping, with the message in gcylc "stopped with running". When I run rose suite-run --restart on them, they start again, but they stop again some time later. This is extending the time the suites will take to run (e.g. I just had to restart 5 out of 6, which had all stopped on Friday).

Do you know why this is happening, and how I can prevent it?

Cheers,
Chris

Attachments (1)

u-bm502.png (232.4 KB) - added by ChrisWells 6 months ago.


Change History (19)

comment:1 Changed 6 months ago by grenville

Chris

At what stage do they stop?

Grenville

Changed 6 months ago by ChrisWells

comment:2 Changed 6 months ago by ChrisWells

Hi Grenville,

I'm not sure how to tell - I've attached a screenshot of u-bm502 which I just restarted; it seems some postproc tasks were stopped in the 22470401 block - is there a better way for me to tell?

Cheers,
Chris

comment:3 Changed 6 months ago by ros

Hi Chris,

I've just been looking at this and I can't see any helpful messages in any of the logs of the couple of suites I've looked at. In the meantime if any of them stop again please don't restart it but let us know so that we can take a look.

Cheers,
Ros.

comment:4 Changed 6 months ago by ChrisWells

Hi Ros,

Thanks, will do.

Cheers,
Chris

comment:5 Changed 6 months ago by ros

Hi Chris,

I suspect something happened on xcslc1 which caused the 5 cylc daemons that you had running on there to stop. u-bm505 is running from the other login node, which would explain why it didn't need restarting. Monsoon are currently making enquiries, as another user noticed some oddities with xcslc1 on Friday.

When login nodes are rebooted (e.g. definitely last Monday when they were both restarted) you will always need to restart the suite(s), as the controlling daemons will be killed (any tasks already running on the compute/serial nodes continue to run, but subsequent tasks won't be submitted).
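
As a rough illustration of the point above, the sketch below lists the suites that would need restarting after an incident on one login node. The five suites on xcslc1 and u-bm505 on the other node follow this ticket; naming that other node xcslc0 is an assumption, and suites_needing_restart is a hypothetical helper, not part of Rose or Cylc.

```python
from typing import Dict, List, Set


def suites_needing_restart(suite_hosts: Dict[str, str],
                           affected_hosts: Set[str]) -> List[str]:
    """Return suites whose controlling daemon ran on an affected login node."""
    return sorted(
        suite for suite, host in suite_hosts.items() if host in affected_hosts
    )


# Suite-to-host mapping as described in this ticket.
# Assumption: the "other login node" is xcslc0.
suite_hosts = {
    "u-bl918": "xcslc1",
    "u-bm502": "xcslc1",
    "u-bm503": "xcslc1",
    "u-bm504": "xcslc1",
    "u-bm798": "xcslc1",
    "u-bm505": "xcslc0",
}

stopped = suites_needing_restart(suite_hosts, {"xcslc1"})
for suite in stopped:
    # Only print a reminder; the restart itself is done by hand,
    # using the command quoted in this ticket.
    print(f"{suite}: restart with 'rose suite-run --restart'")
```

The helper only reports which suites are affected; it deliberately does not restart anything, since a stopped suite may need to be inspected first.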

Regards,
Ros.


comment:6 Changed 6 months ago by ChrisWells

Hi Ros,

Thanks for the info - this has happened fairly often recently though, maybe every few days; have there been reboots that frequently? The simulations are running fine currently - I'll update on this ticket if they do stop again.

Cheers,
Chris

comment:7 Changed 6 months ago by ros

  • Owner changed from um_support to ros
  • Platform set to Monsoon2
  • Status changed from new to accepted

Hi Chris,

There were various issues, not necessarily reboots, occurring on 13/09, 27/09, 30/09 & 04/10 that eerily coincide with the times your suites have stopped. I'll leave this ticket open for now; please do update it if any of them stop in the same way again, including the IDs of the affected suites.

Cheers,
Ros.

comment:8 Changed 6 months ago by ChrisWells

Hi Ros,

Suite u-bm505 has stopped on running - I've left it as stopped.

Cheers,
Chris

comment:9 Changed 6 months ago by ros

Hi Chris,

Thanks. The Met Office are looking into this.

Regards,
Ros.

comment:10 Changed 6 months ago by ros

Hi Chris,

It looks like the cause of the problem is an over-zealous house-keeping job that the HPC runs to kill old languishing processes. I have no information at present as to why this started happening or why, to my knowledge, it appears to be affecting only you, but the HPC team are reviewing and monitoring the process.

Regards,
Ros.

comment:11 Changed 6 months ago by ChrisWells

Hi Ros,

Thanks for looking into that - should I keep u-bm505 stopped? And if others stop, should I restart them or keep them stopped?

Cheers,
Chris

comment:12 Changed 6 months ago by ros

  • Component changed from UM Model to Rose/Cylc

Hi Chris,

Sorry, yes, you can restart u-bm505 and any others should they stop again. I'll keep this ticket open for now and will check back with you in a week or so to see if things have improved.

Cheers,
Ros.

comment:13 Changed 6 months ago by ros

  • Status changed from accepted to pending

comment:14 Changed 6 months ago by ChrisWells

Hi Ros,

Great, will do, thanks.

Cheers,
Chris

comment:15 Changed 6 months ago by ChrisWells

Hi Ros,

All the suites mentioned above stopped and had to be restarted last Wednesday 9th October, but since then they have all run fine without stopping, so I think it's alright for you to close this ticket.

Cheers,
Chris

comment:16 Changed 6 months ago by ros

  • Resolution set to fixed
  • Status changed from pending to closed

Hi Chris,

Thanks for letting me know - that's great news. I shall close this ticket for now, but if the issue comes back please re-open it.

Cheers
Ros

comment:17 Changed 6 months ago by ChrisWells

Hi Ros,

I thought I'd put this on here as it might be related - it's about 1 of the suites affected here (u-bl918); the others are running ok.

The suite stopped on 23460401, with no warning - just stopped with submitted. When I run rose suite-run --restart, the gui appears with the tasks, including some 2345 postproc tasks on submitted, and then immediately goes blank on stopped.

I just tried it again - gcylc u-bl918 showed stopped with submitted again, and restarting opened the gui showing 23460401 coupled is running! qstat then shows the task coupled.2346040 is Running, and gcylc shows "stopped with running" now.

I also got this in /var/mail/chwel after this most recent restart:

suite event: aborted
reason: 23450101T0000Z
suite: u-bl918
host: xcslc0
port: 43044
owner: chwel

But I can't see any tasks for that time period in the gui.
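
For reference, a minimal sketch of reading the notification quoted above into named fields, which makes it easier to compare the reported cycle point with what the gui shows. It assumes the mail body is plain "key: value" lines exactly as quoted; parse_suite_event is a hypothetical helper, not a Cylc function.

```python
from typing import Dict

# Mail body copied from the message quoted in this comment.
MAIL_BODY = """\
suite event: aborted
reason: 23450101T0000Z
suite: u-bl918
host: xcslc0
port: 43044
owner: chwel
"""


def parse_suite_event(text: str) -> Dict[str, str]:
    """Split each 'key: value' line at the first colon; skip other lines."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


event = parse_suite_event(MAIL_BODY)
print(event["suite"], event["suite event"], event["reason"])
```

Here the "reason" field carries a cycle point (23450101T0000Z) rather than a descriptive message, and that is the time period that cannot be found in the gui.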

So I'm confused as to what this suite is doing - do you know what I should do with it?

Cheers,
Chris

comment:18 Changed 6 months ago by ros

Moved to #3054
