#3054 closed help (fixed)

Suite stopped

Reported by: ChrisWells Owned by: ros
Component: Rose/Cylc Keywords:
Cc: c.wells17@… Platform: Monsoon2
UM Version:

Description

(Originally posted on #3029)

Hi Ros,

I thought I'd put this on here as it might be related - it's about 1 of the suites affected here (u-bl918); the others are running ok.

The suite stopped on 23460401, with no warning - just stopped with submitted. When I run rose suite-run --restart, the GUI appears with the tasks, including some 2345 postproc tasks on submitted, and then immediately goes blank on stopped.

I just tried it again - gcylc u-bl918 showed stopped with submitted again, and restarting opened the gui showing 23460401 coupled is running! qstat then shows the task coupled.2346040 is Running, and gcylc shows "stopped with running" now.

I also got this in /var/mail/chwel after this most recent restart:

suite event: aborted
reason: 23450101T0000Z
suite: u-bl918
host: xcslc0
port: 43044
owner: chwel

But I can't see any tasks for that time period in the gui.

So I'm confused as to what this suite is doing - do you know what I should do with it?

Cheers,
Chris

Change History (12)

comment:1 Changed 11 months ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

It looks like it's got itself into a bit of a pickle - currently not sure why. There are errors in the suite log like:

2019-10-25T11:47:07Z ERROR - 23450101T0000Z
        Traceback (most recent call last):
          File "/common/fcm/cylc-7.8.3/lib/cylc/scheduler.py", line 247, in start
            self.run()
          File "/common/fcm/cylc-7.8.3/lib/cylc/scheduler.py", line 1596, in run
            self.process_task_pool()
          File "/common/fcm/cylc-7.8.3/lib/cylc/scheduler.py", line 1241, in process_task_pool
            if meth():
          File "/common/fcm/cylc-7.8.3/lib/cylc/task_pool.py", line 1061, in remove_spent_tasks
            self.remove(itask)
          File "/common/fcm/cylc-7.8.3/lib/cylc/task_pool.py", line 511, in remove
            del self.pool[itask.point][itask.identity]
        KeyError: 23450101T0000Z

As far as I can see, all the 23450101T0000Z tasks have succeeded. It's 23460101 that has failed postproc tasks. Unfortunately there are no suite/log files from before today, which is odd, and means I can't see what caused the suite to shut down originally.
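As a rough aid for this kind of log check, here is a minimal Python sketch that pulls the ERROR and CRITICAL entries out of log text. It assumes the "<timestamp>Z LEVEL - message" line layout seen in the lines quoted in this ticket; the function name and the sample text are illustrative only and are not part of Cylc or Rose.

```python
import re

# Matches lines such as "2019-10-25T11:47:07Z ERROR - 23450101T0000Z".
# The layout is assumed from the log lines quoted in this ticket.
LOG_LINE = re.compile(r"^(\S+Z) (ERROR|CRITICAL) - (.*)$")


def find_problems(log_text):
    """Return (timestamp, level, message) for each ERROR/CRITICAL line."""
    hits = []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line)
        if match:
            hits.append(match.groups())
    return hits


# Sample assembled from lines quoted in this ticket (not a real log file).
sample = """\
2019-10-25T11:47:07Z ERROR - 23450101T0000Z
        Traceback (most recent call last):
        KeyError: 23450101T0000Z
2019-10-31T09:47:16Z CRITICAL - failed/EXIT
"""

problems = find_problems(sample)
for timestamp, level, message in problems:
    print(timestamp, level, message)
```

Indented traceback lines are skipped because they do not start with a timestamp, so only the headline entries are reported.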

I'll have to look at this more. Please don't do anything else with this suite while we're investigating.

Cheers,
Ros.

comment:2 Changed 11 months ago by ros

Hi Chris,

So it looks like when the suite stopped, for whatever reason, it failed to stop cleanly and is now trying to remove a task from the pool which is no longer there.
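To illustrate the failure mode, the following is a toy Python sketch, not Cylc's actual code: it uses a nested dictionary with the same shape as the one in the traceback (cycle point, then task identity) and shows why deleting from a cycle point that is no longer in the pool raises KeyError. The task names, the pool contents and the tolerant variant are assumptions made for the example.

```python
# Toy stand-in for a task pool: {cycle point: {task identity: state}}.
# This is NOT cylc's TaskPool class, only the same nested-dict shape.
pool = {"23460101T0000Z": {"postproc.23460101T0000Z": "failed"}}


def remove(pool, point, identity):
    """Mirror the failing statement from the traceback."""
    del pool[point][identity]


def remove_if_present(pool, point, identity):
    """Tolerant variant (hypothetical): return True if a task was removed."""
    tasks = pool.get(point)
    if tasks is None or identity not in tasks:
        return False
    del tasks[identity]
    if not tasks:
        # Drop the cycle point once its last task has gone.
        del pool[point]
    return True


try:
    remove(pool, "23450101T0000Z", "coupled.23450101T0000Z")
    raised = False
except KeyError as exc:
    # The missing key is the cycle point, as in the suite log.
    raised = True
    missing_key = exc.args[0]

removed = remove_if_present(pool, "23450101T0000Z", "coupled.23450101T0000Z")
```

Here the first call fails on the missing cycle point, while the tolerant variant simply reports that there was nothing to remove.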

To get it going again I would try doing a warm start. It is possible to delete the offending entries from the suite database, but since I can't read the database file I can't tell you exactly what commands to run.

To do a warm start you have to tell the suite to start from a specified point. This assumes that everything before that point has succeeded.

rose suite-restart -- --warm <CYCLEPOINT>
where <CYCLEPOINT> is the point you wish to start from (i.e. the first incomplete cycle)

You will need to check in the logs to determine which was the last cycle to have finished successfully and then restart from the next one. I think you need to start from 23460101T0000Z, but you will need to double check as I have not checked all the tasks from the previous cycle.
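As a sketch of that check, assuming you have already collected each cycle's task states from the logs into a dictionary, the Python below picks out the earliest cycle with any task that has not succeeded. The function name and the example states are hypothetical, loosely based on what has been described in this ticket, and are not read from the suite itself.

```python
def first_incomplete_cycle(statuses):
    """Return the earliest cycle point with any task not 'succeeded'.

    statuses maps cycle point -> {task name: state}. Cycle points written
    as YYYYMMDDTHHMMZ sort correctly as plain strings.
    """
    for point in sorted(statuses):
        if any(state != "succeeded" for state in statuses[point].values()):
            return point
    return None


# Hypothetical states, for illustration only.
example = {
    "23450101T0000Z": {"coupled": "succeeded", "postproc": "succeeded"},
    "23460101T0000Z": {"coupled": "succeeded", "postproc": "failed"},
    "23460401T0000Z": {"coupled": "submitted"},
}

start_point = first_incomplete_cycle(example)
print(start_point)
```

The result is only as good as the states fed in, so every task in the preceding cycle still needs to be confirmed as succeeded before warm starting from the returned point.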

Regards,
Ros.

comment:3 Changed 11 months ago by ros

CMS Note:
For reference an issue associated with this error has been raised on github - https://github.com/cylc/cylc-flow/issues/3424

comment:4 Changed 11 months ago by ChrisWells

  • Cc c.wells17@… added

Hi Ros,

Thanks for that - I tried running from 23460101T0000Z (the command was rose suite-run -- --warm 23460101T0000Z in the end) but got this error:

[FAIL] No restart data available in NEMO restart directory:
  /home/d00/chwel/cylc-run/u-bl918/share/data/History_Data/NEMOhist
[FAIL] run_model # return-code=144
2019-10-31T09:47:16Z CRITICAL - failed/EXIT

That folder has lots of nc files from that time, but it doesn't have files like the ones on moose:/crum/u-bl918/oda.file, which are what I would think of as restart files - but I'm not fully sure what the model is looking for here.

I tried a couple of other dates (23450701T0000Z, 23430101T0000Z) but these didn't work either.

Do you know what I should do to get round this? Should I download the 2346 restart files from MASS?

Cheers,
Chris

comment:5 Changed 11 months ago by ros

Hi Chris,

If the restart files are no longer in the directory then yes you will need to download them from MASS.

Cheers,
Ros.

comment:6 Changed 11 months ago by ChrisWells

Hi Ros,

Thanks - I'll do that and see if it works.

I'm not sure if this is related, but another suite, u-bm798, has stopped with "ready"; if I run rose suite-run --restart it says stopped with succeeded, but it still has decades left to run - is there a way I can look into this further and get the suite continuing?

Cheers,
Chris

comment:7 Changed 11 months ago by ChrisWells

Hi Ros,

Having an identical issue with u-bm502.

Cheers,
Chris

comment:8 Changed 11 months ago by ros

Hi Chris,

Can you tell me what you were doing with u-bm502 before it went down, please? From the logs it looks like you were trying to reset the status of archive_integrity.22940101T0000Z - is that correct?

And also u-bm798 - it looks like you tried to restart it several times, but there are no suite logs from before today for me to look at, so I can't see what's going on.

Regards,
Ros.

comment:9 Changed 11 months ago by ChrisWells

Hi Ros,

Yes, suite u-bm502 got caught up in the corruption of some MASS files a few weeks ago - since the team removed the corrupted files, archive_integrity, which was set to run every 10 years, fails since those files have been removed. I did pause some of my suites and turn off archive_integrity, but must've missed u-bm502.

I don't think I was doing anything to u-bm798 before, certainly nothing major, but it may have been the same issue as u-bm502.

Looking in the GUI I can see that I did turn off archive integrity for both these suites, but must've not held and reloaded them, so it tried to run it - I don't know if this might be the source of the error, for at least u-bm502 but maybe both.

Cheers,
Chris

comment:10 Changed 11 months ago by ros

Hi Chris,

The cause is definitely related to the archive_integrity task. I don't think switching tasks off/on and reloading removes the task from any cycles that already have it loaded, so that may have been part of the issue. Anyway, you will need to start the runs again from the current cycle, either with a warm start or by pointing the suite to the appropriate start dumps and start time and doing a new run with rose suite-run.

Cheers,
Ros.

comment:11 Changed 11 months ago by ros

  • Reporter changed from ros to ChrisWells

comment:12 Changed 11 months ago by ChrisWells

  • Resolution set to fixed
  • Status changed from accepted to closed

Hi Ros,

Thanks for the info - I've pointed the suites to start dumps from the last year they uploaded them all to MASS, and they all seem to be running fine.

So thanks for the help with that - I'll close this ticket now.

Cheers,
Chris
