Opened 15 months ago
Closed 15 months ago
#3054 closed help (fixed)
Suite stopped
Reported by: | ChrisWells | Owned by: | ros |
---|---|---|---|
Component: | Rose/Cylc | Keywords: | |
Cc: | c.wells17@… | Platform: | Monsoon2 |
UM Version: |
Description
(Originally posted on #3029)
Hi Ros,
I thought I'd put this on here as it might be related - it's about 1 of the suites affected here (u-bl918); the others are running ok.
The suite stopped on 23460401, with no warning - just stopped with submitted. When I run rose suite-run --restart, the gui appears with the tasks, including some 2345 postproc tasks on submitted, and then immediately goes blank on stopped.
I just tried it again - gcylc u-bl918 showed stopped with submitted again, and restarting opened the gui showing 23460401 coupled is running! qstat then shows the task coupled.2346040 is Running, and gcylc shows "stopped with running" now.
I also got this in /var/mail/chwel after this most recent restart:
suite event: aborted
reason: 23450101T0000Z
suite: u-bl918
host: xcslc0
port: 43044
owner: chwel
But I can't see any tasks for that time period in the gui.
So I'm confused as to what this suite is doing - do you know what I should do with it?
Cheers,
Chris
Change History (12)
comment:1 Changed 15 months ago by ros
- Owner changed from um_support to ros
- Status changed from new to accepted
comment:2 Changed 15 months ago by ros
Hi Chris,
So it looks like when the suite stopped, for whatever reason, it failed to stop cleanly and is now trying to remove a task from the pool which is no longer there.
To get it going again I would try doing a warm start. It is possible to delete the offending entries from the suite database, but since I can't read the database file I can't tell you exactly what commands to run.
To do a warm start you have to tell the suite to start from a specified point. This assumes that everything before that point has succeeded.
rose suite-restart -- --warm <CYCLEPOINT>
where <CYCLEPOINT> is the point you wish to start from (i.e. the first incomplete cycle).
You will need to check in the logs to determine which was the last cycle to have finished successfully and then restart from the next one. I think you need to start from 23460101T0000Z, but you will need to double check as I have not checked all the tasks from the previous cycle.
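As a cross-check on reading the logs by hand, the first incomplete cycle can also be looked up in the suite's run database. The following is a minimal sketch, assuming a Cylc 7 style SQLite database containing a task_states table with name, cycle and status columns; the function name first_incomplete_cycle is hypothetical, and the schema should be confirmed against your own suite before relying on the result.

```python
import sqlite3


def first_incomplete_cycle(db_path):
    """Return the earliest cycle point with a task not in 'succeeded', or None.

    Assumes (not confirmed by the ticket) a table task_states(name, cycle, status).
    Cycle points such as 23460101T0000Z are fixed-width strings, so the
    lexicographic MIN() used here matches chronological order.
    """
    conn = sqlite3.connect(db_path)
    try:
        row = conn.execute(
            "SELECT MIN(cycle) FROM task_states WHERE status != 'succeeded'"
        ).fetchone()
    finally:
        conn.close()
    return row[0]
```

The value returned is only a candidate for <CYCLEPOINT>; it is still worth confirming against the job logs that everything before it really did finish.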
Regards,
Ros.
comment:3 Changed 15 months ago by ros
CMS Note:
For reference an issue associated with this error has been raised on github - https://github.com/cylc/cylc-flow/issues/3424
comment:4 Changed 15 months ago by ChrisWells
- Cc c.wells17@… added
Hi Ros,
Thanks for that - I tried running from 23460101T0000Z (the command was rose suite-run -- --warm 23460101T0000Z in the end) but got this error:
[FAIL] No restart data available in NEMO restart directory: /home/d00/chwel/cylc-run/u-bl918/share/data/History_Data/NEMOhist
[FAIL] run_model # return-code=144
2019-10-31T09:47:16Z CRITICAL - failed/EXIT
That folder has lots of nc files from that time, but it doesn't have files like the ones on moose:/crum/u-bl918/oda.file, which are what I would think of as restart files - but I'm not fully sure what the model is looking for here.
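To see quickly whether the directory holds anything that looks like a restart file, as opposed to ordinary nc output, a listing filtered by filename can help. This is a minimal sketch only: it assumes that restart files carry the string "restart" somewhere in their name, which the ticket does not confirm, and the helper name list_restart_candidates is hypothetical.

```python
from pathlib import Path


def list_restart_candidates(directory, marker="restart"):
    """Return sorted names of regular files in `directory` containing `marker`.

    The default marker is an assumption about how restart files are named;
    change it to match whatever the model actually expects.
    """
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir() if p.is_file() and marker in p.name
    )
```

An empty list for the NEMOhist directory would be consistent with the error above and would suggest the restart files need to be retrieved from MASS.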
I tried a couple of other dates (23450701T0000Z, 23430101T0000Z) but these didn't work either.
Do you know what I should do to get round this? Should I download the 2346 restart files from MASS?
Cheers,
Chris
comment:5 Changed 15 months ago by ros
Hi Chris,
If the restart files are no longer in the directory then yes you will need to download them from MASS.
Cheers,
Ros.
comment:6 Changed 15 months ago by ChrisWells
Hi Ros,
Thanks - I'll do that and see if it works.
I'm not sure if this is related, but another suite, u-bm798, has stopped with "ready"; if I run rose suite-run --restart it says stopped with succeeded, but it still has decades left to run - is there a way I can look into this further and get the suite continuing?
Cheers,
Chris
comment:7 Changed 15 months ago by ChrisWells
Hi Ros,
Having an identical issue with u-bm502.
Cheers,
Chris
comment:8 Changed 15 months ago by ros
Hi Chris,
Can you tell me what you were doing with u-bm502 before it went down, please? From the logs it looks like you were trying to reset the status of archive_integrity.22940101T0000Z - is that correct?
And also u-bm798 - it looks like you tried to restart it several times, but there are no suite logs for me to look at from before today, so I can't see what's going on.
Regards,
Ros.
comment:9 Changed 15 months ago by ChrisWells
Hi Ros,
Yes, suite u-bm502 got caught up in the corruption of some MASS files a few weeks ago - since the team removed the corrupted files, archive_integrity, which was set to run every 10 years, fails since those files have been removed. I did pause some of my suites and turn off archive_integrity, but must've missed u-bm502.
I don't think I was doing anything to u-bm798 before, certainly nothing major, but it may have been the same issue as u-bm502.
Looking in the GUI I can see that I did turn off archive integrity for both these suites, but must've not held and reloaded them, so it tried to run it - I don't know if this might be the source of the error, for at least u-bm502 but maybe both.
Cheers,
Chris
comment:10 Changed 15 months ago by ros
Hi Chris,
The cause is definitely related to the archive_integrity task. I don't think switching tasks off or on and reloading removes the task from any cycles that already have it loaded, so that may have been part of the issue. Anyway, you will need to start the runs again from the current cycle, either with a warm start or by pointing the suite at the appropriate start dumps and start time and doing a new run with rose suite-run.
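Whether instances of a switched-off task are still loaded can be checked by looking at the task pool recorded in the suite database. The following is a minimal sketch, assuming a Cylc 7 style SQLite database with a task_pool table that has name, cycle and status columns; the schema is an assumption and the function name cycles_holding_task is hypothetical.

```python
import sqlite3


def cycles_holding_task(db_path, task_name):
    """Return (cycle, status) pairs for `task_name` still present in the pool.

    Assumes (not confirmed by the ticket) a table task_pool(name, cycle, status).
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT cycle, status FROM task_pool WHERE name = ? ORDER BY cycle",
            (task_name,),
        ).fetchall()
    finally:
        conn.close()
    return rows
```

A non-empty result for archive_integrity after it has been switched off would be consistent with the behaviour described above, where already-loaded instances survive a reload.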
Cheers,
Ros.
comment:11 Changed 15 months ago by ros
- Reporter changed from ros to ChrisWells
comment:12 Changed 15 months ago by ChrisWells
- Resolution set to fixed
- Status changed from accepted to closed
Hi Ros,
Thanks for the info - I've pointed the suites to start dumps from the last year they uploaded them all to MASS, and they all seem to be running fine.
So thanks for the help with that - I'll close this ticket now.
Cheers,
Chris
It looks like it's got itself into a bit of a pickle - currently not sure why. There are errors in the suite log like:
As far as I can see, all the 23450101T0000Z tasks have succeeded. It's 23460101 that has failed postproc tasks. Unfortunately there are no suite/log files from before today, which is odd, and it means I can't see what caused the suite to shut down originally.
I'll have to look at this more. Please don't do anything else with this suite while we're investigating.
Cheers,
Ros.