#3045 closed help (fixed)
archive_integrity failed on old files
| Reported by: | ChrisWells | Owned by: | willie |
|---|---|---|---|
| Component: | Data | Keywords: | MASS |
| Cc: | | Platform: | Monsoon2 |
| UM Version: | | | |
Description
Hi,
I have some suites which all failed a few hours ago on archive_integrity, pointing out that some files are missing from MASS, but the files they mention aren't ones from the year the model is currently on; they're from around 50 years prior, which the model ran about a month ago.
E.g. in suite u-bm503, which is on 2274 currently, file moose:/crum/u-bm503/ap4.pp/bm503a.p42224feb.pp is missing. And u-bm502, on 2274 also, is missing bm502a.pd2222feb.pp.
The other suites affected are u-bm504, u-bm505, and u-bm798. All the failed archives seem to be from around 20th September.
I'm not sure why these missing files weren't flagged up before.
Do you know what I should do about this?
Cheers,
Chris
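One by-hand way to see which monthly files are absent for a stream is to generate the names that should exist and compare them with a listing from MASS. The following is a minimal sketch, not how archive_integrity itself works. It assumes the naming pattern inferred from the single example above (`<runid>a.p<stream><year><mon>.pp`), four-digit years without leading zeros, and a listing saved beforehand, for instance with `moo ls moose:/crum/u-bm503/ap4.pp > listing.txt`. The function names `expected_monthly` and `missing_files` are hypothetical.

```shell
# expected_monthly RUNID STREAM FIRST_YEAR LAST_YEAR
# Print one expected monthly file name per line.
# ASSUMPTION: names follow <runid>a.p<stream><year><mon>.pp, as in
# bm503a.p42224feb.pp; other streams may use a different pattern.
expected_monthly() (
    runid=$1 stream=$2 year=$3 last=$4
    while [ "$year" -le "$last" ]; do
        for mon in jan feb mar apr may jun jul aug sep oct nov dec; do
            printf '%sa.p%s%04d%s.pp\n' "$runid" "$stream" "$year" "$mon"
        done
        year=$((year + 1))
    done
)

# missing_files LISTING_FILE RUNID STREAM FIRST_YEAR LAST_YEAR
# LISTING_FILE holds one path or file name per line (e.g. saved `moo ls` output).
# Prints the expected names that do not appear in the listing.
missing_files() (
    listing=$1
    shift
    [ -r "$listing" ] || { echo "cannot read $listing" >&2; return 1; }
    expected_monthly "$@" | awk -v listing="$listing" '
        BEGIN {
            while ((getline line < listing) > 0) {
                sub(/.*\//, "", line)   # drop any directory prefix
                seen[line] = 1
            }
        }
        !($0 in seen)                   # keep expected names not in the listing
    '
)
```

For example, `missing_files listing.txt bm503 4 2220 2230` would print every monthly ap4 name in 2220 to 2230 that is not in `listing.txt`.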
Change History (9)
comment:1 Changed 15 months ago by willie
comment:2 Changed 15 months ago by ChrisWells
Hi Willie,
I found this Yammer post https://www.yammer.com/ic.ac.uk/#/Threads/show?threadId=335513885073408 from 20th September - I guess my files were caught up in that, but do you know where I should go from here? I'll need to use the data once the runs have finished - is there a way of recovering them?
Cheers,
Chris
comment:3 Changed 15 months ago by willie
Hi Chris,
Unfortunately you can't share Yammer links, but I think it has the right id. You should contact Roger Milton at the Met Office. Some data is recoverable and some not.
Willie
comment:4 Changed 15 months ago by ChrisWells
Hi,
Thanks for the suggestion Willie; I contacted Roger and my data was indeed caught up in the outage.
So there is a small gap in the data output from 6 of my simulations - this will make downloading and analysing the data a pain, although it's such a short time that missing the actual data itself won't change my results (a couple of months in 150 years).
Is there a way of re-running the suites for a short timeframe, once they're completed, to "iron out" these gaps? My first thought is to just
- download the start dumps from when the runs stopped archiving
- run the model from then until the end of the gaps
Would this work? Or would postproc have an issue when it tries to upload files which already exist, e.g. a file on a stream which didn't corrupt?
Also, if a month has corrupted, will that have affected the seasonal average covering that month (and the yearly average)?
Cheers,
Chris
comment:5 Changed 15 months ago by willie
- Component changed from UM Model to Data
- Keywords MASS added
- Owner changed from um_support to willie
- Platform set to Monsoon2
- Status changed from new to accepted
Hi Chris,
Yes that would work. You would need the start dump from the previous known good cycle and start from there. The archiving will overwrite existing files but with identical data.
Regarding the seasonal averages, the answer is it depends on whether the MASS crash occurred after the seasonal average was complete or not. Since we don't know this it would be safest to assume that the seasonal averages were corrupt too.
Willie
comment:6 Changed 15 months ago by ChrisWells
- Resolution set to fixed
- Status changed from accepted to closed
Hi Willie,
Many thanks for the info - good to hear that I can overwrite the files on MASS by re-running the suite. I'll have to wait for them to finish before I go back and do this, but I'll err on the side of caution and re-run each one for a couple of years around the time they failed, to make sure it overwrites each aps and apy file that may be corrupted, as well as adding in the ones which were removed.
I'll close this ticket now.
Cheers,
Chris
comment:7 Changed 15 months ago by ChrisWells
Hi Willie,
Sorry to quickly reopen this - I've realised that my suites, which are running for ~80 years more, will keep failing on archive_integrity every 10 years due to these missing files.
I can see that in suite conf → Testing, I could turn off archive_integrity - if I were to pause the suites, make this change, and run rose suite-run --restart, would it update this? Or maybe rose suite-run --reload?
Cheers,
Chris
comment:8 Changed 15 months ago by willie
Hi Chris,
As long as the archive_integrity hasn't actually started, hold the task, modify the suite and then rose suite-run --reload. Then you can release the task.
Willie
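The hold, edit, reload, release sequence described above could look roughly like the sketch below. Several things here are assumptions rather than details from the ticket: the Cylc 7 style `cylc hold SUITE TASK.CYCLE` and `cylc release SUITE TASK.CYCLE` syntax, the suite source living in `~/roses/<suite>`, and the cycle point in `TASK_ID`, which is made up for illustration. The helper names are hypothetical, and the script only prints the commands unless `DRY_RUN=0` is set.

```shell
# ASSUMPTIONS: Cylc 7 style task IDs (name.cycle), suite source in
# ~/roses/<suite>, and an invented cycle point. Check all three before use.
SUITE=${SUITE:-u-bm503}
TASK_ID=${TASK_ID:-archive_integrity.22800101T0000Z}  # hypothetical cycle point
DRY_RUN=${DRY_RUN:-1}                                 # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: hold the task. Only valid while it has not started running.
hold_task() {
    run cylc hold "$SUITE" "$TASK_ID"
}

# Step 2 (manual): edit the suite configuration to switch off archive_integrity.

# Step 3: reload the running suite with the edited configuration,
# then release the held task.
reload_and_release() {
    run rose suite-run --reload -C "$HOME/roses/$SUITE" &&
        run cylc release "$SUITE" "$TASK_ID"
}
```

Calling `hold_task`, making the edit, and then calling `reload_and_release` with the default `DRY_RUN=1` shows the three commands without touching the running suite.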
comment:9 Changed 15 months ago by ChrisWells
Hi Willie,
Thanks for that - will do.
Cheers,
Chris
Hi Chris,
You may have been caught up in the MASS failure about 20 September. See the Yammer announcement.
Willie