Opened 3 months ago

Closed 3 months ago

Last modified 3 months ago

#3045 closed help (fixed)

archive_integrity failed on old files

Reported by: ChrisWells
Owned by: willie
Component: Data
Keywords: MASS
Cc:
Platform: Monsoon2
UM Version:

Description

Hi,

I have some suites which all failed a few hours ago on archive_integrity, pointing out that some files are missing from MASS, but the files they mention aren't ones from the year the model is currently on; they're from around 50 years prior, which the model ran about a month ago.

E.g. in suite u-bm503, which is on 2274 currently, file moose:/crum/u-bm503/ap4.pp/bm503a.p42224feb.pp is missing. And u-bm502, on 2274 also, is missing bm502a.pd2222feb.pp.

The other suites affected are u-bm504, u-bm505, and u-bm798. All the failed archives seem to be from around 20th September.
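
To see how large the gap is on a given stream, a rough check like the one below could be run over a listing of that stream's directory. It is only a sketch: it assumes the monthly file names follow the pattern seen above (`bm503a.p42224feb.pp`, read here as run id, `a`, stream letter, four-digit year, three-letter month), and that the list of names has been obtained separately, for example from a `moo ls` listing. The function name and arguments are made up for illustration.

```python
import re

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Assumed monthly file pattern: <runid>a.p<stream><YYYY><mon>.pp
NAME_RE = re.compile(
    r"^(?P<runid>\w+?)a\.p(?P<stream>\w)(?P<year>\d{4})(?P<mon>[a-z]{3})\.pp$"
)


def missing_months(names, runid, stream, first, last):
    """Return the (year, month) pairs in [first, last] that have no file.

    names  : iterable of file names or full moose: paths
    first  : (year, month_number) of the first month to check, inclusive
    last   : (year, month_number) of the last month to check, inclusive
    """
    present = set()
    for name in names:
        # Keep only the final path component before matching.
        match = NAME_RE.match(name.rsplit("/", 1)[-1])
        if (match and match["runid"] == runid
                and match["stream"] == stream and match["mon"] in MONTHS):
            present.add((int(match["year"]), MONTHS.index(match["mon"]) + 1))

    missing = []
    year, month = first
    while (year, month) <= last:
        if (year, month) not in present:
            missing.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return missing
```

Calling it with the listing for `ap4.pp`, run id `bm503`, stream `4` and the range of months around the outage would give the months that need to be regenerated on that stream.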

I'm not sure why these missing files weren't flagged up before.

Do you know what I should do about this?

Cheers,
Chris

Change History (9)

comment:1 Changed 3 months ago by willie

Hi Chris,

You may have been caught up in the MASS failure about 20 September. See the Yammer announcement.

Willie

comment:2 Changed 3 months ago by ChrisWells

Hi Willie,

I found this Yammer post https://www.yammer.com/ic.ac.uk/#/Threads/show?threadId=335513885073408 from 20th September - I guess my files were caught up in that, but do you know where I should go from here? I'll need to use the data once the runs have finished - is there a way of recovering them?

Cheers,
Chris

comment:3 Changed 3 months ago by willie

Hi Chris,

Unfortunately you can't share Yammer links, but I think it has the right id. You should contact Roger Milton at the Met Office. Some data is recoverable and some not.

Willie

comment:4 Changed 3 months ago by ChrisWells

Hi,

Thanks for the suggestion Willie; I contacted Roger and my data was indeed caught up in the outage.

So there is a small gap in the data output from 6 of my simulations - this will make downloading and analysing the data a pain, although it's such a short time that missing the actual data itself won't change my results (a couple of months in 150 years).

Is there a way of re-running the suites for a short timeframe, once they're completed, to "iron out" these gaps? My first thought is to just:

- download the start dumps from when the runs stopped archiving
- run the model from then until the end of the gaps

Would this work? Or would postproc have an issue when it tries to upload files which already exist, e.g. a file on a stream which didn't corrupt?

Also, if a month has corrupted, will that have affected the seasonal average covering that month (and the yearly average)?

Cheers,
Chris

comment:5 Changed 3 months ago by willie

  • Component changed from UM Model to Data
  • Keywords MASS added
  • Owner changed from um_support to willie
  • Platform set to Monsoon2
  • Status changed from new to accepted

Hi Chris,

Yes that would work. You would need the start dump from the previous known good cycle and start from there. The archiving will overwrite existing files but with identical data.

Regarding the seasonal averages, the answer is it depends on whether the MASS crash occurred after the seasonal average was complete or not. Since we don't know this it would be safest to assume that the seasonal averages were corrupt too.
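
As a rough aid for deciding how far back a re-run needs to start, the sketch below lists the months that make up the season containing a given missing month. It assumes the usual DJF/MAM/JJA/SON seasons, which may not match how a particular suite's climate meaning is configured, and the function name is hypothetical. The same reasoning applies to the yearly mean, but where the meaning year starts depends on the suite settings, so that is not coded here.

```python
def season_months(year, month):
    """Return the three (year, month) pairs of the season containing the input.

    Assumes seasons are DJF, MAM, JJA and SON, so a season starts in
    December, March, June or September.
    """
    # Number of months since the start of the season: Dec -> 0, Jan -> 1, Feb -> 2, ...
    offset = month % 3
    # Work with a running month index so December of the previous year is handled.
    start = year * 12 + (month - 1) - offset
    result = []
    for index in range(start, start + 3):
        y, m0 = divmod(index, 12)
        result.append((y, m0 + 1))
    return result
```

For a missing February 2224 file, for instance, this gives December 2223 to February 2224, so a re-run intended to regenerate that seasonal mean would need to cover all three months.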

Willie

comment:6 Changed 3 months ago by ChrisWells

  • Resolution set to fixed
  • Status changed from accepted to closed

Hi Willie,

Many thanks for the info - good to hear that I can overwrite the files on MASS by re-running the suite. I'll have to wait for them to finish before I go back and do this, but I'll err on the side of caution and re-run each one for a couple of years around the time they failed, to make sure it overwrites each aps and apy file that may be corrupted, as well as adding in the ones which were removed.

I'll close this ticket now.

Cheers,
Chris

comment:7 Changed 3 months ago by ChrisWells

Hi Willie,

Sorry to quickly reopen this - I've realised that my suites, which are running for ~80 years more, will keep failing on archive_integrity every 10 years due to these missing files.

I can see that in suite conf → Testing, I could turn off archive_integrity - if I were to pause the suites, make this change, and run rose suite-run --restart, would it update this? Or maybe rose suite-run --reload?

Cheers,
Chris

comment:8 Changed 3 months ago by willie

Hi Chris,

As long as the archive_integrity hasn't actually started, hold the task, modify the suite and then rose suite-run --reload. Then you can release the task.
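
The order of operations can be written out as a small sketch. Only `rose suite-run --reload` comes from this ticket; the `cylc hold` and `cylc release` forms, and the `archive_integrity.*` task pattern, are assumptions based on Cylc 7 usage and should be checked against `cylc hold --help` on your system. The helper name is hypothetical, and it only builds the commands rather than running them.

```python
def reload_steps(suite, task_glob="archive_integrity.*"):
    """Return the hold / reload / release commands as argv lists, in order.

    The suite configuration should be edited in the working copy before
    the reload step is run.
    """
    return [
        # Assumed Cylc 7 form: cylc hold SUITE TASK_GLOB
        ["cylc", "hold", suite, task_glob],
        # From the ticket; run from the suite's working copy after editing it
        ["rose", "suite-run", "--reload"],
        # Assumed Cylc 7 form: cylc release SUITE TASK_GLOB
        ["cylc", "release", suite, task_glob],
    ]


if __name__ == "__main__":
    for argv in reload_steps("u-bm503"):
        print(" ".join(argv))
```

Printing the commands first, rather than executing them, leaves room to make the suite change by hand between the hold and the reload.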

Willie

comment:9 Changed 3 months ago by ChrisWells

Hi Willie,

Thanks for that - will do.

Cheers,
Chris
