Re: Cephalocon QA: RGW scrubbing

We have a few tools that we use internally to scrub RGW buckets.
You can find the code at
https://github.com/Flipkart/ceph-tools/tree/scrub/scrub
These are Python based and trust that the bucket index is sane and
contains valid entries (we periodically run a bucket index check).
As mentioned in previous mails, we are in the process of adding these
tools to the RGW code base.
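For context, the core of such an off-line check is simple: take the
object keys listed in the bucket index and stat each backing RADOS
object in the data pool. A rough Python sketch follows (heavily
simplified: real RGW object naming, multipart/shadow objects, and the
index-listing step are not handled; the helper names are made up, only
the librados calls are real):

```python
def find_missing(index_keys, stat_fn, not_found_exc=KeyError):
    """Cross-check bucket-index keys against a stat callable.

    index_keys: object names taken from the bucket index.
    stat_fn: callable(name) -> metadata; raises not_found_exc when the
             object is gone (e.g. ioctx.stat raising rados.ObjectNotFound).
    Returns the keys listed in the index but absent from the data pool.
    """
    missing = []
    for key in index_keys:
        try:
            stat_fn(key)
        except not_found_exc:
            missing.append(key)
    return missing


def scrub_pool(conf_path, pool, index_keys):
    """Hypothetical driver: stat each indexed object via librados.

    Needs the python3-rados bindings and a reachable cluster; pool and
    key naming are simplified compared to the real RGW layout.
    """
    import rados  # deferred so find_missing() stays testable offline

    cluster = rados.Rados(conffile=conf_path)
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(pool)
        try:
            return find_missing(index_keys, ioctx.stat,
                                not_found_exc=rados.ObjectNotFound)
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()
```

A full read-back pass would additionally read each object (or compare
stored checksums) rather than just stat it, at correspondingly higher
I/O cost.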

Varada

On Thu, Apr 5, 2018 at 12:06 PM, Varada Kari (System Engineer)
<varadaraja.kari@xxxxxxxxxxxx> wrote:
> On Thu, Apr 5, 2018 at 4:33 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>> On Wed, Apr 4, 2018 at 8:18 AM, Yehuda Sadeh-Weinraub
>> <ysadehwe@xxxxxxxxxx> wrote:
>>> On Wed, Apr 4, 2018 at 12:55 AM, Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>>>> To conclude the meeting, Varada from Flipkart had a proposal to build
>>>> an “rgw scrub” which can be used in a lab (or live!) setting. He noted
>>>> that while RADOS has a very good scrubbing mechanism, there’s nothing
>>>> for RGW to ensure its higher-level data integrity, and that’s
>>>> distressing both in deployments and when making sure that tests of new
>>>> patches have correctly preserved data, rather than silently relying on
>>>> something in a caching layer.
>>>> --
>>>
>>> Right. It is something that we really miss. It could probably be tied
>>> to multisite DR, but not necessarily. For a single site we could
>>> pretty much only report that objects went bad (which is still useful,
>>> and would probably help with tracking down corruption issues). In a
>>> multi-zone environment we could use it to actually recover data.
>>
>> Well, there's more room than that. It's perfectly plausible to recover
>> an S3 object if we find it's present in the cluster but isn't in a
>> bucket index or a gc log, for instance. It's a lot more work, and it
>> has value in some scenarios, though I'm not sure how much relative to
>> other stuff. :) And I think the main use-case Varada had in mind was
>> validating changes they make in a local branch before deploying it to
>> production services.
>> -Greg
>
> It's not only to check the internal changes we make; we also had
> multiple problems with the cache tier. Because of OSD failures in the
> cache tier we had incomplete PGs and lost some objects. There is no
> way to find out what was lost unless we scrub/scan all the buckets by
> reading all the objects. That led us to write some off-line Python
> scripts that use rados to stat and read objects for faster
> validation. Will send a PR soon with the tools we have for this.
> Currently working on a prototype for the same inside radosgw, as an
> online tool.
>
> Varada
>
>
>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html



