Re: BitRot notes

Venky Shankar <yknev.shankar@xxxxxxxxx> · Tue, 9 Dec 2014 14:52:42 +0530

On Tue, Dec 9, 2014 at 1:41 PM, Deepak Shetty <dpkshetty@xxxxxxxxx> wrote:
> We can use bitrot to provide a 'health' status for gluster volumes.
> Hence I would like to propose (from an upstream/community perspective) the
> notion of 'health' status (as part of gluster volume info) which can derive
> its value from:
>
> 1) Bitrot
>     If any files are corrupted and bitrot is yet to repair them, and/or it's a
> signal to the admin to do some manual operation to repair the corrupted files
> (for cases where we only detect, not correct)
>
> 2) brick status
>     Depending on brick offline/online
>
> 3) AFR status
>     Whether we have all copies in sync or not

This makes sense. Having a notion of "volume health" helps in choosing
intelligently from a list of volumes.

>
> This, I believe, is on similar lines to what Ceph does today (health status:
> OK, WARN, ERROR)

Yes, Ceph derives those notions from PGs. Gluster can do it for
replicas and/or files marked by bitrot scrubber.

> The health status derivation can be pluggable, so that in future more
> components can be added to query for the composite health status of the
> gluster volume.
>
> In all of the above cases, as long as data can be served by the gluster
> volume reliably gluster volume status will be Started/Available, but Health
> status can be 'degraded' or 'warn'

WARN may be too strict, but something lenient enough yet descriptive
should be chosen. Ceph does it pretty well:
http://ceph.com/docs/master/rados/operations/monitoring-osd-pg/
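
To make the pluggable derivation concrete, here is a minimal sketch of a
composite health status built from independent checks. Everything in it is
an assumption made for illustration: the state names (OK, DEGRADED, ERROR),
the registry, and the dictionary keys (corrupted_files, bricks_online,
pending_heals) are hypothetical and are not part of any Gluster API. Treating
"all bricks offline" as ERROR is also just one possible policy.

```python
from enum import IntEnum
from typing import Callable, Dict


class Health(IntEnum):
    """Ordered so that a larger value means a worse state."""
    OK = 0
    DEGRADED = 1
    ERROR = 2


# A check takes a volume description and returns a Health value.
Check = Callable[[dict], Health]

# Registry of pluggable checks (hypothetical; not a Gluster API).
CHECKS: Dict[str, Check] = {}


def register(name: str):
    """Decorator that adds a check to the registry under the given name."""
    def decorator(fn: Check) -> Check:
        CHECKS[name] = fn
        return fn
    return decorator


@register("bitrot")
def bitrot_check(vol: dict) -> Health:
    # Files flagged by the scrubber and not yet repaired degrade health.
    return Health.DEGRADED if vol.get("corrupted_files", 0) > 0 else Health.OK


@register("bricks")
def brick_check(vol: dict) -> Health:
    bricks = vol.get("bricks_online", [])
    if bricks and not any(bricks):
        return Health.ERROR  # assumed policy: no brick online at all
    return Health.OK if all(bricks) else Health.DEGRADED


@register("afr")
def afr_check(vol: dict) -> Health:
    # Pending heals mean the replicas are not all in sync.
    return Health.DEGRADED if vol.get("pending_heals", 0) > 0 else Health.OK


def volume_health(vol: dict) -> Health:
    """Composite health is the worst result reported by any check."""
    return max((check(vol) for check in CHECKS.values()), default=Health.OK)
```

A new component would only need to register one more function; the
composite stays "worst of all checks", independent of whether the volume
itself is Started/Available.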

>
> This has many uses:
>
> 1) It helps provide indication to the admin that something is amiss and he
> can check based on:
> bitrot / scrub status
> brick status
> AFR status
>
> and take necessary action
>
> 2) It helps mgmt applications (OpenStack, for example) make an intelligent
> decision based on the health status (whether or not to pick this gluster
> volume for this create volume operation), so it acts as a coarse-level filter
>
> 3) In general it gives user an idea of the health of the volume (which is
> different than the availability status (whether or not volume can serve
> data))
> For eg: If we have a pure DHT volume, and bitrot detects silent file
> corruption (and we are not auto correcting) having Gluster volume status as
> available/started isn't entirely correct !

+1

>
> thanx,
> deepak
>
>
> On Fri, Dec 5, 2014 at 11:31 PM, Venky Shankar <yknev.shankar@xxxxxxxxx>
> wrote:
>>
>> On Fri, Nov 28, 2014 at 10:00 PM, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:
>> > On 11/28/2014 08:30 AM, Venky Shankar wrote:
>> >>
>> >> [snip]
>> >>>
>> >>>
>> >>> 1. Can the bitd be one per node like self-heal-daemon and other
>> >>> "global"
>> >>> services? I worry about creating 2 * N processes for N bricks in a
>> >>> node.
>> >>> Maybe we can consider having one thread per volume/brick etc. in a
>> >>> single
>> >>> bitd process to make it perform better.
>> >>
>> >>
>> >> Absolutely.
>> >> There would be one bitrot daemon per node, per volume.
>> >>
>> >
>> > Do you foresee any problems in having one daemon per node for all
>> > volumes?
>>
>> Not technically :). Probably that's a nice thing to do.
>>
>> >
>> >>
>> >>>
>> >>> 3. I think the algorithm for checksum computation can vary within the
>> >>> volume. I see a reference to "Hashtype is persisted along side the
>> >>> checksum
>> >>> and can be tuned per file type." Is this correct? If so:
>> >>>
>> >>> a) How will the policy be exposed to the user?
>> >>
>> >>
>> >> Bitrot daemon would have a configuration file that can be configured
>> >> via Gluster CLI. Tuning hash types could be based on file types or
>> >> file name patterns (regexes) [which is a bit tricky as bitrot would
>> >> work on GFIDs rather than filenames, but this can be solved by a level
>> >> of indirection].
>> >>
>> >>>
>> >>> b) It would be nice to have the algorithm for computing checksums be
>> >>> pluggable. Are there any thoughts on pluggability?
>> >>
>> >>
>> >> Do you mean the default hash algorithm be configurable? If yes, then
>> >> that's planned.
>> >
>> >
>> > Sounds good.
>> >
>> >>
>> >>>
>> >>> c) What are the steps involved in changing the hashtype/algorithm for
>> >>> a
>> >>> file?
>> >>
>> >>
>> >> Policy changes for file {types, patterns} are lazy, i.e., taken into
>> >> effect during the next recompute. For objects that are never modified
>> >> (after initial checksum compute), scrubbing can recompute the checksum
>> >> using the new hash _after_ verifying the integrity of a file with the
>> >> old hash.
>> >
>> >
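
The lazy policy change described above (verify with the old hash, then
recompute with the new one) could be sketched roughly as follows. This is an
illustration under assumptions, not the bitrot daemon's code: the function
name and its arguments are hypothetical, the object is assumed to fit in
memory, and Python's hashlib stands in for whatever hash implementations the
daemon would actually use.

```python
import hashlib


def scrub_and_migrate(data: bytes, stored_algo: str, stored_digest: str,
                      new_algo: str):
    """Verify data against the stored checksum, then re-sign with new_algo.

    Returns the (algo, digest) pair to persist alongside the object.
    Raises ValueError if the data no longer matches the stored checksum,
    in which case the checksum must NOT be recomputed.
    """
    current = hashlib.new(stored_algo, data).hexdigest()
    if current != stored_digest:
        raise ValueError("checksum mismatch: object may be corrupted")
    if stored_algo == new_algo:
        # Policy unchanged for this object; nothing to migrate.
        return stored_algo, stored_digest
    # Integrity confirmed with the old hash, so re-signing is safe.
    return new_algo, hashlib.new(new_algo, data).hexdigest()
```

The ordering matters: recomputing with the new hash before verifying with
the old one would silently bless an already corrupted object.
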
>> >>
>> >>>
>> >>> 4. Is the fop on which change detection gets triggered configurable?
>> >>
>> >>
>> >> As of now all data modification fops trigger checksum calculation.
>> >>
>> >
>> > Wish I was more clear on this in my OP. Is the fop on which checksum
>> > verification/bitrot detection happens configurable? The feature page
>> > talks
>> > about "open" being a trigger point for this. Users might want to trigger
>> > detection on a "read" operation and not on open. It would be good to
>> > provide
>> > this flexibility.
>>
>> Ah! ok. As of now it's mostly open() and read(). Inline verification
>> would be "off" by default due to obvious reasons.
>>
>> >
>> >>
>> >>>
>> >>> 6. Any thoughts on integrating the bitrot repair framework with
>> >>> self-heal?
>> >>
>> >>
>> >> There are some thoughts on integration with self-heal daemon and EC.
>> >> I'm coming up with a doc which covers those [reason for delay in
>> >> replying to your questions ;)]. Expect the doc in gluster-devel@
>> >> soon.
>> >
>> >
>> > Will look forward to this.
>> >
>> >>
>> >>>
>> >>> 7. How does detection figure out that lazy updation is still pending
>> >>> and
>> >>> not
>> >>> raise a false positive?
>> >>
>> >>
>> >> That's one of the things that Rachana and I discussed yesterday.
>> >> Should scrubbing *wait* while checksum updating is still in progress or
>> >> is it expected that scrubbing happens when there are no active I/O
>> >> operations on the volume (both of which imply that bitrot daemon needs
>> >> to know when it's done its job).
>> >>
>> >> If both scrub and checksum updating go in parallel, then there needs
>> >> to be a way to synchronize those operations. Maybe, compute checksum on
>> >> priority which is provided by the scrub process as a hint (that leaves
>> >> a little window for rot though)?
>> >>
>> >> Any thoughts?
>> >
>> >
>> > Waiting for no active I/O in the volume might be a difficult condition
>> > to
>> > reach in some deployments.
>> >
>> > Some form of waiting is necessary to prevent false positives. One
>> > possibility might be to mark an object as dirty till checksum updation
>> > is
>> > complete. Verification/scrub can then be skipped for dirty objects.
>>
>> Makes sense. Thanks!
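
A rough sketch of the dirty-marking idea, to show how it avoids false
positives while a lazy checksum update is pending. All names here
(ObjectMeta, on_write, on_checksum_updated, scrub) are hypothetical, SHA-256
is an arbitrary choice, and a real implementation would persist the flag
with the object rather than keep it in memory.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ObjectMeta:
    digest: str = ""      # last checksum that was computed for the object
    dirty: bool = False   # True while a checksum update is still pending


def on_write(meta: ObjectMeta) -> None:
    """A data-modifying fop invalidates the stored checksum."""
    meta.dirty = True


def on_checksum_updated(meta: ObjectMeta, data: bytes) -> None:
    """Called once the lazy checksum computation has caught up."""
    meta.digest = hashlib.sha256(data).hexdigest()
    meta.dirty = False


def scrub(meta: ObjectMeta, data: bytes) -> str:
    """Return 'skipped', 'ok' or 'corrupted' for one object."""
    if meta.dirty:
        # Stored checksum is stale; comparing would be a false positive.
        return "skipped"
    if hashlib.sha256(data).hexdigest() == meta.digest:
        return "ok"
    return "corrupted"
```

The trade-off is the one already noted in the thread: an object that stays
dirty is not scrubbed, so there is a window in which rot goes undetected.
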
>>
>> >
>> > -Vijay
>> >
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel@xxxxxxxxxxx
>> http://supercolony.gluster.org/mailman/listinfo/gluster-devel
>
>