I am working for a customer right now who
is considering Red Hat Storage Server. One of the sought-after features
there is bit-rot detection, and even better, (semi-automatic) bit-rot
restoration.
I know that neither RHSS nor GlusterFS has this kind of
functionality at the moment (correct me if I'm wrong!), but I would like
to propose a design for this as a sort of translator that can be
stacked on, e.g., a (geo-)replication translator.
Bit-rot detection can be done through check-summing.
It should be a very low priority job running on one of the bricks. The
job walks the complete file system and, per file, calculates the
check-sum and compares it with the stored check-sum. If no check-sum is
stored yet, the file has not been checked before, and the job stores the
fresh check-sum on all involved bricks.
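
To make the scrub job a bit more concrete, here is a rough sketch in Python. It is not GlusterFS code: the function names, the choice of SHA-256 and the plain dict standing in for the check-sum store are all my own assumptions, and it ignores the question of how to tell a legitimate write from rot.

```python
import hashlib
import os


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scrub(root, stored):
    """Walk `root` and compare each file against `stored` (path -> digest).

    `stored` is a hypothetical stand-in for wherever the check-sums live.
    Files seen for the first time are recorded; files whose digest differs
    from the recorded one are returned as suspects.
    """
    suspects = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            current = file_checksum(path)
            known = stored.get(path)
            if known is None:
                stored[path] = current   # never checked before: record it
            elif known != current:
                suspects.append(path)    # mismatch: needs a closer look
    return suspects
```

A first call to `scrub(root, stored)` on an empty store only records check-sums and returns an empty list; later calls report every file whose content no longer matches.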
Bit-rot restoration could be implemented by
comparing the check-sums of the replicas. If there is a mismatch, a more
thorough check must be performed, like running a check-sum on all
replicas for that file again, doing a bit-wise compare, or whatever. If
the files are still the same, the check-sum(s) must be replaced. If not,
actual bit-rot has been detected. Now what to do? Which replica holds
the clean version (the truth)? With an odd number of replicas one
could simply make it a democratic process and have it fully automated.
It should, however, save the version that is to be replaced in a separate
store and notify the admin for verification. Another method would be to
just notify the admin and do nothing.
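
The "democratic" variant could look roughly like the Python sketch below. The brick names and the function names are made up for illustration. It only decides which replicas are outvoted; saving the old version and notifying the admin are left out.

```python
from collections import Counter


def pick_clean_checksum(replica_checksums):
    """Return the check-sum held by a strict majority of replicas, or None.

    `replica_checksums` maps a (hypothetical) brick name to the check-sum
    of its copy of the file.
    """
    most_common = Counter(replica_checksums.values()).most_common(1)
    if not most_common:
        return None
    checksum, votes = most_common[0]
    # Require more than half of the replicas to agree.
    if votes * 2 > len(replica_checksums):
        return checksum
    return None


def classify(replica_checksums):
    """Return (winning check-sum, sorted list of bricks that disagree).

    If there is no strict majority, return (None, []) so that the caller
    can fall back to notifying the admin instead of repairing anything.
    """
    winner = pick_clean_checksum(replica_checksums)
    if winner is None:
        return None, []
    outvoted = sorted(b for b, c in replica_checksums.items() if c != winner)
    return winner, outvoted
```

For example, `classify({"brick1": "aa", "brick2": "aa", "brick3": "bb"})` returns `("aa", ["brick3"])`, while two replicas that disagree give `(None, [])`, which is the case where only the admin can decide.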
The obvious place to store the check-sums would be in the extended attributes, but one could use a database for it.
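
For the extended-attribute option, a minimal sketch in Python on Linux could be as follows. The attribute name `user.bitrot.sha256` is purely my own invention, not something GlusterFS defines, and the brick file system must have user extended attributes enabled for this to work.

```python
import errno
import os

# Hypothetical attribute name, chosen only for this example.
XATTR_NAME = "user.bitrot.sha256"


def store_checksum(path, checksum):
    """Save the check-sum string in an extended attribute of the file."""
    os.setxattr(path, XATTR_NAME, checksum.encode("ascii"))


def load_checksum(path):
    """Return the stored check-sum, or None if the file has none yet."""
    try:
        return os.getxattr(path, XATTR_NAME).decode("ascii")
    except OSError as err:
        if err.errno == errno.ENODATA:
            return None   # attribute not set: file was never checked
        raise
```

The nice thing about this approach is that the check-sum travels with the file; a database would need to be kept in sync separately.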
I have watched the presentation "Red Hat Summit 2012 - A Deep Dive Into Red Hat Storage" by
Jeff Darcy, and I know he (and Red Hat) are very keen on extending the
number of translators with useful functionality. I am no programmer
myself, but I would like to get involved in this kind of stuff.
Comments are very welcome!
Fred
PS: This was already posted on the user list, but I was advised to post it on the devel list.