Re: detection/correction of corruption with raid6

Neil Brown <neilb@xxxxxxx> · Fri, 19 Dec 2008 15:39:39 +1100

On Wednesday December 17, piergiorgio.sartor@xxxxxxxx wrote:
> On Tue, 2008-12-16 at 23:25 +0100, Redeeman wrote:
> [...]
> > > Why a RAID system might have inconsistencies?
> > > Why do we have a "check" command at all, to run weekly or monthly?
> > As previously stated in discussion, while most bitflips etc does not
> > happen on disk(apparently), they do happen, whether its in ram, pci,
> > controller etc...
> 
> Ah! You spoiled it! :-)
> 
> Actually I was waiting for an answer from Neil Brown.
> 
> Because I'm under the impression that if it is not the HD,
> it does not count... See below...

Suppose we agree that bit flips don't happen (undetected) on drive
media, but that bit flips can happen elsewhere (memory, IO bus,
etc).

And then suppose we discover that a bit-flip has happened.  What does
that tell us?
Maybe it tells us that our hardware is dodgy.  So it cannot be
trusted to reliably do anything we tell it.  So maybe we shouldn't
tell it to do anything??

And when we find a corruption, we clearly cannot know if it is corrupt
on disk (a previous write went bad) or just in memory (e.g. a recent
read was bad).
In the latter case, writing anything to disk is probably the wrong
thing to do.  In the former case it might be a good thing to do - if
we can be fairly sure that the error happens very rarely.
And of course we cannot know if it was due to a bad read or a bad
write.  So the safe course is to not write anything to disk.

Where does that leave us?

About the only thing that makes sense is to always read all the blocks
in a stripe, and to perform a consistency test before responding to
any read request.  If an inconsistency is found, we log what we know,
and only return data if we have some reason to believe something is
still valid (e.g. a majority vote for raid1).
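
To make the raid1 case concrete, here is a rough sketch in Python of
a read that checks all mirrors first.  It is only an illustration, not
md code: the function name and the in-memory list of copies are
assumptions made up for the example.

```python
from collections import Counter

def raid1_checked_read(copies):
    """Hypothetical helper: 'copies' holds the same block as read from
    each mirror.  Returns (data, consistent); data is None when no
    strict majority of mirrors agree."""
    counts = Counter(copies)
    block, votes = counts.most_common(1)[0]
    if votes == len(copies):
        return block, True      # all mirrors agree
    if votes * 2 > len(copies):
        return block, False     # inconsistent, but a majority agrees
    return None, False          # no majority: nothing we can trust
```

Note that with only two mirrors a mismatch never produces a majority,
so such a read could only fail.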

And for raid5/6, a write would require:
  read whole stripe
  check consistency
  copy in new data
  update parity
  write out changed blocks
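
The steps above can be sketched for the single-parity (raid5) case
roughly as follows.  This is a toy model under stated assumptions, not
how md is implemented: blocks are equal-length in-memory byte strings,
parity is a plain XOR, and all names are invented for the example.
raid6 would need the second (Q) syndrome as well.

```python
from functools import reduce

class StripeInconsistent(Exception):
    """Raised when stored parity does not match the data blocks."""

def xor_blocks(blocks):
    # Byte-wise XOR across equal-length blocks.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def checked_write(data_blocks, parity, index, new_data):
    """Hypothetical checked write; 'data_blocks' and 'parity' stand in
    for the whole stripe having already been read."""
    # check consistency; refuse to write anything on a mismatch
    if xor_blocks(data_blocks) != parity:
        raise StripeInconsistent("parity mismatch, not writing")
    # copy in new data
    updated = list(data_blocks)
    updated[index] = new_data
    # update parity
    new_parity = xor_blocks(updated)
    # only updated[index] and new_parity would be written out
    return updated, new_parity
```

Every write pays for a full stripe read plus a parity check before
any block goes to disk.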

This is going to be a substantial slowdown.

And does it really increase your data security?  Or is it like putting
a lock on your front door but not on your back door?

I guess it would provide some protection against low-frequency errors
in the controller/cable/drive.

But given the high cost and the fairly low value, I wonder how many
people would really use it....

> 
> Final point. More or less one year ago the same topic popped up,
> with similar discussion.
> At the end of the thread someone was asking if patches are
> accepted in order to implement this feature.
> I could not find any answer to that question in the archive.
> 
> What is the idea? Are patches accepted? Rejected by default?

By default, patches are reviewed and discussed.  If they then get
revised and tested and appear to be sensible and useful they will
probably get accepted eventually.

A change of this magnitude would almost certainly need to go through
several iterations of revision and have substantial testing before
being accepted.

NeilBrown