On 09/05/17 21:18, Nix wrote:
> (Neil said: "Similarly a RAID6 with inconsistent P and Q could well
> not be able to identify a single block which is "wrong", and even if
> it could there is a small possibility that the identified block isn't
> wrong, but the other blocks are all inconsistent in such a way as to
> accidentally point to it. The probability of this is rather small,
> but it is non-zero." As far as I can tell the probability of this is
> exactly the same as that of multiple read errors in a single stripe
> -- possibly far lower, since you need not merely multiple wrong P and
> Q values but *precisely mis-chosen* ones. If that wasn't acceptably
> rare, you wouldn't be using RAID-6 to begin with.)

This, to me, is the crux of the argument. What is the probability of
CORRECTLY identifying a single-disk error? What is the probability of
WRONGLY mistaking a multi-disk error for a single-disk error? My gut
instinct is that the second scenario is far less likely.

So, in that case, the current setup DELIBERATELY CORRUPTS a
recoverable stripe because of the TINY risk that we might have got it
wrong. Picking probabilities at random, say the first probability is
99 in a hundred and the second is one in a thousand. On a four-disk
RAID-6, that means we throw away nearly a thousand chances of
recovering the correct data for every one occasion on which we avoid
mis-"correcting" it. To me that's an insane trade-off.

Neil goes on about "what if a write fails? What if the power goes
down? What if, what if?" Those are the wrong questions! The right
question is "can we tell the difference between a single-disk failure
and a multi-disk failure?". We don't care what *caused* the failure.

If the power goes down and only the first disk in a stripe was
written, we can roll it back to what it was. If only the last disk
failed to be written, we can roll it forward to what it should have
been. If at least two disks were written and at least two were not,
CAN WE DETECT THAT? Surely we can: it doesn't matter how many disks
were or weren't written, because in that scenario the P and Q
syndromes will not agree on any single block. In that case we give up
and report "corrupt data". Which is no different from what happens
today, except that today we "fix" the parity and pretend nothing is
wrong :-( And that is the real problem: we quietly rewrite the parity
even in the cases where we *could* have corrected the data, if we
could have been bothered.

So we have to write an mdfsck. Okay. So we have to make sure that no
filesystems on the array are mounted. Okay, that's a bit harder. So we
have to assume that sysadmins are sensible beings who don't screw
things up -- okay, that's a lot harder :-) But we shouldn't be
throwing away LOTS of data that is easy to recover because we MIGHT
"recover" data that is wrong.

Yes, yes, I know -- code welcome ... :-)

Cheers,
Wol
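
P.S. Since code is welcome: below is a minimal user-space sketch of
the single-vs-multi-disk test argued for above, following H. Peter
Anvin's "The mathematics of RAID-6" paper. It illustrates the
technique, it is not the kernel's code: the GF(2^8) arithmetic
(polynomial 0x11d, generator g = 2) matches what the kernel's raid6
code uses, but everything else -- classify_stripe(), its return
convention, the toy main() -- is made up for the example.

#include <stdint.h>
#include <stdio.h>

/* Discrete-log table for GF(2^8) mod x^8+x^4+x^3+x^2+1 (0x11d),
 * generator g = 2: the same field the kernel's raid6 code uses. */
static uint8_t gf_log[256];

static void gf_init(void)
{
    unsigned v = 1;
    for (int i = 0; i < 255; i++) {
        gf_log[v] = (uint8_t)i;
        v <<= 1;
        if (v & 0x100)
            v ^= 0x11d;
    }
}

/* Recompute P and Q from the data blocks and compare with what was
 * read.  Returns -1 if the stripe is consistent, 0..n-1 if exactly
 * one data block is bad, n if P is bad, n+1 if Q is bad, and -2 if
 * the damage cannot be pinned on a single block (the "give up and
 * report corrupt data" case). */
static int classify_stripe(uint8_t **d, int n, const uint8_t *p,
                           const uint8_t *q, size_t len)
{
    int bad = -1;

    for (size_t off = 0; off < len; off++) {
        uint8_t pc = 0, qc = 0;

        /* Horner's rule: Q = D0 + g.D1 + g^2.D2 + ... */
        for (int i = n - 1; i >= 0; i--) {
            qc = (uint8_t)((qc << 1) ^ ((qc & 0x80) ? 0x1d : 0));
            qc ^= d[i][off];
            pc ^= d[i][off];
        }

        uint8_t dp = pc ^ p[off], dq = qc ^ q[off];
        int cand;

        if (!dp && !dq)
            continue;           /* this byte position is fine */
        if (!dp)
            cand = n + 1;       /* only Q disagrees: Q itself is bad */
        else if (!dq)
            cand = n;           /* only P disagrees: P itself is bad */
        else {
            /* dQ/dP = g^z identifies the one bad data disk z */
            cand = (gf_log[dq] - gf_log[dp] + 255) % 255;
            if (cand >= n)
                return -2;      /* points past the last disk: multi-disk */
        }
        if (bad == -1)
            bad = cand;
        else if (bad != cand)
            return -2;          /* byte positions disagree: multi-disk */
    }
    /* Note: as Neil says, a multi-disk error can in principle still
     * masquerade as a consistent single-disk one; this test makes
     * that unlikely, not impossible. */
    return bad;
}

int main(void)
{
    enum { N = 4, LEN = 16 };
    uint8_t blk[N][LEN], p[LEN] = {0}, q[LEN] = {0}, *d[N];

    gf_init();
    for (int i = 0; i < N; i++) {
        d[i] = blk[i];
        for (int j = 0; j < LEN; j++)
            blk[i][j] = (uint8_t)(i * 37 + j);
    }
    for (int j = 0; j < LEN; j++) {     /* make a good P and Q */
        for (int i = N - 1; i >= 0; i--) {
            q[j] = (uint8_t)((q[j] << 1) ^ ((q[j] & 0x80) ? 0x1d : 0));
            q[j] ^= blk[i][j];
            p[j] ^= blk[i][j];
        }
    }
    blk[2][5] ^= 0xa5;                  /* silently corrupt disk 2 */
    printf("bad block: %d\n", classify_stripe(d, N, p, q, LEN));
    return 0;
}

The two -2 paths are exactly the "at least two written, at least two
not" case above: once more than one block is damaged, the byte
positions stop agreeing on a single culprit, and the only honest
answer is "corrupt data".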