On Wed, May 10 2017, Wols Lists wrote:

> On 10/05/17 04:53, Chris Murphy wrote:
>>
>> The data is already corrupted by definition. No additional damage to
>> data is done. What does happen is good P and Q are replaced by bad P
>> and Q which matches the already bad data.
>
> Except, in my world, replacing good P & Q by bad P & Q *IS* doing
> additional damage! We can identify and fix the bad data. So why don't
> we? Throwing away good P & Q prevents us from doing that, and means we
> can no longer recover the good data!
>
>>
>> And nevertheless you have the very real problem that drives lie about
>> having committed data to stable media. And they reorder writes,
>> breaking the write order assumptions of things. And we have RMW
>> happening on live arrays. And that means you have a real likelihood
>> that you cannot absolutely determine, with the available information,
>> why P and Q don't agree with the data; you're still making probability
>> assumptions, and if that assumption is wrong any correction will
>> introduce more corruption.
>>
>> The only unambiguous way to do this has already been done and it's ZFS
>> and Btrfs. And a big part of why they can do what they do is because
>> they are copy on write. If you need to solve the problem of ambiguous
>> data strip integrity in relation to P and Q, then use ZFS. It's
>> production ready. If you are prepared to help test and improve things,
>> then you can look into the Btrfs implementation.
>
> So how come btrfs and ZFS can handle this, and md can't? Can't md use
> the same techniques? (Seriously, I don't know the answer. Security
> theater?

I don't actually know what, specifically, btrfs and ZFS do, so I cannot
say for certain. But I am far from convinced by what I know.

I come back to the same question I always come back to. Is there a
likely cause for a particular anomaly, and does a particular action
properly respond to that cause? I don't like addressing symptoms, I
like addressing causes.
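For reference, the P and Q being argued over are computed roughly like
this. This is my own toy sketch in Python, not md's implementation: P is
plain XOR parity, and Q is a Reed-Solomon syndrome over GF(2^8) with
generator g = 2, reduced by the 0x11d polynomial that Linux md uses.

```python
# Toy illustration of RAID6 parity math -- NOT md's actual code.

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the polynomial 0x11d."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return r

def gf_pow(a, e):
    """Repeated GF(2^8) multiplication: a**e."""
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def syndromes(data_blocks):
    """Bytewise P and Q over equal-length data blocks D_0..D_{n-1}:
    P = D_0 ^ D_1 ^ ...   Q = g^0*D_0 ^ g^1*D_1 ^ ...  (in GF(2^8))."""
    n = len(data_blocks[0])
    p, q = bytearray(n), bytearray(n)
    for i, block in enumerate(data_blocks):
        g_i = gf_pow(2, i)
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(g_i, byte)
    return bytes(p), bytes(q)
```

The point of having two independent syndromes is that they carry enough
information to rebuild any two missing blocks, which is why the thread
keeps coming back to what a P/Q mismatch can or cannot tell you.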
In the case of a resync after an unclean shutdown, if I find a stripe
in which P and Q are not consistent with the data, then a likely cause
is that some, but not all, blocks in a new stripe were written just
before the crash.

If the array is not degraded, it is likely that the data is all valid
and P and Q are not needed. So it makes sense to regenerate P and Q.
Other responses might also make sense, but they don't make *more*
sense. And regenerating P and Q is obvious and easy.

If the array is degraded and a data block is lost, there is no reliable
way to recover that block. So md refuses to start the array by default.

If you find an inconsistent data block during a scrub, then I have no
idea what could have caused that, so I cannot suggest anything
(actually I have lots of ideas, but most of them suggest you should
replace your hardware and test your backups). Maybe there is a way to
recover data, maybe there is no need. I cannot tell. raid6recover is a
tool that can be used by a sysadmin to explore options. Maybe not a
perfect tool, but it has some uses.

> But, like Nix, when I feel I'm being fed the answer "we're not going
> to give you the choice because we know better than you", I get
> cheesed off. If I get the answer "we're snowed under, do it yourself"
> then that is normal and acceptable.)

The main reason I have never implemented your idea of "validate every
block before reporting a successful read" is that I genuinely don't
think many people would use it. Writing code that won't be used is not
very rewarding.

The simple way to provide evidence to the contrary is to turn the
interest into cash. If 1000 people all give $10 to get it done, I
suspect we could make it happen.

>>
>> Otherwise I'm sure md and LVM folks have a feature list that
>> represents a few years of work as it is without yet another pile on.
>>
>>>
>>> Report the error, give the user the tools to fix it, and LET THEM
>>> sort it out. Just like we do when we run fsck on a filesystem.
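Coming back to the resync case described above, the decision being made
can be sketched roughly as follows. This is my own simplification with
hypothetical names, not md's code, and the parity here is a stand-in
checksum rather than real GF(2^8) arithmetic; the policy, not the math,
is the point.

```python
# Rough sketch of the post-crash resync policy -- NOT md's code.
from dataclasses import dataclass

def compute_pq(data):
    # Stand-in parity: P is real XOR parity, but Q here is just a
    # second checksum; real RAID6 computes Q in GF(2^8).
    p, q = 0, 0
    for i, d in enumerate(data):
        p ^= d
        q ^= (d << (i % 3)) & 0xff
    return p, q

@dataclass
class Stripe:
    data: list   # one int per data disk, for illustration
    p: int
    q: int

def resync_stripe(stripe, degraded, force=False):
    p, q = compute_pq(stripe.data)
    if (p, q) == (stripe.p, stripe.q):
        return "consistent"
    if not degraded:
        # Data is assumed valid after an unclean shutdown, so the
        # obvious and easy response is to regenerate parity from it.
        stripe.p, stripe.q = p, q
        return "parity regenerated"
    if not force:
        # A lost data block cannot be reliably recovered here, so
        # assembly is refused by default.
        raise RuntimeError("degraded and inconsistent: refusing to start")
    return "assembled at admin's risk"
```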
>>
>> They're not at all comparable. One is a file system, the other a raid
>> implementation, they have nothing in common.
>
> And what are file systems and raid implementations? They are both data
> store abstractions. They have everything in common.
>
> Oh and by the way, now I've realised my mistake, I've taken a look at
> the paper you mention. In particular, section 4. Yes it does say you
> can't detect and correct multi-disk errors - but that's not what we're
> asking for!
>
> By implication, it seems to be saying LOUD AND CLEAR that you CAN
> detect and correct a single-disk error. So why the blankety-blank
> won't md let you do that!
>
> Neil's point seems to be that it's a bad idea to do it automatically.
> I get his logic. But to then actively prevent you doing it manually -
> this is the paternalistic attitude that gets my goat.

I'm certainly not actively preventing you. I certainly wouldn't object
to a patch which reports the details of mismatches. I myself was never
motivated enough to write one. That might be inactively preventing
you, but not actively preventing you.

NeilBrown
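As a footnote to the "section 4" discussion above: a rough sketch of how
a single corrupt *data* disk can be located from a P/Q mismatch. This is
my own Python illustration of the idea (using the 0x11d polynomial md
uses), not md's code, and it assumes exactly one data disk is bad: then,
bytewise, dQ/dP equals g**z for the bad disk index z.

```python
# Hedged sketch of single-disk error location from P/Q -- NOT md's code.

GF_POLY = 0x11d  # the reduction polynomial Linux md uses for RAID6

def _build_tables():
    exp, log = [0] * 512, [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x <<= 1
        if x & 0x100:
            x ^= GF_POLY
    for i in range(255, 512):
        exp[i] = exp[i - 255]
    return exp, log

EXP, LOG = _build_tables()

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def compute_pq(disks):
    """Bytewise P = xor of data, Q = xor of g^z * data in GF(2^8)."""
    n = len(disks[0])
    p, q = bytearray(n), bytearray(n)
    for z, d in enumerate(disks):
        g_z = EXP[z % 255]
        for j, byte in enumerate(d):
            p[j] ^= byte
            q[j] ^= gf_mul(g_z, byte)
    return p, q

def locate_bad_disk(disks, stored_p, stored_q):
    """Return the index of a single corrupt data disk, or None."""
    p, q = compute_pq(disks)
    candidates = set()
    for dp, dq in zip((a ^ b for a, b in zip(p, stored_p)),
                      (a ^ b for a, b in zip(q, stored_q))):
        if dp == 0 and dq == 0:
            continue            # this byte column is clean
        if dp == 0 or dq == 0:
            return None         # P-only or Q-only mismatch: not a data disk
        # If disk z holds error e, then dp == e and dq == g^z * e,
        # so z is the discrete log of dq/dp.
        candidates.add((LOG[dq] - LOG[dp]) % 255)
    if len(candidates) == 1:
        return candidates.pop()
    return None                 # clean stripe, or multi-disk damage
```

If every mismatching byte column names the same z, the corrupt block can
be repaired by XORing dP back into it; an inconsistent z is exactly the
multi-disk case the paper says cannot be corrected.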