On Wed, May 10 2017, Wols Lists wrote:

> On 10/05/17 04:53, Chris Murphy wrote:
>>
>> The data is already corrupted by definition. No additional damage to
>> data is done. What does happen is good P and Q are replaced by bad P
>> and Q which matches the already bad data.
>
> Except, in my world, replacing good P & Q by bad P & Q *IS* doing
> additional damage! We can identify and fix the bad data. So why don't
> we? Throwing away good P & Q prevents us from doing that, and means we
> can no longer recover the good data!
>
>>
>> And nevertheless you have the very real problem that drives lie about
>> having committed data to stable media. And they reorder writes,
>> breaking the write order assumptions of things. And we have RMW
>> happening on live arrays. And that means you have a real likelihood
>> that you cannot absolutely determine, with the available information,
>> why P and Q don't agree with the data; you're still making probability
>> assumptions, and if that assumption is wrong any correction will
>> introduce more corruption.
>>
>> The only unambiguous way to do this has already been done and it's ZFS
>> and Btrfs. And a big part of why they can do what they do is because
>> they are copy on write. If you need to solve the problem of ambiguous
>> data strip integrity in relation to P and Q, then use ZFS. It's
>> production ready. If you are prepared to help test and improve things,
>> then you can look into the Btrfs implementation.
>
> So how come btrfs and ZFS can handle this, and md can't? Can't md use
> the same techniques? (Seriously, I don't know the answer. Security
> theater?

I don't actually know what, specifically, btrfs and ZFS do, so I cannot
say for certain. But I am far from convinced by what I know.

I come back to the same question I always come back to. Is there a
likely cause for a particular anomaly, and does a particular action
properly respond to that cause? I don't like addressing symptoms, I
like addressing causes.
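For reference, the P and Q being argued over are computed roughly like
this. This is my own toy sketch in Python, not md's implementation: P is
plain XOR parity, and Q is a Reed-Solomon syndrome over GF(2^8) with
generator g = 2, reduced by the 0x11d polynomial that Linux md uses.

```python
# Toy illustration of RAID6 parity math -- NOT md's actual code.

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the polynomial 0x11d."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return r

def gf_pow(a, e):
    """Repeated GF(2^8) multiplication: a**e."""
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def syndromes(data_blocks):
    """Bytewise P and Q over equal-length data blocks D_0..D_{n-1}:
    P = D_0 ^ D_1 ^ ...   Q = g^0*D_0 ^ g^1*D_1 ^ ...  (in GF(2^8))."""
    n = len(data_blocks[0])
    p, q = bytearray(n), bytearray(n)
    for i, block in enumerate(data_blocks):
        g_i = gf_pow(2, i)
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(g_i, byte)
    return bytes(p), bytes(q)
```

The point of having two independent syndromes is that they carry enough
information to rebuild any two missing blocks, which is why the thread
keeps coming back to what a P/Q mismatch can or cannot tell you.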
In the case of a resync after an unclean shutdown, if I find a stripe
in which P and Q are not consistent with the data, then a likely cause
is that some, but not all, blocks in a new stripe were written just
before the crash.

If the array is not degraded, it is likely that the data is all valid
and P and Q are not needed. So it makes sense to regenerate P and Q.
Other responses might also make sense, but they don't make *more*
sense. And regenerating P and Q is obvious and easy.

If the array is degraded and a data block is lost, there is no reliable
way to recover that block. So md refuses to start the array by default.

If you find an inconsistent data block during a scrub, then I have no
idea what could have caused that, so I cannot suggest anything
(actually I have lots of ideas, but most of them suggest you should
replace your hardware and test your backups). Maybe there is a way to
recover data, maybe there is no need. I cannot tell. raid6recover is a
tool that can be used by a sysadmin to explore options. Maybe not a
perfect tool, but it has some uses.

> But, like Nix, when I feel I'm being fed the answer "we're not going
> to give you the choice because we know better than you", I get
> cheesed off. If I get the answer "we're snowed under, do it yourself"
> then that is normal and acceptable.)

The main reason I have never implemented your idea of "validate every
block before reporting a successful read" is that I genuinely don't
think many people would use it. Writing code that won't be used is not
very rewarding.

The simple way to provide evidence to the contrary is to turn the
interest into cash. If 1000 people all give $10 to get it done, I
suspect we could make it happen.

>>
>> Otherwise I'm sure md and LVM folks have a feature list that
>> represents a few years of work as it is without yet another pile on.
>>
>>>
>>> Report the error, give the user the tools to fix it, and LET THEM
>>> sort it out. Just like we do when we run fsck on a filesystem.
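Coming back to the resync case described above, the decision being made
can be sketched roughly as follows. This is my own simplification with
hypothetical names, not md's code, and the parity here is a stand-in
checksum rather than real GF(2^8) arithmetic; the policy, not the math,
is the point.

```python
# Rough sketch of the post-crash resync policy -- NOT md's code.
from dataclasses import dataclass

def compute_pq(data):
    # Stand-in parity: P is real XOR parity, but Q here is just a
    # second checksum; real RAID6 computes Q in GF(2^8).
    p, q = 0, 0
    for i, d in enumerate(data):
        p ^= d
        q ^= (d << (i % 3)) & 0xff
    return p, q

@dataclass
class Stripe:
    data: list   # one int per data disk, for illustration
    p: int
    q: int

def resync_stripe(stripe, degraded, force=False):
    p, q = compute_pq(stripe.data)
    if (p, q) == (stripe.p, stripe.q):
        return "consistent"
    if not degraded:
        # Data is assumed valid after an unclean shutdown, so the
        # obvious and easy response is to regenerate parity from it.
        stripe.p, stripe.q = p, q
        return "parity regenerated"
    if not force:
        # A lost data block cannot be reliably recovered here, so
        # assembly is refused by default.
        raise RuntimeError("degraded and inconsistent: refusing to start")
    return "assembled at admin's risk"
```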
>>
>> They're not at all comparable. One is a file system, the other a raid
>> implementation, they have nothing in common.
>
> And what are file systems and raid implementations? They are both data
> store abstractions. They have everything in common.
>
> Oh and by the way, now I've realised my mistake, I've taken a look at
> the paper you mention. In particular, section 4. Yes it does say you
> can't detect and correct multi-disk errors - but that's not what we're
> asking for!
>
> By implication, it seems to be saying LOUD AND CLEAR that you CAN
> detect and correct a single-disk error. So why the blankety-blank
> won't md let you do that!
>
> Neil's point seems to be that it's a bad idea to do it automatically.
> I get his logic. But to then actively prevent you doing it manually -
> this is the paternalistic attitude that gets my goat.

I'm certainly not actively preventing you. I certainly wouldn't object
to a patch which reports the details of mismatches. I myself was never
motivated enough to write one. That might be inactively preventing
you, but not actively preventing you.

NeilBrown
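As a footnote to the "section 4" discussion above: a rough sketch of how
a single corrupt *data* disk can be located from a P/Q mismatch. This is
my own Python illustration of the idea (using the 0x11d polynomial md
uses), not md's code, and it assumes exactly one data disk is bad: then,
bytewise, dQ/dP equals g**z for the bad disk index z.

```python
# Hedged sketch of single-disk error location from P/Q -- NOT md's code.

GF_POLY = 0x11d  # the reduction polynomial Linux md uses for RAID6

def _build_tables():
    exp, log = [0] * 512, [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        x <<= 1
        if x & 0x100:
            x ^= GF_POLY
    for i in range(255, 512):
        exp[i] = exp[i - 255]
    return exp, log

EXP, LOG = _build_tables()

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def compute_pq(disks):
    """Bytewise P = xor of data, Q = xor of g^z * data in GF(2^8)."""
    n = len(disks[0])
    p, q = bytearray(n), bytearray(n)
    for z, d in enumerate(disks):
        g_z = EXP[z % 255]
        for j, byte in enumerate(d):
            p[j] ^= byte
            q[j] ^= gf_mul(g_z, byte)
    return p, q

def locate_bad_disk(disks, stored_p, stored_q):
    """Return the index of a single corrupt data disk, or None."""
    p, q = compute_pq(disks)
    candidates = set()
    for dp, dq in zip((a ^ b for a, b in zip(p, stored_p)),
                      (a ^ b for a, b in zip(q, stored_q))):
        if dp == 0 and dq == 0:
            continue            # this byte column is clean
        if dp == 0 or dq == 0:
            return None         # P-only or Q-only mismatch: not a data disk
        # If disk z holds error e, then dp == e and dq == g^z * e,
        # so z is the discrete log of dq/dp.
        candidates.add((LOG[dq] - LOG[dp]) % 255)
    if len(candidates) == 1:
        return candidates.pop()
    return None                 # clean stripe, or multi-disk damage
```

If every mismatching byte column names the same z, the corrupt block can
be repaired by XORing dP back into it; an inconsistent z is exactly the
multi-disk case the paper says cannot be corrected.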