On 9 May 2017, David Brown uttered the following:

> On 09/05/17 11:53, Nix wrote:
>> This turns out not to be the case. See this ten-year-old paper:
>> <https://indico.cern.ch/event/13797/contributions/1362288/attachments/115080/163419/Data_integrity_v3.pdf>.
>> Five weeks of doing 2GiB writes on 3000 nodes once every two hours
>> found, they estimated, 50 errors possibly attributable to disk problems
>> (sector- or page-size regions of corrupted data) on 1/30th of their
>> nodes. This is *not* rare and it is hard to imagine that 1/30th of disks
>> used by CERN deserve discarding. It is better to assume that drives
>> misdirect writes now and then, and to provide a means of recovering from
>> them that does not take days of panic. RAID-6 gives you that means: md
>> should use it.
>
> RAID-6 does not help here. You have to understand the types of errors
> that can occur, the reasons for them, the possibilities for detection,
> the possibilities for recovery, and what the different layers in the
> system can do about them.
>
> RAID (1/5/6) will let you recover from one or more known failed reads,
> on the assumption that the driver firmware is correct, memories have no
> errors, buses have no errors, block writes are atomic, write ordering
> matches the flush commands, block reads are either correct or marked as
> failed, etc.

I think you're being too pedantic. Many of these things are known not to
be true on real hardware, and at least one of them cannot possibly be
true without a journal (atomic block writes). Nonetheless, the md layer
is quite happy to rebuild after a failed disk even though the write hole
might have torn garbage into your data, on the grounds that it
*probably* did not. If your argument was used everywhere, md would never
have been started because 100% reliability was not guaranteed.

The same, it seems to me, is true of cases in which one drive in a
RAID-6 reports a few mismatched blocks.
It is true that you don't know the cause of the mismatches, but you *do*
know which bit of the mismatch is wrong and what data should be there,
subject only to the assumption that sufficiently few drives have made
simultaneous mistakes that redundancy is preserved. And that's the same
assumption RAID >0 makes all the time anyway! The only difference in the
disk-failure case is that you know that one drive has failed without
needing to ask other drives to be sure.

I mean, yeah, *possibly* in the RAID-6 mismatch case *five* drives have
gone simultaneously wrong in such a way that their syndromes all match
and the one surviving drive is mistakenly misrepaired, but frankly you'd
need to wait for black holes to evaporate of old age before this became
an issue.

(I'm not suggesting repairing RAID-5 mismatches. That's clearly
impossible. You can't even tell what disk is affected. But in the RAID-6
case none of this is impossible, or so it seems to me. You have at least
three and probably four or more drives with consistent syndromes, and
one that is out of whack. You know which one must be wrong -- the
"minority vote" -- and you know what has to be done to make it
consistent with the others again. Why not do it? It's no more risky than
that aspect of a RAID rebuild from a failed disk would be.)

> RAID will /not/ let you reliably detect or correct other sorts of
> errors.

... only it clearly can. What stops it from handling the
RAID-6-and-one-disk-is-wrong case when it can handle the
RAID-6-and-one-disk-has-failed case, given that you can unambiguously
determine which disk is wrong using the data on the surviving drives,
with an undetected-failure probability of something way below 2^-128?
(I could work out the actual value but I haven't had any coffee yet and
it seems pointless when it's that low.)

> What does /not/ work, however, is trying to squeeze magic capabilities
> out of existing layers in the system, or expecting more out of them that
> they can give.

I don't see that these capabilities are any more magic than what RAID-6
does already. It can recover from two failed drives: why can't it
recover from one wrong one?

(Or, rather, from one drive with very occasionally wrong sectors on it.
Obviously if it was always getting things wrong its presence is not a
benefit and you have essentially fallen back to nothing better than
RAID-5, only with worse performance. But that's what error thresholds
are for, which md already employs in similar situations.)

--
NULL && (void)
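
P.S. To make the "minority vote" concrete, here is a minimal sketch of
how a single silently-wrong disk can be located from the P and Q
syndromes. It is an illustration only, not md's code: it assumes the
usual RAID-6 construction over GF(2^8) with polynomial 0x11d and
generator 2, works on one byte position at a time, and the function
names (syndromes, locate_bad_disk, repair_byte) are made up for this
example.

```python
# GF(2^8) log/antilog tables. Assumption: polynomial 0x11d, generator 2.
EXP = [0] * 510
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 510):
    EXP[i] = EXP[i - 255]  # doubled table so gf_mul needs no modulo


def gf_mul(a, b):
    """Multiply two field elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def syndromes(data):
    """Return (P, Q) for one byte position across the data disks."""
    p = q = 0
    for z, d in enumerate(data):
        p ^= d                    # P is plain XOR parity
        q ^= gf_mul(EXP[z], d)    # Q weights disk z by g^z
    return p, q


def locate_bad_disk(data, p_stored, q_stored):
    """Return 'ok', 'P', 'Q', a data disk index, or None if the
    mismatch cannot be explained by a single wrong disk."""
    p_calc, q_calc = syndromes(data)
    dp = p_calc ^ p_stored
    dq = q_calc ^ q_stored
    if dp == 0 and dq == 0:
        return "ok"
    if dq == 0:
        return "P"   # only P disagrees: the P block itself is wrong
    if dp == 0:
        return "Q"   # only Q disagrees: the Q block itself is wrong
    # One data disk z off by e gives dp = e and dq = g^z * e,
    # so log(dq) - log(dp) recovers z.
    z = (LOG[dq] - LOG[dp]) % 255
    return z if z < len(data) else None


def repair_byte(data, p_stored, q_stored):
    """Return a copy of data with the implicated data disk corrected;
    the copy is unchanged in every other case."""
    fixed = list(data)
    where = locate_bad_disk(data, p_stored, q_stored)
    if isinstance(where, int):
        fixed[where] ^= syndromes(data)[0] ^ p_stored
    return fixed
```

A real implementation would want every mismatching byte in the stripe
to point at the same disk before touching anything, and would apply the
error thresholds mentioned above; a None result here means "more than
one thing is wrong, leave it alone".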