On 09/05/17 21:18, Nix wrote:
> (Neil said: "Similarly a RAID6 with inconsistent P and Q could well
> not be able to identify a single block which is "wrong", and even if
> it could there is a small possibility that the identified block isn't
> wrong, but the other blocks are all inconsistent in such a way as to
> accidentally point to it. The probability of this is rather small,
> but it is non-zero." As far as I can tell the probability of this is
> exactly the same as that of multiple read errors in a single stripe
> -- possibly far lower, since you need not merely multiple wrong P and
> Q values but *precisely mis-chosen* ones. If that wasn't acceptably
> rare, you wouldn't be using RAID-6 to begin with.)

This, to me, is the crux of the argument. What is the probability of
CORRECTLY identifying a single-disk error? What is the probability of
WRONGLY mistaking a multi-disk error for a single-disk error? My gut
instinct is that the second scenario is far less likely.

So, in that case, the current setup DELIBERATELY CORRUPTS a
recoverable stripe because of the TINY risk that we might have got it
wrong. Picking probabilities at random, say the first probability is
99 in a hundred and the second is one in a thousand. On a four-disk
RAID-6, that means we throw away nearly a thousand chances of
recovering the correct data for every one occasion on which we avoid
mis-"correcting" it. To me that's an insane trade-off.

Neil goes on about "what if a write fails? What if the power goes
down? What if, what if?" Those are the wrong questions! The right
question is "can we tell the difference between a single-disk failure
and a multi-disk failure?". We don't care what *caused* the failure.

If the power goes down and only the first disk in a stripe was
written, we can roll it back to what it was. If only the last disk
failed to be written, we can roll it forward to what it should have
been. If at least two disks were written and at least two were not,
CAN WE DETECT THAT? Surely we can: it doesn't matter how many disks
were or weren't written, because in that scenario the P and Q
syndromes will not agree on any single block. In that case we give up
and report "corrupt data". Which is no different from what happens
today, except that today we "fix" the parity and pretend nothing is
wrong :-( And that is the real problem: we quietly rewrite the parity
even in the cases where we *could* have corrected the data, if we
could have been bothered.

So we have to write an mdfsck. Okay. So we have to make sure that no
filesystems on the array are mounted. Okay, that's a bit harder. So we
have to assume that sysadmins are sensible beings who don't screw
things up -- okay, that's a lot harder :-) But we shouldn't be
throwing away LOTS of data that is easy to recover because we MIGHT
"recover" data that is wrong.

Yes, yes, I know -- code welcome ... :-)

Cheers,
Wol
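
P.S. Since code is welcome: below is a minimal user-space sketch of
the single-vs-multi-disk test argued for above, following H. Peter
Anvin's "The mathematics of RAID-6" paper. It illustrates the
technique, it is not the kernel's code: the GF(2^8) arithmetic
(polynomial 0x11d, generator g = 2) matches what the kernel's raid6
code uses, but everything else -- classify_stripe(), its return
convention, the toy main() -- is made up for the example.

#include <stdint.h>
#include <stdio.h>

/* Discrete-log table for GF(2^8) mod x^8+x^4+x^3+x^2+1 (0x11d),
 * generator g = 2: the same field the kernel's raid6 code uses. */
static uint8_t gf_log[256];

static void gf_init(void)
{
    unsigned v = 1;
    for (int i = 0; i < 255; i++) {
        gf_log[v] = (uint8_t)i;
        v <<= 1;
        if (v & 0x100)
            v ^= 0x11d;
    }
}

/* Recompute P and Q from the data blocks and compare with what was
 * read.  Returns -1 if the stripe is consistent, 0..n-1 if exactly
 * one data block is bad, n if P is bad, n+1 if Q is bad, and -2 if
 * the damage cannot be pinned on a single block (the "give up and
 * report corrupt data" case). */
static int classify_stripe(uint8_t **d, int n, const uint8_t *p,
                           const uint8_t *q, size_t len)
{
    int bad = -1;

    for (size_t off = 0; off < len; off++) {
        uint8_t pc = 0, qc = 0;

        /* Horner's rule: Q = D0 + g.D1 + g^2.D2 + ... */
        for (int i = n - 1; i >= 0; i--) {
            qc = (uint8_t)((qc << 1) ^ ((qc & 0x80) ? 0x1d : 0));
            qc ^= d[i][off];
            pc ^= d[i][off];
        }

        uint8_t dp = pc ^ p[off], dq = qc ^ q[off];
        int cand;

        if (!dp && !dq)
            continue;           /* this byte position is fine */
        if (!dp)
            cand = n + 1;       /* only Q disagrees: Q itself is bad */
        else if (!dq)
            cand = n;           /* only P disagrees: P itself is bad */
        else {
            /* dQ/dP = g^z identifies the one bad data disk z */
            cand = (gf_log[dq] - gf_log[dp] + 255) % 255;
            if (cand >= n)
                return -2;      /* points past the last disk: multi-disk */
        }
        if (bad == -1)
            bad = cand;
        else if (bad != cand)
            return -2;          /* byte positions disagree: multi-disk */
    }
    /* Note: as Neil says, a multi-disk error can in principle still
     * masquerade as a consistent single-disk one; this test makes
     * that unlikely, not impossible. */
    return bad;
}

int main(void)
{
    enum { N = 4, LEN = 16 };
    uint8_t blk[N][LEN], p[LEN] = {0}, q[LEN] = {0}, *d[N];

    gf_init();
    for (int i = 0; i < N; i++) {
        d[i] = blk[i];
        for (int j = 0; j < LEN; j++)
            blk[i][j] = (uint8_t)(i * 37 + j);
    }
    for (int j = 0; j < LEN; j++) {     /* make a good P and Q */
        for (int i = N - 1; i >= 0; i--) {
            q[j] = (uint8_t)((q[j] << 1) ^ ((q[j] & 0x80) ? 0x1d : 0));
            q[j] ^= blk[i][j];
            p[j] ^= blk[i][j];
        }
    }
    blk[2][5] ^= 0xa5;                  /* silently corrupt disk 2 */
    printf("bad block: %d\n", classify_stripe(d, N, p, q, LEN));
    return 0;
}

The two -2 paths are exactly the "at least two written, at least two
not" case above: once more than one block is damaged, the byte
positions stop agreeing on a single culprit, and the only honest
answer is "corrupt data".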