On 08/04/2018 02:35 PM, Wols Lists wrote:
> On 04/08/18 17:10, donotcare@xxxxxxxxxxx wrote:
>> Lastly, is the raid6check program (from mdadm-4.0) considered a
>> safe/reliable way to repair mismatches on raid6 arrays?
>
> It's the ONLY way to do a guaranteed recovery from a mismatch. This is a
> historic thing ...
>
> On raid 5, if you have a mismatch, you can NOT recover - you have one
> extra bit of info namely parity, but two unknowns the faulty drive and
> the lost data. So "repair" on a raid 5 simply assumes that the parity is
> at fault - which to be honest is the usual case - and recalculates and
> rewrites it. If sods law says it's actually a data block that's dud,
> sorry your data is gone.
>
> Because of this, "repair" assumes the same is true for a raid 6 mismatch
> - it recalculates and rewrites the parity. And the chances are this
> is the correct thing to do.
>
> But it isn't always the correct thing to do. What raid6check does is it
> actually checks - because it has two extra bits of info, it can work out
> two unknowns namely the dud drive and what should be on it.

Assumption alert!

> Don't expect the current situation to change, because the devs say -
> quite reasonably - that ANY attempt to fix your drives without knowing
> why they got corrupted is seriously hazardous to your data, and once you
> know what went wrong you'll know how to fix it.

The algorithm that uses parity and syndrome to determine which disk is
wrong presumes that the stripe was originally written together, and that
all subsequent writes completed, including P & Q. Corruption can be due
to a variety of failures that make this no longer true. For example, a
write may succeed on some sectors of an update but not on others. If the
system stays up, those failed writes will kick the affected devices out
of the array, but the same thing can happen on a sudden power-off. MD
would have no way to know, and the raid6check algorithm will be
spectacularly wrong.

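A minimal sketch, in Python, of the P/Q syndrome check described above.
It is not raid6check's code: it only assumes the usual RAID-6 field
(GF(2^8), polynomial 0x11d, generator 2), and the names compute_pq and
locate_bad_disk are hypothetical, chosen for illustration.

```python
# GF(2^8) log/antilog tables (assumed: polynomial 0x11d, generator 2).
EXP = [0] * 510
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two field elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def compute_pq(data):
    """Recompute P and Q from the data blocks (what "repair" does)."""
    n = len(data[0])
    p, q = bytearray(n), bytearray(n)
    for idx, block in enumerate(data):
        coef = EXP[idx]  # g**idx for data disk idx
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(coef, byte)
    return bytes(p), bytes(q)


def locate_bad_disk(data, p, q):
    """Guess the single bad block: 'ok', 'P', 'Q', a data index, or 'unknown'.

    Only valid if exactly one block in the stripe is wrong.
    """
    p2, q2 = compute_pq(data)
    sp = bytes(a ^ b for a, b in zip(p, p2))  # P syndrome
    sq = bytes(a ^ b for a, b in zip(q, q2))  # Q syndrome
    if not any(sp) and not any(sq):
        return "ok"
    if not any(sp):
        return "Q"
    if not any(sq):
        return "P"
    candidates = set()
    for a, b in zip(sp, sq):
        if a == 0 and b == 0:
            continue
        if a == 0 or b == 0:
            return "unknown"
        # For one bad data disk z: sq = g**z * sp, so z = log(sq) - log(sp).
        candidates.add((LOG[b] - LOG[a]) % 255)
    if len(candidates) == 1:
        z = candidates.pop()
        if z < len(data):
            return z
    return "unknown"


if __name__ == "__main__":
    stripe = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
    p, q = compute_pq(stripe)

    # Case 1: one data block silently corrupted; the assumption holds.
    rotted = list(stripe)
    rotted[2] = b"CxCC"
    print(locate_bad_disk(rotted, p, q))

    # Case 2: torn write; disk 1 got new data, P and Q are still old.
    torn = list(stripe)
    torn[1] = b"bbbb"
    print(locate_bad_disk(torn, p, q))
```

In the first case the sketch names disk 2, which really is the corrupted
one. In the torn-write case it names disk 1, the only disk holding the
new data, so a syndrome-based fix would roll that block back to its old
contents. Calling compute_pq(torn) instead leaves the data as the
filesystem already sees it.
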
If the array is non-degraded, the only safe operation is to recompute
P & Q from the data blocks, and this matches what a filesystem layer
will see.

raid6check makes sense when you know exactly how your system failed, and
you understand what it will do. If you don't understand what happened,
using one form of recovery over another is simply flipping a coin.

Phil