raid5/6 error amplification during normal operations

Hello, 

As far as I understand, when working with raid5/6 the MD driver does not check parity during normal read operations. During scrubs, when a parity mismatch is detected, the parity data is assumed to be incorrect and is regenerated from the data blocks.
Modern HDDs have read error rates of about 1 sector per 10^15 bits read, i.e. roughly 1 incorrectly read sector per ~125 TB of reads.
When operating a 10x20 TB array in a raid6 configuration with weekly scrubs, we will read about 50,000 TB of data over 5 years of operation.
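For reference, here is a back-of-envelope version of that calculation, assuming every weekly scrub reads the full array (rough numbers, Python just used as a calculator):

# Assumptions: full-array weekly scrubs, 1 misread sector per 10^15 bits read.
TB = 1e12                              # decimal terabytes, as on drive spec sheets
tb_per_misread = 1e15 / 8 / TB         # ~125 TB of reads per incorrectly read sector

array_tb = 10 * 20                     # 10 x 20 TB drives, all read on every scrub
scrubs = 52 * 5                        # weekly scrubs over 5 years of operation
total_read_tb = scrubs * array_tb      # 52,000 TB, i.e. the ~50,000 TB above

expected_misreads = total_read_tb / tb_per_misread
print(tb_per_misread, total_read_tb, round(expected_misreads))   # 125.0 52000 416, i.e. ~400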

With perfectly working drives and stable ECC RAM we will still get about 400 incorrectly read sectors over 5 years. 20% of these will land on parity blocks and will be correctly regenerated by the MD driver.
But unfortunately, in the remaining 80% of cases the error lands on a data block and propagates into the parity blocks when the scrub "repairs" the stripe, so at the end we get 400*0.8*3=960 sectors with incorrect data, even though all disks are working on-spec without any hardware failure.
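Spelling the amplification out (this assumes the usual "repair" behaviour of recomputing P and Q from whatever data was just read):

# What ~400 silent misreads during scrubs turn into on a 10-drive raid6.
# Per stripe, 2 of 10 blocks hold parity (P/Q) and 8 of 10 hold data.
misreads = 400
on_parity = misreads * 2 / 10   # ~80: scrub regenerates parity, no harm done
on_data   = misreads * 8 / 10   # ~320: scrub rewrites P and Q to match the bad value
bad_sectors = on_data * 3       # per event: the data sector plus the P and Q derived from it
print(on_parity, on_data, bad_sectors)   # 80.0 320.0 960.0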

I.e. while raid-6 does improve tolerance to detectable hardware failures, random undetectable errors are actually amplified dramatically (due to regular scrubs and the 80% chance of copying an error into the parity blocks).
It means that due to the dramatic increase in HDD sizes over the last 20 years, the intrinsic error rates of HDDs can no longer be ignored. They are as serious a threat as drive failure.

1) It is time to integrate raid6check into the MD driver so that scrubs correctly recover from such errors. This would reduce the raid-6 silent error rate by a factor of more than 10^9 (we would have to randomly get two errors at the same position on two different drives, which is very unlikely). As this recovery code only triggers on parity mismatches (which are rare), there would be no performance degradation. Right now we just leave data integrity on the table and mislead users: many are certain that raid-6 can correct random errors, when in reality it does not.
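For context, the math that would make this possible is the standard RAID-6 P/Q syndrome decoding that raid6check already does in userspace: with exactly one silently corrupted block in a stripe, the two syndromes identify which block is wrong and what the correct value is. Below is a minimal per-byte sketch of the idea in Python (assuming the usual GF(2^8) with generator 2 and polynomial 0x11d; an illustration only, not the kernel code):

# GF(2^8) tables for the RAID-6 field: polynomial 0x11d, generator g = 2.
GF_POLY = 0x11d
gf_exp = [0] * 512
gf_log = [0] * 256
x = 1
for i in range(255):
    gf_exp[i] = x
    gf_log[x] = i
    x <<= 1
    if x & 0x100:
        x ^= GF_POLY
for i in range(255, 512):
    gf_exp[i] = gf_exp[i - 255]       # g^255 = 1, so the table wraps around

def gf_mul(a, b):
    if a == 0 or b == 0:
        return 0
    return gf_exp[gf_log[a] + gf_log[b]]

def compute_pq(data):
    """P = XOR of all data bytes, Q = sum over GF(2^8) of g^i * D_i."""
    p, q = 0, 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(gf_exp[i], d)
    return p, q

def check_and_repair(data, p_stored, q_stored):
    """One byte per data disk plus stored P/Q: locate and fix a single bad byte."""
    p_calc, q_calc = compute_pq(data)
    sp = p_calc ^ p_stored
    sq = q_calc ^ q_stored
    if sp == 0 and sq == 0:
        return "clean", data
    if sp != 0 and sq == 0:
        return "P block corrupt, rewrite P", data
    if sp == 0 and sq != 0:
        return "Q block corrupt, rewrite Q", data
    # Both syndromes non-zero: a single bad data disk z satisfies sq = g^z * sp,
    # so z falls out of the discrete logs, and sp is exactly the error value.
    z = (gf_log[sq] - gf_log[sp]) % 255
    if z >= len(data):
        return "multiple errors, not correctable", data
    fixed = list(data)
    fixed[z] ^= sp
    return "data disk %d corrupt, repaired" % z, fixed

# Example: a silent error on data disk 3 is located and corrected.
stripe = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88]
p, q = compute_pq(stripe)
corrupted = list(stripe)
corrupted[3] ^= 0x5a
print(check_and_repair(corrupted, p, q))   # data disk 3 is identified and restored

A plain parity-mismatch scrub only sees that sp/sq are non-zero and rewrites P and Q; the location step above is what turns a detected mismatch into an actual correction.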

2) raid-5/raid-1 would require checksums kept in an external file (just like the bitmap), plus a checksum_actual flag in the bitmap, to avoid the otherwise guaranteed ongoing corruption caused by scrubs combined with the intrinsic error rates of HDDs. This would allow the MD driver to know which copy of the data is correct. Checksums could be handled just like the write-intent bitmap: write the data first, update the checksums later at idle time. Running mdadm on top of dm-integrity cannot support this delayed checksum update and is dramatically slower. 64-bit checksums could also allow experienced users to manually recover from single bit-flip errors even when there is no parity left.
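To illustrate the raid-1 case (everything here is hypothetical, not an existing MD interface; blake2b truncated to 8 bytes just stands in for whatever 64-bit checksum would actually be used): with an external per-chunk checksum plus a flag saying whether it is up to date, a scrub that finds the two copies differing can pick the copy whose checksum matches instead of arbitrarily preferring one leg:

import hashlib

def csum64(chunk):
    # Stand-in 64-bit checksum; the real algorithm and on-disk format are open questions.
    return hashlib.blake2b(chunk, digest_size=8).digest()

def scrub_chunk(copy_a, copy_b, stored_csum, csum_actual):
    """csum_actual models the proposed flag: if the checksum has not been
    updated since the last write to this chunk, it cannot be trusted."""
    if copy_a == copy_b:
        return copy_a, "in sync"
    if not csum_actual:
        # Same situation as today: two differing copies and no way to tell
        # which one is right, so fall back to the current behaviour.
        return copy_a, "mismatch, checksum stale - arbitrary copy wins"
    if csum64(copy_a) == stored_csum:
        return copy_a, "copy B corrupt, rewrite it from A"
    if csum64(copy_b) == stored_csum:
        return copy_b, "copy A corrupt, rewrite it from B"
    return copy_a, "both copies fail the checksum - report as unrecoverable"

# Example: a silent bit flip on one leg is detected and the good copy wins.
good = b"4k chunk of user data" * 100
bad = bytearray(good); bad[17] ^= 0x01
print(scrub_chunk(good, bytes(bad), csum64(good), csum_actual=True)[1])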

Best regards,
Mikhail





