Re: raid6 with dm-integrity should not cause device to fail

Nigel Croxon <ncroxon@xxxxxxxxxx> · Thu, 5 Sep 2019 11:29:07 -0400

On 9/5/19 7:35 AM, Nigel Croxon wrote:
On 6/20/19 7:31 AM, Nigel Croxon wrote:
Hello All,

When RAID6 is set up on a dm-integrity target that detects massive 
corruption, the leg is ejected from the array, even if the issue is 
correctable with a sector re-write and the array has the redundancy 
needed to correct it.

The leg is ejected because its rdev->read_errors count climbs past 
conf->max_nr_stripes (600).
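
To make the ejection rule above concrete, here is a minimal user-space
sketch of a per-device read-error counter with a fixed limit. It is an
illustration only, not the kernel code: struct model_rdev,
model_read_error() and MODEL_MAX_NR_STRIPES are hypothetical names, and
the limit of 600 is simply the value quoted above. It also leaves out
everything else the real read-completion path does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-device state; NOT the kernel's struct md_rdev. */
struct model_rdev {
    unsigned int read_errors; /* plays the role of rdev->read_errors */
    bool faulty;              /* set once the device would be ejected */
};

/* Assumed limit, taken from the value quoted for conf->max_nr_stripes. */
#define MODEL_MAX_NR_STRIPES 600u

/*
 * Record one failed read. The device is marked faulty as soon as the
 * running count exceeds the limit, regardless of why the read failed.
 * Returns true if the device is faulty after this error.
 */
static bool model_read_error(struct model_rdev *rdev)
{
    assert(rdev != NULL);
    rdev->read_errors++;
    if (rdev->read_errors > MODEL_MAX_NR_STRIPES)
        rdev->faulty = true;
    return rdev->faulty;
}
```

In this model a burst of checksum mismatches is indistinguishable from
a dying disk: once enough reads have failed, the leg is marked faulty
whether or not each one could have been repaired from parity.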

The return status in dm-crypt when there is a data integrity error is 
BLK_STS_PROTECTION.

I propose we don't increment the read_errors when the bi->bi_status 
is BLK_STS_PROTECTION.


 drivers/md/raid5.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b83bce2beb66..ca73e60e33ed 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2526,7 +2526,8 @@ static void raid5_end_read_request(struct bio * bi)
         int set_bad = 0;

         clear_bit(R5_UPTODATE, &sh->dev[i].flags);
-        atomic_inc(&rdev->read_errors);
+        if (!(bi->bi_status == BLK_STS_PROTECTION))
+            atomic_inc(&rdev->read_errors);
         if (test_bit(R5_ReadRepl, &sh->dev[i].flags))
             pr_warn_ratelimited(
                 "md/raid:%s: read error on replacement device (sector %llu on %s).\n",


I'm up against this wall again.  We should continue to count errors 
returned by the lower layer, but if those errors are -EILSEQ rather 
than -EIO, MD should not mark the device as failed.
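
For reference, BLK_STS_PROTECTION is the block-layer status that the
kernel converts to -EILSEQ, so this is the same class of error as in
the original patch. Below is a minimal user-space sketch of one possible
reading of the policy: every error is still counted, but an integrity
error never fails the device by itself. struct sketch_rdev and
sketch_handle_read_error() are hypothetical names, and the limit is
passed in by the caller; none of this is the actual md code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-device state; NOT the kernel's struct md_rdev. */
struct sketch_rdev {
    unsigned int read_errors; /* all lower-layer read errors, of any kind */
    bool faulty;              /* set once the device would be failed */
};

/*
 * Handle one failed read whose negative errno is `err`.
 * Assumed policy: always count the error, but only fail the device when
 * the error is not an integrity error (-EILSEQ) and the count is over
 * `limit`. Returns true if the device is faulty after this error.
 */
static bool sketch_handle_read_error(struct sketch_rdev *rdev, int err,
                                     unsigned int limit)
{
    rdev->read_errors++;          /* keep counting, as before */

    if (err == -EILSEQ)
        return rdev->faulty;      /* correctable by re-write: do not fail */

    if (rdev->read_errors > limit)
        rdev->faulty = true;      /* e.g. -EIO past the limit: fail the leg */
    return rdev->faulty;
}
```

One consequence of this particular reading is that the counter can sit
above the limit after many integrity errors, so the next -EIO fails the
device immediately. Whether that is desirable, or whether the two kinds
of error should be counted separately, is left open here.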


https://securitypitfalls.wordpress.com/2018/05/08/raid-doesnt-work/