Re: raid6 with dm-integrity should not cause device to fail

Song Liu <liu.song.a23@xxxxxxxxx> · Thu, 5 Sep 2019 09:26:12 -0700

On Thu, Sep 5, 2019 at 6:10 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
>
> On 6/20/19 7:31 AM, Nigel Croxon wrote:
> > Hello All,
> >
> > When RAID6 is set up on a dm-integrity target that detects massive
> > corruption, the leg is ejected from the array, even if the issue
> > is correctable with a sector rewrite and the array has the necessary
> > redundancy to correct it.
> >
> > The leg is ejected because it runs up the rdev->read_errors beyond
> > conf->max_nr_stripes (600).
> >
> > The return status in dm-crypt when there is a data integrity error is
> > BLK_STS_PROTECTION.
> >
> > I propose we don't increment the read_errors when the bi->bi_status is
> > BLK_STS_PROTECTION.
> >
> >
> >  drivers/md/raid5.c | 3 ++-
> >  1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index b83bce2beb66..ca73e60e33ed 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -2526,7 +2526,8 @@ static void raid5_end_read_request(struct bio * bi)
> >          int set_bad = 0;
> >
> >          clear_bit(R5_UPTODATE, &sh->dev[i].flags);
> > -        atomic_inc(&rdev->read_errors);
> > +        if (bi->bi_status != BLK_STS_PROTECTION)
> > +            atomic_inc(&rdev->read_errors);
> >          if (test_bit(R5_ReadRepl, &sh->dev[i].flags))
> >              pr_warn_ratelimited(
> >                  "md/raid:%s: read error on replacement device (sector %llu on %s).\n",
>
>
> I'm up against this wall again.  We should continue to count errors
> returned by the lower layer, but if those errors are -EILSEQ instead of
> -EIO, MD should not mark the device as failed.
>

Sorry for the very late reply.

I think the change is in the right direction. Please submit an official
patch so we can discuss the details.

Thanks,
Song
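
For readers following the thread, the policy Nigel describes in his
follow-up (keep counting every read error, but do not let -EILSEQ
errors push the device toward being failed) can be modelled in a few
lines of userspace C. This is only a sketch and not the patch that was
proposed or merged: struct model_rdev, its media_errors field, and
model_read_error() are hypothetical names invented for illustration,
and the "greater than limit" comparison is an assumption about how the
threshold is applied. The 600 used in the note below is the
max_nr_stripes value quoted above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a RAID member; NOT the kernel's struct md_rdev. */
struct model_rdev {
    int read_errors;   /* every read error reported by the lower layer */
    int media_errors;  /* subset that counts toward failing the device */
    bool faulty;       /* set once the device would be ejected */
};

/*
 * Hypothetical policy helper. Every error is counted, but only errors
 * other than -EILSEQ (the errno that corresponds to BLK_STS_PROTECTION)
 * count toward the failure threshold. Returns true once the device is
 * considered failed.
 */
static bool model_read_error(struct model_rdev *rdev, int err, int limit)
{
    rdev->read_errors++;

    /* Integrity mismatches are assumed repairable by a sector rewrite. */
    if (err != -EILSEQ)
        rdev->media_errors++;

    /* Assumption: the device fails once the count exceeds the limit. */
    if (rdev->media_errors > limit)
        rdev->faulty = true;

    return rdev->faulty;
}
```

With a limit of 600, a thousand -EILSEQ errors leave the modelled
device in the array while read_errors still reaches 1000; the 601st
-EIO error is what marks it faulty.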