Re: [PATCH] raid5 improve too many read errors msg by adding limits

Song Liu <liu.song.a23@xxxxxxxxx> · Tue, 20 Aug 2019 14:41:09 -0700

On Tue, Aug 20, 2019 at 7:30 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
>
>
> On 8/16/19 7:52 PM, Song Liu wrote:
> > On Fri, Aug 16, 2019 at 10:02 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
> > [...]
> >> [  +0.000008] md/raid:md127: 793 read_errors, > 781 stripes
> >> [  +0.000001] md/raid:md127: Too many read errors, failing device dm-0.
> >> [  +0.000018] md/raid:md127: 794 read_errors, > 781 stripes
> >> [  +0.000000] md/raid:md127: Too many read errors, failing device dm-0.
> >> [  +0.000009] md/raid:md127: 795 read_errors, > 781 stripes
> >> [  +0.000001] md/raid:md127: Too many read errors, failing device dm-0.
> >> [  +0.000008] md/raid:md127: 796 read_errors, > 781 stripes
> >> [  +0.000000] md/raid:md127: Too many read errors, failing device dm-0.
> >> [  +0.000018] md/raid:md127: 797 read_errors, > 781 stripes
> >> [  +0.000001] md/raid:md127: Too many read errors, failing device dm-0.
> >> [  +0.000008] md/raid:md127: 798 read_errors, > 781 stripes
> >> [  +0.000001] md/raid:md127: Too many read errors, failing device dm-0.
> >> [  +0.000017] md/raid:md127: 799 read_errors, > 781 stripes
> >> [  +0.000001] md/raid:md127: Too many read errors, failing device dm-0.
> >> [  +0.000008] md/raid:md127: 800 read_errors, > 781 stripes
> >> [  +0.000001] md/raid:md127: Too many read errors, failing device dm-0.
> >> [  +0.000008] md/raid:md127: 801 read_errors, > 781 stripes
> >> [  +0.000000] md/raid:md127: Too many read errors, failing device dm-0.
> >> [  +0.000021] md/raid:md127: 802 read_errors, > 781 stripes
> >> [  +0.000000] md/raid:md127: Too many read errors, failing device dm-0.
> >> [  +0.000009] md/raid:md127: 803 read_errors, > 781 stripes
> >> [  +0.000000] md/raid:md127: Too many read errors, failing device dm-0.
> >> [  +0.000009] md/raid:md127: 804 read_errors, > 781 stripes
> >> [  +0.000000] md/raid:md127: Too many read errors, failing device dm-0.
> >> [  +0.000008] md/raid:md127: 805 read_errors, > 781 stripes
> >> [  +0.000001] md/raid:md127: Too many read errors, failing device dm-0.
> >> [  +0.928614] md: md127: requested-resync interrupted.
> >>
> > This is a little too noisy. How about we only pr_warn() for
> > test_bit(Faulty) == 0?
> > (This is not directly related to this patch, but since we are at it).
> >
> > Thanks,
> > Song
> From: Nigel Croxon <ncroxon@xxxxxxxxxx>
> Date: Mon, 19 Aug 2019 16:01:04 -0400
> Subject: [PATCH]  raid5 improve too many read errors msg by adding limits
>
> Limits can often be changed by the admin. When discussing such limits,
> it helps if the log message provides "self-sustained" facts. Also,
> sometimes the admin thinks a limit was changed, but the change did not
> take effect for some reason, or the wrong limit was changed.
>
> V3: Only pr_warn when Faulty is 0.
> V2: Add read_errors value to pr_warn.
>
> Signed-off-by: Nigel Croxon <ncroxon@xxxxxxxxxx>
> ---
>   drivers/md/raid5.c | 13 +++++++++----
>   1 file changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 7fde645d2e90..6812cefea308 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -2557,10 +2557,15 @@ static void raid5_end_read_request(struct bio * bi)
>                   (unsigned long long)s,
>                   bdn);
>           } else if (atomic_read(&rdev->read_errors)
> -             > conf->max_nr_stripes)
> -            pr_warn("md/raid:%s: Too many read errors, failing device %s.\n",
> -                   mdname(conf->mddev), bdn);
> -        else
> +            > conf->max_nr_stripes) {
> +            if (!test_bit(Faulty, &rdev->flags)) {
> +                pr_warn("md/raid:%s: %d read_errors, > %d stripes\n",
> +                   mdname(conf->mddev), atomic_read(&rdev->read_errors),
> +                   conf->max_nr_stripes);
> +                pr_warn("md/raid:%s: Too many read errors, failing device %s.\n",
> +                   mdname(conf->mddev), bdn);
> +            }
> +        } else
>               retry = 1;
>           if (set_bad && test_bit(In_sync, &rdev->flags)
>               && !test_bit(R5_ReadNoMerge, &sh->dev[i].flags))
> --

This looks good, but I ran into a git issue applying the patch.

Please double check with ./scripts/checkpatch.pl and resend with git-send-email.
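
For reference, here is a rough userspace sketch of the behavior I expect from
V3: once the device is marked Faulty, further read errors should no longer
print the two warnings. This is only a model, not the kernel code. The names
struct fake_rdev, handle_read_error() and simulate() are made up for
illustration, and setting the faulty flag right after the warning is a
simplifying assumption (in raid5.c the device is failed later in
raid5_end_read_request()).

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the md_rdev fields the patch touches. */
struct fake_rdev {
	int read_errors;	/* models atomic_read(&rdev->read_errors) */
	bool faulty;		/* models test_bit(Faulty, &rdev->flags) */
};

/*
 * Model one failed read on the device.
 * Returns the number of warning lines printed for this error.
 */
static int handle_read_error(struct fake_rdev *rdev, int max_nr_stripes)
{
	int printed = 0;

	rdev->read_errors++;
	if (rdev->read_errors > max_nr_stripes) {
		/* V3: only warn while the device is not yet Faulty. */
		if (!rdev->faulty) {
			printf("md/raid:%s: %d read_errors, > %d stripes\n",
			       "md127", rdev->read_errors, max_nr_stripes);
			printf("md/raid:%s: Too many read errors, failing device %s.\n",
			       "md127", "dm-0");
			printed = 2;
		}
		/* Assumption: the device is failed immediately after the warning. */
		rdev->faulty = true;
	}
	return printed;
}

/* Feed n_errors failed reads to a fresh device; return total lines printed. */
static int simulate(int n_errors, int max_nr_stripes)
{
	struct fake_rdev rdev = { 0, false };
	int total = 0;

	for (int i = 0; i < n_errors; i++)
		total += handle_read_error(&rdev, max_nr_stripes);
	return total;
}
```

With the numbers from the log above, simulate(805, 781) prints the pair of
warnings once, at error 782, and stays quiet for the remaining errors.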

Thanks,
Song