Re: [PATCH] Don't jump to compute_result state from check_result state

Song Liu <liu.song.a23@xxxxxxxxx> · Tue, 16 Apr 2019 14:18:39 -0700

On Tue, Apr 16, 2019 at 1:24 PM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
>
>
> On 4/16/19 1:40 PM, Song Liu wrote:
> > On Mon, Apr 15, 2019 at 1:53 PM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
> >>
> >>
> >> On 04/15/2019 03:24 PM, Song Liu wrote:
> >>> On Mon, Apr 15, 2019 at 10:54 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
> >>>>
> >>>> On 04/15/2019 01:44 PM, Song Liu wrote:
> >>>>> Hi Nigel,
> >>>>>
> >>>>> On Thu, Apr 11, 2019 at 10:21 AM Song Liu <liu.song.a23@xxxxxxxxx> wrote:
> >>>>>> On Tue, Apr 9, 2019 at 12:26 PM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
> >>>>>>> On 4/9/19 2:09 PM, John Stoffel wrote:
> >>>>>>>>>>>>> "Dan" == Dan Williams <dan.j.williams@xxxxxxxxx> writes:
> >>>>>>>> Dan> On Mon, Apr 8, 2019 at 4:18 PM Song Liu <liu.song.a23@xxxxxxxxx> wrote:
> >>>>>>>> Dan> [..]
> >>>>>>>>>>>> To trigger this issue, you not only need a failed disk but to also
> >>>>>>>>>>>> perform a scrubbing operation.  The customer's systems both crashed
> >>>>>>>>>>>> early Sunday morning when the raid-check script is run by default from cron.
> >>>>>>>>>>> Ok, I follow this, but I come to a different answer on the required
> >>>>>>>>>>> fix. I think it is simply the following to preserve the writeback
> >>>>>>>>>>> action after the parity check, because we need the failed Q slot to be
> >>>>>>>>>>> written back if we're recovering. P will be not up-to-date because it
> >>>>>>>>>>> was checked with the good disks, but sh->ops.zero_sum_result will be
> >>>>>>>>>>> 0, so that will skip the writeback of a !uptodate P value.
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >>>>>>>>>>> index c033bfcb209e..e2eb59289346 100644
> >>>>>>>>>>> --- a/drivers/md/raid5.c
> >>>>>>>>>>> +++ b/drivers/md/raid5.c
> >>>>>>>>>>> @@ -4187,7 +4187,6 @@ static void handle_parity_checks6(struct r5conf
> >>>>>>>>>>> *conf, struct stripe_head *sh,
> >>>>>>>>>>>                     /* now write out any block on a failed drive,
> >>>>>>>>>>>                      * or P or Q if they were recomputed
> >>>>>>>>>>>                      */
> >>>>>>>>>>> -               BUG_ON(s->uptodate < disks - 1); /* We don't need Q to
> >>>>>>>>>>> recover */
> >>>>>>>>>> Thanks Dan!
> >>>>>>>>>>
> >>>>>>>>>> Would it make sense to rework the check as
> >>>>>>>>>>
> >>>>>>>>>> BUG_ON(s->uptodate < disks - 2);
> >>>>>>>> Dan> I think the problem is that any 'uptodate' vs 'disks' check is
> >>>>>>>> Dan> not precise enough in this path. What might be better is to put
> >>>>>>>> Dan> "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the devices that
> >>>>>>>> Dan> might try to kick off writes and then skip the action. Better to
> >>>>>>>> Dan> prevent the raid driver from taking unexpected action *and* keep
> >>>>>>>> Dan> the system alive vs killing the machine with BUG_ON.
> >>>>>>>>
> >>>>>>>> Dan> BUG_ON has fallen out of favor for exception reporting since
> >>>>>>>> Dan> those assertions were introduced.
> >>>>>>>>
> >>>>>>>> And since it' causes the system to crash... it's super annoying when
> >>>>>>>> the rest of the system is working fine.  Please only use a WARN_ON,
> >>>>>>>> and maybe even set the RAID volume readonly, etc.  But don't bring
> >>>>>>>> down the rest of the system if possible.
> >>>>>>>>
> >>>>>>>> John
> >>>>>>> I reverted the first patch as it made its way upstream.
> >>>>>>>
> >>>>>>> Testing this change now.
> >>>>>>>
> >>>>>>> ---
> >>>>>>>
> >>>>>>>      drivers/md/raid5.c | 2 +-
> >>>>>>>      1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >>>>>>> index c033bfcb209e..660ca3af2914 100644
> >>>>>>> --- a/drivers/md/raid5.c
> >>>>>>> +++ b/drivers/md/raid5.c
> >>>>>>> @@ -4187,7 +4187,7 @@ static void handle_parity_checks6(struct r5conf
> >>>>>>> *conf, struct stripe_head *sh,
> >>>>>>>              /* now write out any block on a failed drive,
> >>>>>>>               * or P or Q if they were recomputed
> >>>>>>>               */
> >>>>>>> -        BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */
> >>>>>>> +        WARN_ON(s->uptodate < disks - 2); /* We don't need Q to recover */
> >>>>>> I think this WARN_ON() is the best way to go, though the comment "don't need Q"
> >>>>>> needs some revise.
> >>>>>>
> >>>>>> Nigel, how does this work in your tests?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Song
> >>>>> Could you please share updates about the tests?
> >>>>>
> >>>>> Thanks,
> >>>>> Song
> >>>> Hello Song,
> >>>> Our customer has confirmed, the issue is fixed. He had run this on two
> >>>> systems with defective Disks and run raid-check. There were some errors
> >>>> corrected, the device was degraded, but the kernel never crashed.
> >>>>
> >>>> -Nigel
> >>> Thanks Nigel!
> >>>
> >>> Could you please submit official patch?
> >>>
> >>> Song
> >>>
> >>>>>>>              if (s->failed == 2) {
> >>>>>>>                  dev = &sh->dev[s->failed_num[1]];
> >>>>>>>                  s->locked++;
> >>>>>>> --
> >>>>>>> 2.20.1
> >>>>>>>
> >> Song,   You will need to revert 4f4fd7c5798 before applying this:
> >> -Nigel
> >>
> >> From: Nigel Croxon <ncroxon@xxxxxxxxxx>
> >> Date: Wed, 10 Apr 2019 09:23:34 -0400
> >> Subject: [PATCH ] preserve the writeback action after the parity check
> >>
> >> The problem is that any 'uptodate' vs 'disks' check is not precise
> >> in this path. Put a "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the
> >> device that might try to kick off writes and then skip the action.
> >> Better to prevent the raid driver from taking unexpected action *and* keep
> >> the system alive vs killing the machine with BUG_ON.
> >>
> >> Fixes: 4f4fd7c5798 Don't jump to compute_result state from check_result
> >> state
> >>
> >> Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
> >> Signed-off-by: Nigel Croxon <ncroxon@xxxxxxxxxx>
> >> ---
> >>    drivers/md/raid5.c | 10 +++++++++-
> >>    1 file changed, 9 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >> index f957b7e42562..4f6b32366c00 100644
> >> --- a/drivers/md/raid5.c
> >> +++ b/drivers/md/raid5.c
> >> @@ -4187,7 +4187,7 @@ static void handle_parity_checks6(struct r5conf
> >> *conf, struct stripe_head *sh,
> >>            /* now write out any block on a failed drive,
> >>             * or P or Q if they were recomputed
> >>             */
> >> -        BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */
> >> +        dev = NULL;
> >>            if (s->failed == 2) {
> >>                dev = &sh->dev[s->failed_num[1]];
> >>                s->locked++;
> >> @@ -4212,6 +4212,14 @@ static void handle_parity_checks6(struct r5conf
> >> *conf, struct stripe_head *sh,
> >>                set_bit(R5_LOCKED, &dev->flags);
> >>                set_bit(R5_Wantwrite, &dev->flags);
> >>            }
> >> +        if (WARN_ONCE(dev && !test_bit(R5_UPTODATE, &dev->flags),
> >> +                    "%s: disk%ld not up to date\n",
> >> +                    mdname(conf->mddev),
> >> +                    dev - (struct r5dev *) &sh->dev)) {
> >> +            clear_bit(R5_LOCKED, &dev->flags);
> >> +            clear_bit(R5_Wantwrite, &dev->flags);
> >> +            s->locked--;
> >> +        }
> >>            clear_bit(STRIPE_DEGRADED, &sh->state);
> >>
> >>            set_bit(STRIPE_INSYNC, &sh->state);
> >> --
> >> 2.20.1
> >>
> >>
> > Thanks Nigel!
> >
> > Pushed to https://github.com/liu-song-6/linux/tree/md-next .
> >
> > Song
>
> From: Nigel Croxon <ncroxon@xxxxxxxxxx>
> Date: Tue, 16 Apr 2019 15:37:30 -0400
> Subject: [PATCH] preserve the writeback action after the parity check
>
>
> The problem is that any 'uptodate' vs 'disks' check is not precise
> in this path. Put a "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the
> device that might try to kick off writes and then skip the action.
> Better to prevent the raid driver from taking unexpected action *and* keep
> the system alive vs killing the machine with BUG_ON.
>
> V2: change format '%ld' argument of type 'long int', to 'int'.
>
> Fixes: 4f4fd7c5798 Don't jump to compute_result state from check_result
> state
>
> Signed-off-by: Dan Williams <dan.j.williams@xxxxxxxxx>
> Signed-off-by: Nigel Croxon <ncroxon@xxxxxxxxxx>

How about something like this?

https://github.com/liu-song-6/linux/commit/88b4458c52ecbc3e016ec95ef866f19afe4550a0

> ---
>   drivers/md/raid5.c | 10 +++++++++-
>   1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index f957b7e42562..395e7ec0fb6e 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -4187,7 +4187,7 @@ static void handle_parity_checks6(struct r5conf
> *conf, struct stripe_head *sh,
>           /* now write out any block on a failed drive,
>            * or P or Q if they were recomputed
>            */
> -        BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */
> +        dev = NULL;
>           if (s->failed == 2) {
>               dev = &sh->dev[s->failed_num[1]];
>               s->locked++;
> @@ -4212,6 +4212,14 @@ static void handle_parity_checks6(struct r5conf
> *conf, struct stripe_head *sh,
>               set_bit(R5_LOCKED, &dev->flags);
>               set_bit(R5_Wantwrite, &dev->flags);
>           }
> +        if (WARN_ONCE(dev && !test_bit(R5_UPTODATE, &dev->flags),
> +            "%s: disk%lx not up to date\n",
> +            mdname(conf->mddev),
> +            dev - (struct r5dev *) &sh->dev)) {
> +            clear_bit(R5_LOCKED, &dev->flags);
> +            clear_bit(R5_Wantwrite, &dev->flags);
> +            s->locked--;
> +        }
>           clear_bit(STRIPE_DEGRADED, &sh->state);
>
>           set_bit(STRIPE_INSYNC, &sh->state);
> --
> 2.20.1
>
>
>
>
>