Re: [PATCH] Don't jump to compute_result state from check_result state

Song Liu <liu.song.a23@xxxxxxxxx> · Tue, 9 Apr 2019 10:43:02 -0700

On Mon, Apr 8, 2019 at 4:29 PM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> On Mon, Apr 8, 2019 at 4:18 PM Song Liu <liu.song.a23@xxxxxxxxx> wrote:
> [..]
> > > > To trigger this issue, you not only need a failed disk but to also
> > > > perform a scrubbing operation.  The customer's systems both crashed
> > > > early Sunday morning when the raid-check script is run by default from cron.
> > >
> > > Ok, I follow this, but I come to a different answer on the required
> > > fix. I think it is simply the following to preserve the writeback
> > > action after the parity check, because we need the failed Q slot to be
> > > written back if we're recovering. P will be not up-to-date because it
> > > was checked with the good disks, but sh->ops.zero_sum_result will be
> > > 0, so that will skip the writeback of a !uptodate P value.
> > >
> > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > > index c033bfcb209e..e2eb59289346 100644
> > > --- a/drivers/md/raid5.c
> > > +++ b/drivers/md/raid5.c
> > > @@ -4187,7 +4187,6 @@ static void handle_parity_checks6(struct r5conf
> > > *conf, struct stripe_head *sh,
> > >                 /* now write out any block on a failed drive,
> > >                  * or P or Q if they were recomputed
> > >                  */
> > > -               BUG_ON(s->uptodate < disks - 1); /* We don't need Q to
> > > recover */
> >
> > Thanks Dan!
> >
> > Would it make sense to rework the check as
> >
> >   BUG_ON(s->uptodate < disks - 2);
>
> I think the problem is that any 'uptodate' vs 'disks' check is not
> precise enough in this path. What might be better is to put
> "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the devices that
> might try to kick off writes and then skip the action. Better to
> prevent the raid driver from taking unexpected action *and* keep the
> system alive vs killing the machine with BUG_ON.
>
> BUG_ON has fallen out of favor for exception reporting since those
> assertions were introduced.

Thanks Dan!

Nigel and Xiao, could you please share your thoughts on this? Personally,
I feel this is the right fix.

Best,
Song