Re: [PATCH] Don't jump to compute_result state from check_result state

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mon, Apr 8, 2019 at 4:29 PM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> On Mon, Apr 8, 2019 at 4:18 PM Song Liu <liu.song.a23@xxxxxxxxx> wrote:
> [..]
> > > > To trigger this issue, you not only need a failed disk but to also
> > > > perform a scrubbing operation.  The customer's systems both crashed
> > > > early Sunday morning when the raid-check script is run by default from cron.
> > >
> > > Ok, I follow this, but I come to a different answer on the required
> > > fix. I think it is simply the following to preserve the writeback
> > > action after the parity check, because we need the failed Q slot to be
> > > written back if we're recovering. P will be not up-to-date because it
> > > was checked with the good disks, but sh->ops.zero_sum_result will be
> > > 0, so that will skip the writeback of a !uptodate P value.
> > >
> > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > > index c033bfcb209e..e2eb59289346 100644
> > > --- a/drivers/md/raid5.c
> > > +++ b/drivers/md/raid5.c
> > > @@ -4187,7 +4187,6 @@ static void handle_parity_checks6(struct r5conf
> > > *conf, struct stripe_head *sh,
> > >                 /* now write out any block on a failed drive,
> > >                  * or P or Q if they were recomputed
> > >                  */
> > > -               BUG_ON(s->uptodate < disks - 1); /* We don't need Q to
> > > recover */
> >
> > Thanks Dan!
> >
> > Would it make sense to rework the check as
> >
> >   BUG_ON(s->uptodate < disks - 2);
>
> I think the problem is that any 'uptodate' vs 'disks' check is not
> precise enough in this path. What might be better is to put
> "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the devices that
> might try to kick off writes and then skip the action. Better to
> prevent the raid driver from taking unexpected action *and* keep the
> system alive vs killing the machine with BUG_ON.
>
> BUG_ON has fallen out of favor for exception reporting since those
> assertions were introduced.

Thanks Dan!

Nigel and Xiao, could you please share your thoughts on this? Personally,
I feel this is the right fix.

Best,
Song



[Index of Archives]     [Linux RAID Wiki]     [ATA RAID]     [Linux SCSI Target Infrastructure]     [Linux Block]     [Linux IDE]     [Linux SCSI]     [Linux Hams]     [Device Mapper]     [Device Mapper Cryptographics]     [Kernel]     [Linux Admin]     [Linux Net]     [GFS]     [RPM]     [git]     [Yosemite Forum]


  Powered by Linux