On Mon, Apr 8, 2019 at 4:29 PM Dan Williams <dan.j.williams@xxxxxxxxx> wrote: > > On Mon, Apr 8, 2019 at 4:18 PM Song Liu <liu.song.a23@xxxxxxxxx> wrote: > [..] > > > > To trigger this issue, you not only need a failed disk but to also > > > > perform a scrubbing operation. The customer's systems both crashed > > > > early Sunday morning when the raid-check script is run by default from cron. > > > > > > Ok, I follow this, but I come to a different answer on the required > > > fix. I think it is simply the following to preserve the writeback > > > action after the parity check, because we need the failed Q slot to be > > > written back if we're recovering. P will be not up-to-date because it > > > was checked with the good disks, but sh->ops.zero_sum_result will be > > > 0, so that will skip the writeback of a !uptodate P value. > > > > > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c > > > index c033bfcb209e..e2eb59289346 100644 > > > --- a/drivers/md/raid5.c > > > +++ b/drivers/md/raid5.c > > > @@ -4187,7 +4187,6 @@ static void handle_parity_checks6(struct r5conf > > > *conf, struct stripe_head *sh, > > > /* now write out any block on a failed drive, > > > * or P or Q if they were recomputed > > > */ > > > - BUG_ON(s->uptodate < disks - 1); /* We don't need Q to > > > recover */ > > > > Thanks Dan! > > > > Would it make sense to rework the check as > > > > BUG_ON(s->uptodate < disks - 2); > > I think the problem is that any 'uptodate' vs 'disks' check is not > precise enough in this path. What might be better is to put > "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the devices that > might try to kick off writes and then skip the action. Better to > prevent the raid driver from taking unexpected action *and* keep the > system alive vs killing the machine with BUG_ON. > > BUG_ON has fallen out of favor for exception reporting since those > assertions were introduced. Thanks Dan! Nigel and Xiao, could you please share your thoughts on this? Personally, I feel this is the right fix. Best, Song