Nice! Sent from my iPhone > On Apr 9, 2019, at 3:26 PM, Nigel Croxon <ncroxon@xxxxxxxxxx> wrote: > > > On 4/9/19 2:09 PM, John Stoffel wrote: >>>>>>> "Dan" == Dan Williams <dan.j.williams@xxxxxxxxx> writes: >> Dan> On Mon, Apr 8, 2019 at 4:18 PM Song Liu <liu.song.a23@xxxxxxxxx> wrote: >> Dan> [..] >>>>>> To trigger this issue, you not only need a failed disk but to also >>>>>> perform a scrubbing operation. The customer's systems both crashed >>>>>> early Sunday morning when the raid-check script is run by default from cron. >>>>> Ok, I follow this, but I come to a different answer on the required >>>>> fix. I think it is simply the following to preserve the writeback >>>>> action after the parity check, because we need the failed Q slot to be >>>>> written back if we're recovering. P will be not up-to-date because it >>>>> was checked with the good disks, but sh->ops.zero_sum_result will be >>>>> 0, so that will skip the writeback of a !uptodate P value. >>>>> >>>>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c >>>>> index c033bfcb209e..e2eb59289346 100644 >>>>> --- a/drivers/md/raid5.c >>>>> +++ b/drivers/md/raid5.c >>>>> @@ -4187,7 +4187,6 @@ static void handle_parity_checks6(struct r5conf >>>>> *conf, struct stripe_head *sh, >>>>> /* now write out any block on a failed drive, >>>>> * or P or Q if they were recomputed >>>>> */ >>>>> - BUG_ON(s->uptodate < disks - 1); /* We don't need Q to >>>>> recover */ >>>> Thanks Dan! >>>> >>>> Would it make sense to rework the check as >>>> >>>> BUG_ON(s->uptodate < disks - 2); >> Dan> I think the problem is that any 'uptodate' vs 'disks' check is >> Dan> not precise enough in this path. What might be better is to put >> Dan> "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the devices that >> Dan> might try to kick off writes and then skip the action. Better to >> Dan> prevent the raid driver from taking unexpected action *and* keep >> Dan> the system alive vs killing the machine with BUG_ON. >> >> Dan> BUG_ON has fallen out of favor for exception reporting since >> Dan> those assertions were introduced. >> >> And since it' causes the system to crash... it's super annoying when >> the rest of the system is working fine. Please only use a WARN_ON, >> and maybe even set the RAID volume readonly, etc. But don't bring >> down the rest of the system if possible. >> >> John > > I reverted the first patch as it made its way upstream. > > Testing this change now. > > --- > > drivers/md/raid5.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c > index c033bfcb209e..660ca3af2914 100644 > --- a/drivers/md/raid5.c > +++ b/drivers/md/raid5.c > @@ -4187,7 +4187,7 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh, > /* now write out any block on a failed drive, > * or P or Q if they were recomputed > */ > - BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */ > + WARN_ON(s->uptodate < disks - 2); /* We don't need Q to recover */ > if (s->failed == 2) { > dev = &sh->dev[s->failed_num[1]]; > s->locked++; > -- > 2.20.1 >