Re: [PATCH] Don't jump to compute_result state from check_result state

"John Stoffel" <john@xxxxxxxxxxx> · Tue, 9 Apr 2019 14:09:31 -0400

>>>>> "Dan" == Dan Williams <dan.j.williams@xxxxxxxxx> writes:

Dan> On Mon, Apr 8, 2019 at 4:18 PM Song Liu <liu.song.a23@xxxxxxxxx> wrote:
Dan> [..]
>> > > To trigger this issue, you not only need a failed disk but to also
>> > > perform a scrubbing operation.  The customer's systems both crashed
>> > > early Sunday morning when the raid-check script is run by default from cron.
>> >
>> > Ok, I follow this, but I come to a different answer on the required
>> > fix. I think it is simply the following to preserve the writeback
>> > action after the parity check, because we need the failed Q slot to be
>> > written back if we're recovering. P will be not up-to-date because it
>> > was checked with the good disks, but sh->ops.zero_sum_result will be
>> > 0, so that will skip the writeback of a !uptodate P value.
>> >
>> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> > index c033bfcb209e..e2eb59289346 100644
>> > --- a/drivers/md/raid5.c
>> > +++ b/drivers/md/raid5.c
>> > @@ -4187,7 +4187,6 @@ static void handle_parity_checks6(struct r5conf
>> > *conf, struct stripe_head *sh,
>> >                 /* now write out any block on a failed drive,
>> >                  * or P or Q if they were recomputed
>> >                  */
>> > -               BUG_ON(s->uptodate < disks - 1); /* We don't need Q to
>> > recover */
>> 
>> Thanks Dan!
>> 
>> Would it make sense to rework the check as
>> 
>> BUG_ON(s->uptodate < disks - 2);

Dan> I think the problem is that any 'uptodate' vs 'disks' check is
Dan> not precise enough in this path. What might be better is to put
Dan> "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the devices that
Dan> might try to kick off writes and then skip the action. Better to
Dan> prevent the raid driver from taking unexpected action *and* keep
Dan> the system alive vs killing the machine with BUG_ON.

Dan> BUG_ON has fallen out of favor for exception reporting since
Dan> those assertions were introduced.

And since it' causes the system to crash... it's super annoying when
the rest of the system is working fine.  Please only use a WARN_ON,
and maybe even set the RAID volume readonly, etc.  But don't bring
down the rest of the system if possible.

John