Re: [PATCH] Don't jump to compute_result state from check_result state

On Thu, Mar 28, 2019 at 9:57 AM Song Liu <liu.song.a23@xxxxxxxxx> wrote:
>
> On Thu, Mar 28, 2019 at 5:16 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
> >
> > On 3/26/19 7:33 PM, Song Liu wrote:
> >
> > On Tue, Mar 26, 2019 at 9:08 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
> >
> > Changing state from check_state_check_result to
> > check_state_compute_result not only is unsafe but also doesn't
> > appear to serve a valid purpose.  A raid6 check should only be
> > pushing out extra writes if doing repair and a mis-match occurs.
> > The stripe dev management will already try and do repair writes
> > for failing sectors.
> >
> > Could you please explain more details about the failed case before
> > this patch?
> >
> > Thanks,
> > Song
> >
> > Reproduced the issue with a modified scsi_debug module.  I redefined OPT_MEDIUM_ERR_ADDR to 12000 to place the I/O error where I wanted it on the disk, loaded the modified scsi_debug with dev_size_mb=250 to enlarge it, and partitioned it to build a test raid6 device:
> >
> >
> >    Device Boot      Start         End      Blocks   Id  System
> > /dev/sde1            2048       94207       46080   83  Linux
> > /dev/sde2           94208      186367       46080   83  Linux
> > /dev/sde3          186368      278527       46080   83  Linux
> > /dev/sde4          278528      511999      116736    5  Extended
> > /dev/sde5          280576      372735       46080   83  Linux
> > /dev/sde6          374784      466943       46080   83  Linux
> >
> > The raid6 device was created with:
> >
> > mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6
> >
> > After initially running, a check can be triggered and it works without issue.  The medium error feature of scsi_debug can then be turned on with:
> >
> > echo "2" >/sys/module/scsi_debug/parameters/opts
> >
> >
> > Sector 12000, which is part of sde1 and holds the q parity information for its stripe of the raid6 array, will now fail to read.  Running a new check operation crashes the system at the same BUG_ON seen by the customer.
> >
> > This patch makes the raid6 check_state_check_result handling
> > work more like raid5's.  If there are somehow too many failures
> > during a check, just quit the check operation for the stripe.
> > When the checks pass, don't use check_state_compute_result for
> > a purpose it isn't needed for and is unsafe for.  Just mark the
> > stripe as in sync for passing its parity checks, and let the
> > stripe dev read/write code and the bad blocks list do their
> > job handling I/O errors.
>
> + Dan Williams. But I am not sure whether Dan still remembers this
> work. :)
>
> Dan, please feel free to ignore this. :)
>
> Thanks Nigel and Xiao for these details. I will process the patch.
>
> Regards,
> Song
>
> >
> > Original-Author: David Jeffery <djeffery@xxxxxxxxxx>
> > Tested-by: David Jeffery <djeffery@xxxxxxxxxx>
> > Signed-off-by: David Jeffery <djeffery@xxxxxxxxxx>
> > Signed-off-by: Nigel Croxon <ncroxon@xxxxxxxxxx>
> > ---
> >  drivers/md/raid5.c | 19 ++++---------------
> >  1 file changed, 4 insertions(+), 15 deletions(-)
> >
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index c033bfcb209e..4204a972ecf2 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -4223,26 +4223,15 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
> >         case check_state_check_result:
> >                 sh->check_state = check_state_idle;
> >
> > +               if (s->failed > 1)
> > +                   break;
> >                 /* handle a successful check operation, if parity is correct
> >                  * we are done.  Otherwise update the mismatch count and repair
> >                  * parity if !MD_RECOVERY_CHECK
> >                  */
> >                 if (sh->ops.zero_sum_result == 0) {
> > -                       /* both parities are correct */
> > -                       if (!s->failed)
> > -                               set_bit(STRIPE_INSYNC, &sh->state);
> > -                       else {
> > -                               /* in contrast to the raid5 case we can validate
> > -                                * parity, but still have a failure to write
> > -                                * back
> > -                                */
> > -                               sh->check_state = check_state_compute_result;
> > -                               /* Returning at this point means that we may go
> > -                                * off and bring p and/or q uptodate again so
> > -                                * we make sure to check zero_sum_result again
> > -                                * to verify if p or q need writeback
> > -                                */
> > -                       }
> > +                       /* Any parity checked was correct */
> > +                       set_bit(STRIPE_INSYNC, &sh->state);

The details of why I did it this way have left my brain, but the
changelog does not address the concern raised in the comments about
needing to write back the computed result.

It also seems like a hack to just jump straight to the check-state
being idle without performing any writeback. My concern is that the
stripe is brought in-sync in memory, but not written to the on-disk
stripe.

Can you clarify which BUG_ON is being triggered?
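[Archive editor's note: for readers cross-checking the claim above that sector 12000 on sde1 holds q parity for its stripe, here is a minimal sketch of P/Q disk placement. It assumes md's default "left-symmetric" raid6 layout, where P rotates backwards one member per stripe and Q sits on the member after P; it is an illustration, not taken from this thread or from raid5.c verbatim.]

```python
# Sketch: P/Q member placement for an md raid6 array, assuming the
# default left-symmetric layout (P moves back one disk per stripe,
# Q immediately follows P, wrapping around).
def pq_disks(stripe: int, raid_disks: int) -> tuple[int, int]:
    """Return (p_idx, q_idx) member indices for a given stripe number."""
    p_idx = raid_disks - 1 - (stripe % raid_disks)
    q_idx = (p_idx + 1) % raid_disks
    return p_idx, q_idx

if __name__ == "__main__":
    # With 5 members (as in the sde1..sde6 array above), Q lands on
    # member 0 for stripe 0 and then rotates on each later stripe.
    for s in range(5):
        print(s, pq_disks(s, 5))
```

Under this layout, member 0 (sde1 in the setup above) carries Q for every fifth stripe, so a single unreadable sector there can surface exactly as a q-parity read failure during a check.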
