On Thu, Apr 4, 2019 at 5:52 PM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> Excellent analysis, thank you!
>
> More comments below.
>
> On Thu, Apr 4, 2019 at 4:46 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
> [..]
> > >>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > >>> index c033bfcb209e..4204a972ecf2 100644
> > >>> --- a/drivers/md/raid5.c
> > >>> +++ b/drivers/md/raid5.c
> > >>> @@ -4223,26 +4223,15 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
> > >>>  	case check_state_check_result:
> > >>>  		sh->check_state = check_state_idle;
> > >>>
> > >>> +		if (s->failed > 1)
> > >>> +			break;
> > >>>  		/* handle a successful check operation, if parity is correct
> > >>>  		 * we are done. Otherwise update the mismatch count and repair
> > >>>  		 * parity if !MD_RECOVERY_CHECK
> > >>>  		 */
> > >>>  		if (sh->ops.zero_sum_result == 0) {
> > >>> -			/* both parities are correct */
> > >>> -			if (!s->failed)
> > >>> -				set_bit(STRIPE_INSYNC, &sh->state);
> > >>> -			else {
> > >>> -				/* in contrast to the raid5 case we can validate
> > >>> -				 * parity, but still have a failure to write
> > >>> -				 * back
> > >>> -				 */
> > >>> -				sh->check_state = check_state_compute_result;
> > >>> -				/* Returning at this point means that we may go
> > >>> -				 * off and bring p and/or q uptodate again so
> > >>> -				 * we make sure to check zero_sum_result again
> > >>> -				 * to verify if p or q need writeback
> > >>> -				 */
> > >>> -			}
> > >>> +			/* Any parity checked was correct */
> > >>> +			set_bit(STRIPE_INSYNC, &sh->state);
> > > The details of why I did it this way have left my brain, but the
> > > changelog does not address the concern raised in the comments about
> > > needing to write-back the computed-result.
> > >
> > > It also seems like a hack to just jump straight to the check-state
> > > being idle without performing any writeback. My concern is that the
> > > stripe is brought in-sync in memory, but not written to the on-disk
> > > stripe.
> > >
> > > Can you clarify which BUG_ON is being triggered?
> >
> > The customer's systems were using 8-disk RAID6 arrays. The crashes
> > occurred with one failing disk while scrubbing was running with the
> > "check" action. After the parity information for a stripe was checked
> > during a scrubbing operation, the uptodate value was expected to be 7
> > or 8 (at most one disk had failed), but the uptodate value was 6,
> > which triggered the BUG_ON(s->uptodate < disks - 1).
> >
> > crash> struct stripe_head_state ffff8aba77affbf0
> > struct stripe_head_state {
> >   syncing = 1,
> >   ...
> >   uptodate = 6,
> >   to_read = 0,
> >   to_write = 0,
> >   failed = 1,
> >   ...
> >
> > For a stripe read into memory with a single failed disk, the failed
> > counter was 1 as expected, but the uptodate counter did not count all
> > 7 healthy disks as uptodate. Looking at the individual disk states for
> > the stripe, 6 of the 8 devices had an r5dev flags value of 17, i.e.
> > R5_Insync | R5_UPTODATE. The remaining working disk had a flags value
> > of only 16, i.e. R5_Insync; the R5_UPTODATE flag was not set.
> >
> > crash> p ((struct stripe_head *)0xffff8a9c83273280)->dev[5]
> > $8 = {
> >   ...
> >   sector = 0,
> >   flags = 16,
> >   log_checksum = 0
> > }
> >
> > This would appear to be why the uptodate count was lower than
> > expected. The last disk was the failed device, with flags 1792, i.e.
> > R5_ReadNoMerge | R5_ReadError | R5_ReWrite.
> >
> > crash> p ((struct stripe_head *)0xffff8a9c83273280)->dev[6]
> > $9 = {
> >   ...
> >   sector = 0,
> >   flags = 1792,
> >   log_checksum = 0
> > }
> >
> > These two not-uptodate disks held the parity information for this
> > RAID6 stripe.
> >
> > crash> struct stripe_head ffff8a9c83273280
> > struct stripe_head {
> >   ...
> >   pd_idx = 5,
> >   qd_idx = 6,
> >   ...
> >
> > The normal data disks of the stripe were as expected; the "q" parity
> > device was the failed one, but the "p" parity device was the problem:
> > it was not marked uptodate despite being considered healthy.
> >
> > The "p" parity disk uses the same parity format as the RAID5 code.
> > When the "p" disk is available but the "q" disk has failed, the RAID6
> > code does not use the RAID6-specific function that can check both sets
> > of parity data. Instead, it uses the RAID5 parity check code, which
> > has the side effect of destroying the in-memory copy of the parity
> > information. This is why the "p" parity disk was no longer marked
> > uptodate: the parity check destroyed its data.
> >
> > handle_parity_checks6() cleared the R5_UPTODATE flag and decremented
> > uptodate itself, to account for the parity check destroying the parity
> > data:
> >
> > 	case check_state_idle:
> > 		/* start a new check operation if there are < 2 failures */
> > 		if (s->failed == s->q_failed) {
> > 			/* The only possible failed device holds Q, so it
> > 			 * makes sense to check P (If anything else were failed,
> > 			 * we would have used P to recreate it).
> > 			 */
> > 			sh->check_state = check_state_run;
> > 		}
> > 		...
> > 		if (sh->check_state == check_state_run) {
> > 			/* async_xor_zero_sum destroys the contents of P */
> > 			clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
> > 			s->uptodate--;
> > 		}
> >
> > When the parity check succeeds, handle_parity_checks6() fails to
> > account for this difference in behavior. The parity check succeeds
> > while a device is failed, and the raid6 code attempts to jump from the
> > "check_state_check_result" state to the "check_state_compute_result"
> > state without re-reading or re-computing the destroyed parity data.
> > This leaves the uptodate count too low for the
> > "check_state_compute_result" state, triggering the BUG_ON.
> >
> > 	case check_state_check_result:
> > 		sh->check_state = check_state_idle;
> >
> > 		/* handle a successful check operation, if parity is correct
> > 		 * we are done. Otherwise update the mismatch count and repair
> > 		 * parity if !MD_RECOVERY_CHECK
> > 		 */
> > 		if (sh->ops.zero_sum_result == 0) {
> > 			/* both parities are correct */
> > 			if (!s->failed)
> > 				set_bit(STRIPE_INSYNC, &sh->state);
> > 			else {
> > 				/* in contrast to the raid5 case we can validate
> > 				 * parity, but still have a failure to write
> > 				 * back
> > 				 */
> > 				sh->check_state = check_state_compute_result;
> > 				/* Returning at this point means that we may go
> > 				 * off and bring p and/or q uptodate again so
> > 				 * we make sure to check zero_sum_result again
> > 				 * to verify if p or q need writeback
> > 				 */
> > 			}
> >
> > To trigger this issue, you need not only a failed disk but also a
> > scrubbing operation in progress. The customer's systems both crashed
> > early Sunday morning, when the raid-check script is run by default
> > from cron.
>
> Ok, I follow this, but I come to a different answer on the required
> fix. I think it is simply the following, to preserve the writeback
> action after the parity check, because we need the failed Q slot to be
> written back if we're recovering. P will not be up-to-date because it
> was checked with the good disks, but sh->ops.zero_sum_result will be
> 0, so that will skip the writeback of a !uptodate P value.
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index c033bfcb209e..e2eb59289346 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -4187,7 +4187,6 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
>  		/* now write out any block on a failed drive,
>  		 * or P or Q if they were recomputed
>  		 */
> -		BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */

Thanks Dan!
Would it make sense to rework the check as

	BUG_ON(s->uptodate < disks - 2);

Song

>  		if (s->failed == 2) {
>  			dev = &sh->dev[s->failed_num[1]];
>  			s->locked++;