Re: [PATCH] Don't jump to compute_result state from check_result state

Excellent analysis, thank you!

More comments below.

On Thu, Apr 4, 2019 at 4:46 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
[..]
> >>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >>> index c033bfcb209e..4204a972ecf2 100644
> >>> --- a/drivers/md/raid5.c
> >>> +++ b/drivers/md/raid5.c
> >>> @@ -4223,26 +4223,15 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
> >>>          case check_state_check_result:
> >>>                  sh->check_state = check_state_idle;
> >>>
> >>> +               if (s->failed > 1)
> >>> +                   break;
> >>>                  /* handle a successful check operation, if parity is correct
> >>>                   * we are done.  Otherwise update the mismatch count and repair
> >>>                   * parity if !MD_RECOVERY_CHECK
> >>>                   */
> >>>                  if (sh->ops.zero_sum_result == 0) {
> >>> -                       /* both parities are correct */
> >>> -                       if (!s->failed)
> >>> -                               set_bit(STRIPE_INSYNC, &sh->state);
> >>> -                       else {
> >>> -                               /* in contrast to the raid5 case we can validate
> >>> -                                * parity, but still have a failure to write
> >>> -                                * back
> >>> -                                */
> >>> -                               sh->check_state = check_state_compute_result;
> >>> -                               /* Returning at this point means that we may go
> >>> -                                * off and bring p and/or q uptodate again so
> >>> -                                * we make sure to check zero_sum_result again
> >>> -                                * to verify if p or q need writeback
> >>> -                                */
> >>> -                       }
> >>> +                       /* Any parity checked was correct */
> >>> +                       set_bit(STRIPE_INSYNC, &sh->state);
> > The details of why I did it this way have left my brain, but the
> > changelog does not address the concern raised in the comments about
> > needing to write-back the computed-result.
> >
> > It also seems like a hack to just jump straight to the check-state
> > being idle without performing any writeback. My concern is that the
> > stripe is brought in-sync in memory, but not written to the on-disk
> > stripe.
> >
> > Can you clarify which BUG_ON is being triggered?
>
>
> The customer's systems were using 8-disk RAID6 arrays.  The crashes
> occurred with one failing disk while a scrub was running with the
> "check" action.  After the parity information for a stripe had been
> checked during the scrub, the uptodate value was expected to be 7 or 8
> (at most one disk had failed), but the value was 6, which triggered
> BUG_ON(s->uptodate < disks - 1) in the check_state_compute_result case.
>
> crash> struct stripe_head_state ffff8aba77affbf0
> struct stripe_head_state {
>    syncing = 1,
> ...
>    uptodate = 6,
>    to_read = 0,
>    to_write = 0,
>    failed = 1,
> ...
>
>
> For a stripe read into memory with a single failed disk, the failed
> counter was 1 as expected, but the uptodate counter did not count all 7
> healthy disks as uptodate.  Looking at the individual device states for
> the stripe, 6 of the 8 members had an r5dev flags value of 17, i.e.
> R5_Insync | R5_UPTODATE.  The remaining working disk had a flags value
> of only 16, i.e. R5_Insync; the R5_UPTODATE flag was not set.
>
> crash> p ((struct stripe_head *)0xffff8a9c83273280)->dev[5]
> $8 = {
> ...
>    sector = 0,
>    flags = 16,
>    log_checksum = 0
> }
>
> This would appear to be why the uptodate count was lower than expected.
> The last disk was the failed device, with flags R5_ReadNoMerge |
> R5_ReadError | R5_ReWrite.
>
> crash> p ((struct stripe_head *)0xffff8a9c83273280)->dev[6]
> $9 = {
> ...
>    sector = 0,
>    flags = 1792,
>    log_checksum = 0
> }
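
As a side note, the flag words above decode cleanly.  A minimal user-space
sketch of that decoding follows; it is my own illustration and assumes the
bit positions of enum r5dev_flags in drivers/md/raid5.h (R5_UPTODATE = 0,
R5_Insync = 4, R5_ReadNoMerge = 8, R5_ReadError = 9, R5_ReWrite = 10), not
kernel code:

/* Illustration only: decode the r5dev flag words from the crash dumps.
 * The bit positions are assumed from enum r5dev_flags in raid5.h.
 */
#include <stdio.h>

static void decode(const char *name, unsigned long flags)
{
	printf("%-12s %4lu:%s%s%s%s%s\n", name, flags,
	       (flags & (1UL << 0))  ? " R5_UPTODATE"    : "",
	       (flags & (1UL << 4))  ? " R5_Insync"      : "",
	       (flags & (1UL << 8))  ? " R5_ReadNoMerge" : "",
	       (flags & (1UL << 9))  ? " R5_ReadError"   : "",
	       (flags & (1UL << 10)) ? " R5_ReWrite"     : "");
}

int main(void)
{
	decode("data disks",   17); /* R5_Insync | R5_UPTODATE           */
	decode("dev[5] (P)",   16); /* R5_Insync only, not uptodate      */
	decode("dev[6] (Q)", 1792); /* ReadNoMerge | ReadError | ReWrite */
	return 0;
}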
>
>
> These two disks that were not uptodate held the parity information for
> this RAID6 stripe.
>
> crash> struct stripe_head ffff8a9c83273280
> struct stripe_head {
> ...
>    pd_idx = 5,
>    qd_idx = 6,
> ...
>
> The normal data disks of the stripe were as expected and the "q" parity
> device had failed, but the "p" parity device was the problem: it was
> considered healthy yet was not marked uptodate.
>
>
> The "p" parity disk uses the same parity format as the RAID5 code.  When
> the "p" disk is available but the "q" disk is failed, the RAID6 code
> does not use the RAID6-specific function which can check both sets of
> parity data.  Instead, it uses the RAID5 parity check code which also
> has the side effect of destroying the in-memory copy of the parity
> information.  This would be why the "p" parity disk was no longer marked
> uptodate.  The parity check destroyed its data.
>
> handle_parity_checks6() itself cleared the R5_UPTODATE flag and
> decreased uptodate, to account for the parity check destroying the
> parity data:
>
>
>          case check_state_idle:
>                  /* start a new check operation if there are < 2 failures */
>                  if (s->failed == s->q_failed) {
>                          /* The only possible failed device holds Q, so it
>                           * makes sense to check P (If anything else were failed,
>                           * we would have used P to recreate it).
>                           */
>                          sh->check_state = check_state_run;
>                  }
> ...
>                  if (sh->check_state == check_state_run) {
>                          /* async_xor_zero_sum destroys the contents of P */
>                          clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
>                          s->uptodate--;
>                  }
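
To make the accounting explicit: a small stand-alone sketch (my own
illustration, using the counts from this report, i.e. an 8-disk array with
only the Q member failed) of how the uptodate counter moves and why the
later BUG_ON(s->uptodate < disks - 1) in the check_state_compute_result
case then fires:

/* Illustration only: walk the uptodate counter through the sequence
 * described above for an 8-disk RAID6 stripe whose only failure is Q.
 */
#include <assert.h>
#include <stdio.h>

int main(void)
{
	int disks = 8;
	int failed = 1;			/* only the Q member is failed      */
	int uptodate = disks - failed;	/* 7: every healthy member read in  */

	/* check_state_idle: the xor-based check will destroy P, so the
	 * R5_UPTODATE flag on the P device is cleared and the counter drops.
	 */
	uptodate--;			/* now 6 */

	/* check_state_check_result then jumps to check_state_compute_result
	 * without re-reading or recomputing P, yet compute_result expects
	 * every member except at most one to be up to date.
	 */
	printf("uptodate=%d, compute_result requires >= %d\n",
	       uptodate, disks - 1);
	assert(!(uptodate < disks - 1));	/* aborts: 6 < 7, like the BUG_ON */
	return 0;
}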
>
>
> When the parity check succeeds, handle_parity_checks6() fails to
> account for this difference in behavior.  Because the check succeeds
> while a device is failed, the raid6 code attempts to jump from the
> "check_state_check_result" state to the "check_state_compute_result"
> state without re-reading or re-computing the destroyed parity data.
> This leaves the uptodate count too low for the
> "check_state_compute_result" state, triggering the BUG_ON.
>
>          case check_state_check_result:
>                  sh->check_state = check_state_idle;
>
>                  /* handle a successful check operation, if parity is correct
>                   * we are done.  Otherwise update the mismatch count and repair
>                   * parity if !MD_RECOVERY_CHECK
>                   */
>                  if (sh->ops.zero_sum_result == 0) {
>                          /* both parities are correct */
>                          if (!s->failed)
>                                  set_bit(STRIPE_INSYNC, &sh->state);
>                          else {
>                                  /* in contrast to the raid5 case we can validate
>                                   * parity, but still have a failure to write
>                                   * back
>                                   */
>                                  sh->check_state = check_state_compute_result;
>                                  /* Returning at this point means that we may go
>                                   * off and bring p and/or q uptodate again so
>                                   * we make sure to check zero_sum_result again
>                                   * to verify if p or q need writeback
>                                   */
>                          }
>
>
> To trigger this issue, you need not only a failed disk but also a
> scrubbing operation in progress.  The customer's systems both crashed
> early Sunday morning, when the raid-check script is run from cron by
> default.

Ok, I follow this, but I come to a different answer on the required
fix. I think it is simply the following, to preserve the writeback
action after the parity check, because we need the failed Q slot to be
written back if we're recovering. P will not be up-to-date because it
was checked with the good disks, but sh->ops.zero_sum_result will be
0, so that will skip the writeback of the !uptodate P value.

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c033bfcb209e..e2eb59289346 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4187,7 +4187,6 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
                /* now write out any block on a failed drive,
                 * or P or Q if they were recomputed
                 */
-               BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */
                if (s->failed == 2) {
                        dev = &sh->dev[s->failed_num[1]];
                        s->locked++;
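
For context, here is a small stand-alone model (my own sketch, not a kernel
excerpt; the SUM_CHECK_* bit values are assumed for illustration) of the
writeback decisions the check_state_compute_result case makes once the
BUG_ON is gone, showing why a zero zero_sum_result keeps the destroyed,
!uptodate P block out of the writeback set while the failed Q slot still
gets scheduled for writeback:

/* Illustration only: model of the compute_result writeback choices for
 * this report's case (one failed device, parity check passed).
 */
#include <stdbool.h>
#include <stdio.h>

#define SUM_CHECK_P_RESULT (1 << 0)	/* assumed encoding, for the sketch */
#define SUM_CHECK_Q_RESULT (1 << 1)

int main(void)
{
	int failed = 1;			/* only the Q member is failed        */
	int zero_sum_result = 0;	/* the check over P found no mismatch */

	/* A failed slot always gets R5_Wantwrite so recovery rewrites it. */
	bool write_failed_slot = (failed >= 1);
	/* P/Q are only rewritten if the zero-sum check flagged them. */
	bool write_p = zero_sum_result & SUM_CHECK_P_RESULT;
	bool write_q = zero_sum_result & SUM_CHECK_Q_RESULT;

	printf("write back failed (Q) slot: %s\n", write_failed_slot ? "yes" : "no");
	printf("write back P:               %s\n", write_p ? "yes" : "no");
	printf("write back Q via sum check: %s\n", write_q ? "yes" : "no");
	return 0;
}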


