Excellent analysis, thank you! More comments below.

On Thu, Apr 4, 2019 at 4:46 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
[..]
> >>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> >>> index c033bfcb209e..4204a972ecf2 100644
> >>> --- a/drivers/md/raid5.c
> >>> +++ b/drivers/md/raid5.c
> >>> @@ -4223,26 +4223,15 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
> >>>  	case check_state_check_result:
> >>>  		sh->check_state = check_state_idle;
> >>>
> >>> +		if (s->failed > 1)
> >>> +			break;
> >>>  		/* handle a successful check operation, if parity is correct
> >>>  		 * we are done.  Otherwise update the mismatch count and repair
> >>>  		 * parity if !MD_RECOVERY_CHECK
> >>>  		 */
> >>>  		if (sh->ops.zero_sum_result == 0) {
> >>> -			/* both parities are correct */
> >>> -			if (!s->failed)
> >>> -				set_bit(STRIPE_INSYNC, &sh->state);
> >>> -			else {
> >>> -				/* in contrast to the raid5 case we can validate
> >>> -				 * parity, but still have a failure to write
> >>> -				 * back
> >>> -				 */
> >>> -				sh->check_state = check_state_compute_result;
> >>> -				/* Returning at this point means that we may go
> >>> -				 * off and bring p and/or q uptodate again so
> >>> -				 * we make sure to check zero_sum_result again
> >>> -				 * to verify if p or q need writeback
> >>> -				 */
> >>> -			}
> >>> +			/* Any parity checked was correct */
> >>> +			set_bit(STRIPE_INSYNC, &sh->state);
> >
> > The details of why I did it this way have left my brain, but the
> > changelog does not address the concern raised in the comments about
> > needing to write back the computed result.
> >
> > It also seems like a hack to just jump straight to the check-state
> > being idle without performing any writeback. My concern is that the
> > stripe is brought in-sync in memory, but not written to the on-disk
> > stripe.
> >
> > Can you clarify which BUG_ON is being triggered?
>
> The customer's systems were using 8-disk RAID6 arrays. The crashes
> occurred with one failing disk while scrubbing was running with the
> "check" action. After the parity information for a stripe was checked
> during a scrubbing operation, the uptodate value was expected to be 7
> or 8 (at most 1 disk had failed), but the uptodate value was 6, which
> triggered the BUG_ON(s->uptodate < disks - 1).
>
> crash> struct stripe_head_state ffff8aba77affbf0
> struct stripe_head_state {
>   syncing = 1,
>   ...
>   uptodate = 6,
>   to_read = 0,
>   to_write = 0,
>   failed = 1,
>   ...
>
> For a stripe read into memory with a single failed disk, the failed
> counter was 1 as expected, but the uptodate counter didn't count all 7
> healthy disks as uptodate. Looking at the individual disk states for
> the stripe, 6 of the 8 devices had an r5dev flags value of 17, or
> R5_Insync | R5_UPTODATE. The remaining working disk had a flags value
> of only 16, or R5_Insync; the R5_UPTODATE flag was not set.
>
> crash> p ((struct stripe_head *)0xffff8a9c83273280)->dev[5]
> $8 = {
>   ...
>   sector = 0,
>   flags = 16,
>   log_checksum = 0
> }
>
> This would appear to be why the uptodate count was lower than expected.
> The last disk was the failed device, with flags R5_ReadNoMerge |
> R5_ReadError | R5_ReWrite.
>
> crash> p ((struct stripe_head *)0xffff8a9c83273280)->dev[6]
> $9 = {
>   ...
>   sector = 0,
>   flags = 1792,
>   log_checksum = 0
> }
>
> These two not-uptodate disks held the parity information for this
> RAID6 stripe.
>
> crash> struct stripe_head ffff8a9c83273280
> struct stripe_head {
>   ...
>   pd_idx = 5,
>   qd_idx = 6,
>   ...
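As a cross-check on those flags values: going by the bit positions in
enum r5dev_flags in drivers/md/raid5.h, bit 0 is R5_UPTODATE, bit 4 is
R5_Insync, and bits 8-10 are R5_ReadNoMerge, R5_ReadError, and
R5_ReWrite, so 17 = Insync + uptodate, 16 = Insync only, and 1792 =
the read-error trio. A minimal userspace sketch of that decode (not
kernel code; it lists only the bits relevant to this report):

/* Decode the r5dev flags values seen in the crash dump.  Bit
 * positions follow enum r5dev_flags in drivers/md/raid5.h; only the
 * flags that appear in this report are listed.
 */
#include <stdio.h>

static const struct { int bit; const char *name; } r5dev_bits[] = {
	{  0, "R5_UPTODATE"    },
	{  4, "R5_Insync"      },
	{  8, "R5_ReadNoMerge" },
	{  9, "R5_ReadError"   },
	{ 10, "R5_ReWrite"     },
};

static void decode(unsigned long flags)
{
	unsigned int i;

	printf("%4lu =", flags);
	for (i = 0; i < sizeof(r5dev_bits) / sizeof(r5dev_bits[0]); i++)
		if (flags & (1UL << r5dev_bits[i].bit))
			printf(" %s", r5dev_bits[i].name);
	printf("\n");
}

int main(void)
{
	decode(17);	/* the six healthy data disks              */
	decode(16);	/* the P disk: in-sync but not R5_UPTODATE */
	decode(1792);	/* the failed Q disk                       */
	return 0;
}

That prints R5_Insync | R5_UPTODATE for 17, R5_Insync alone for 16,
and R5_ReadNoMerge | R5_ReadError | R5_ReWrite for 1792, matching the
analysis above.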
>
> The normal data disks of the stripe were as expected; the "q" parity
> device was failed, but the "p" parity device was the problem, since it
> was not marked uptodate while being considered healthy.
>
> The "p" parity disk uses the same parity format as the RAID5 code. When
> the "p" disk is available but the "q" disk is failed, the RAID6 code
> does not use the RAID6-specific function which can check both sets of
> parity data. Instead, it uses the RAID5 parity check code, which also
> has the side effect of destroying the in-memory copy of the parity
> information. This would be why the "p" parity disk was no longer marked
> uptodate: the parity check destroyed its data.
>
> handle_parity_checks6() cleared the R5_UPTODATE flag and decreased
> uptodate itself, to account for its use of the parity check which
> destroys the parity data:
>
>	case check_state_idle:
>		/* start a new check operation if there are < 2 failures */
>		if (s->failed == s->q_failed) {
>			/* The only possible failed device holds Q, so it
>			 * makes sense to check P (If anything else were failed,
>			 * we would have used P to recreate it).
>			 */
>			sh->check_state = check_state_run;
>		}
>		...
>		if (sh->check_state == check_state_run) {
>			/* async_xor_zero_sum destroys the contents of P */
>			clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
>			s->uptodate--;
>		}
>
> When the parity check succeeds, handle_parity_checks6() fails to
> account for this difference in behavior. The parity check succeeds
> while there is a failed device, and the RAID6 code attempts to jump
> from the "check_state_check_result" state to the
> "check_state_compute_result" state without re-reading or re-computing
> the destroyed parity data. This leaves the uptodate count too low for
> the "check_state_compute_result" state, triggering the bug.
>
>	case check_state_check_result:
>		sh->check_state = check_state_idle;
>
>		/* handle a successful check operation, if parity is correct
>		 * we are done.  Otherwise update the mismatch count and repair
>		 * parity if !MD_RECOVERY_CHECK
>		 */
>		if (sh->ops.zero_sum_result == 0) {
>			/* both parities are correct */
>			if (!s->failed)
>				set_bit(STRIPE_INSYNC, &sh->state);
>			else {
>				/* in contrast to the raid5 case we can validate
>				 * parity, but still have a failure to write
>				 * back
>				 */
>				sh->check_state = check_state_compute_result;
>				/* Returning at this point means that we may go
>				 * off and bring p and/or q uptodate again so
>				 * we make sure to check zero_sum_result again
>				 * to verify if p or q need writeback
>				 */
>			}
>
> To trigger this issue, you not only need a failed disk but also need a
> scrubbing operation to be running. The customer's systems both crashed
> early Sunday morning, when the raid-check script is run by default
> from cron.

Ok, I follow this, but I come to a different answer on the required
fix. I think it is simply the following, to preserve the writeback
action after the parity check, because we need the failed Q slot to be
written back if we're recovering. P will not be up-to-date because it
was checked with the good disks, but sh->ops.zero_sum_result will be
0, so that will skip the writeback of a !uptodate P value.
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c033bfcb209e..e2eb59289346 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4187,7 +4187,6 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
 		/* now write out any block on a failed drive,
 		 * or P or Q if they were recomputed
 		 */
-		BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */
 		if (s->failed == 2) {
 			dev = &sh->dev[s->failed_num[1]];
 			s->locked++;
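For completeness, the arithmetic that trips the assertion can be
modeled in a few lines of userspace C. This is a sketch of the
bookkeeping described above, not the kernel code, with the values
taken from the crash reports (8 disks, one failed Q device):

/* Model of the uptodate accounting in handle_parity_checks6() for an
 * 8-disk RAID6 stripe with only the Q device failed.
 */
#include <assert.h>
#include <stdio.h>

int main(void)
{
	int disks = 8;
	int failed = 1;			/* only Q has failed */
	int uptodate = disks - failed;	/* 7 healthy, in-sync members */

	/* check_state_run: async_xor_zero_sum destroys the contents of
	 * P, so the kernel clears R5_UPTODATE on P and drops the count
	 * before running the check.
	 */
	uptodate--;			/* now 6 */

	/* With zero_sum_result == 0 and failed == 1, the code jumps to
	 * check_state_compute_result, whose writeback path asserted at
	 * most one missing block:
	 */
	printf("uptodate=%d, need >= %d\n", uptodate, disks - 1);
	assert(uptodate >= disks - 1);	/* aborts, like the BUG_ON */
	return 0;
}

Dropping the BUG_ON lets the writeback path run for the failed Q slot,
while zero_sum_result == 0 keeps the stale in-memory P from being
written back, which is the behavior argued for above.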