On Thu, Apr 4, 2019 at 5:52 PM Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
>
> Excellent analysis, thank you!
>
> More comments below.
>
> On Thu, Apr 4, 2019 at 4:46 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
> [..]
> > >>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > >>> index c033bfcb209e..4204a972ecf2 100644
> > >>> --- a/drivers/md/raid5.c
> > >>> +++ b/drivers/md/raid5.c
> > >>> @@ -4223,26 +4223,15 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
> > >>>  	case check_state_check_result:
> > >>>  		sh->check_state = check_state_idle;
> > >>>
> > >>> +		if (s->failed > 1)
> > >>> +			break;
> > >>>  		/* handle a successful check operation, if parity is correct
> > >>>  		 * we are done. Otherwise update the mismatch count and repair
> > >>>  		 * parity if !MD_RECOVERY_CHECK
> > >>>  		 */
> > >>>  		if (sh->ops.zero_sum_result == 0) {
> > >>> -			/* both parities are correct */
> > >>> -			if (!s->failed)
> > >>> -				set_bit(STRIPE_INSYNC, &sh->state);
> > >>> -			else {
> > >>> -				/* in contrast to the raid5 case we can validate
> > >>> -				 * parity, but still have a failure to write
> > >>> -				 * back
> > >>> -				 */
> > >>> -				sh->check_state = check_state_compute_result;
> > >>> -				/* Returning at this point means that we may go
> > >>> -				 * off and bring p and/or q uptodate again so
> > >>> -				 * we make sure to check zero_sum_result again
> > >>> -				 * to verify if p or q need writeback
> > >>> -				 */
> > >>> -			}
> > >>> +			/* Any parity checked was correct */
> > >>> +			set_bit(STRIPE_INSYNC, &sh->state);
> > > The details of why I did it this way have left my brain, but the
> > > changelog does not address the concern raised in the comments about
> > > needing to write-back the computed-result.
> > >
> > > It also seems like a hack to just jump straight to the check-state
> > > being idle without performing any writeback. My concern is that the
> > > stripe is brought in-sync in memory, but not written to the on-disk
> > > stripe.
> > >
> > > Can you clarify which BUG_ON is being triggered?
> >
> > The customer's systems were using 8-disk RAID6 arrays. The crashes
> > occurred with one failing disk while scrubbing was running with the
> > "check" action. After the parity information for a stripe was checked
> > during a scrubbing operation, the uptodate value was expected to be 7
> > or 8 (at most one disk had failed), but the uptodate value was 6,
> > which triggered the BUG_ON(s->uptodate < disks - 1).
> >
> > crash> struct stripe_head_state ffff8aba77affbf0
> > struct stripe_head_state {
> >   syncing = 1,
> >   ...
> >   uptodate = 6,
> >   to_read = 0,
> >   to_write = 0,
> >   failed = 1,
> >   ...
> >
> > For a stripe read into memory with a single failed disk, the failed
> > counter was 1 as expected, but the uptodate counter did not count all
> > 7 healthy disks as uptodate. Looking at the individual disk states for
> > the stripe, 6 of the 8 devices had an r5dev flags value of 17, i.e.
> > R5_Insync | R5_UPTODATE. The remaining working disk had a flags value
> > of only 16, i.e. R5_Insync; the R5_UPTODATE flag was not set.
> >
> > crash> p ((struct stripe_head *)0xffff8a9c83273280)->dev[5]
> > $8 = {
> >   ...
> >   sector = 0,
> >   flags = 16,
> >   log_checksum = 0
> > }
> >
> > This would appear to be why the uptodate count was lower than
> > expected. The last disk was the failed device, with flags 1792, i.e.
> > R5_ReadNoMerge | R5_ReadError | R5_ReWrite.
> >
> > crash> p ((struct stripe_head *)0xffff8a9c83273280)->dev[6]
> > $9 = {
> >   ...
> >   sector = 0,
> >   flags = 1792,
> >   log_checksum = 0
> > }
> >
> > These two not-uptodate disks held the parity information for this
> > RAID6 stripe.
> >
> > crash> struct stripe_head ffff8a9c83273280
> > struct stripe_head {
> >   ...
> >   pd_idx = 5,
> >   qd_idx = 6,
> >   ...
> >
> > The normal data disks of the stripe were as expected; the "q" parity
> > device was the failed one, but the "p" parity device was the problem:
> > it was not marked uptodate despite being considered healthy.
> >
> > The "p" parity disk uses the same parity format as the RAID5 code.
> > When the "p" disk is available but the "q" disk has failed, the RAID6
> > code does not use the RAID6-specific function that can check both sets
> > of parity data. Instead, it uses the RAID5 parity check code, which
> > has the side effect of destroying the in-memory copy of the parity
> > information. This is why the "p" parity disk was no longer marked
> > uptodate: the parity check destroyed its data.
> >
> > handle_parity_checks6() cleared the R5_UPTODATE flag and decremented
> > uptodate itself, to account for the parity check destroying the parity
> > data:
> >
> > 	case check_state_idle:
> > 		/* start a new check operation if there are < 2 failures */
> > 		if (s->failed == s->q_failed) {
> > 			/* The only possible failed device holds Q, so it
> > 			 * makes sense to check P (If anything else were failed,
> > 			 * we would have used P to recreate it).
> > 			 */
> > 			sh->check_state = check_state_run;
> > 		}
> > 		...
> > 		if (sh->check_state == check_state_run) {
> > 			/* async_xor_zero_sum destroys the contents of P */
> > 			clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
> > 			s->uptodate--;
> > 		}
> >
> > When the parity check succeeds, handle_parity_checks6() fails to
> > account for this difference in behavior. The parity check succeeds
> > while a device is failed, and the raid6 code attempts to jump from the
> > "check_state_check_result" state to the "check_state_compute_result"
> > state without re-reading or re-computing the destroyed parity data.
> > This leaves the uptodate count too low for the
> > "check_state_compute_result" state, triggering the BUG_ON.
> >
> > 	case check_state_check_result:
> > 		sh->check_state = check_state_idle;
> >
> > 		/* handle a successful check operation, if parity is correct
> > 		 * we are done. Otherwise update the mismatch count and repair
> > 		 * parity if !MD_RECOVERY_CHECK
> > 		 */
> > 		if (sh->ops.zero_sum_result == 0) {
> > 			/* both parities are correct */
> > 			if (!s->failed)
> > 				set_bit(STRIPE_INSYNC, &sh->state);
> > 			else {
> > 				/* in contrast to the raid5 case we can validate
> > 				 * parity, but still have a failure to write
> > 				 * back
> > 				 */
> > 				sh->check_state = check_state_compute_result;
> > 				/* Returning at this point means that we may go
> > 				 * off and bring p and/or q uptodate again so
> > 				 * we make sure to check zero_sum_result again
> > 				 * to verify if p or q need writeback
> > 				 */
> > 			}
> >
> > To trigger this issue, you need not only a failed disk but also a
> > scrubbing operation in progress. The customer's systems both crashed
> > early Sunday morning, when the raid-check script is run by default
> > from cron.
>
> Ok, I follow this, but I come to a different answer on the required
> fix. I think it is simply the following, to preserve the writeback
> action after the parity check, because we need the failed Q slot to be
> written back if we're recovering. P will not be up-to-date because it
> was checked with the good disks, but sh->ops.zero_sum_result will be
> 0, so that will skip the writeback of a !uptodate P value.
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index c033bfcb209e..e2eb59289346 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -4187,7 +4187,6 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
>  		/* now write out any block on a failed drive,
>  		 * or P or Q if they were recomputed
>  		 */
> -		BUG_ON(s->uptodate < disks - 1); /* We don't need Q to recover */

Thanks Dan!
Would it make sense to rework the check as

	BUG_ON(s->uptodate < disks - 2);

Song

>  		if (s->failed == 2) {
>  			dev = &sh->dev[s->failed_num[1]];
>  			s->locked++;