Re: [PATCH] Don't jump to compute_result state from check_result state

On 4/3/19 9:18 PM, Dan Williams wrote:
On Thu, Mar 28, 2019 at 9:57 AM Song Liu <liu.song.a23@xxxxxxxxx> wrote:
On Thu, Mar 28, 2019 at 5:16 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
On 3/26/19 7:33 PM, Song Liu wrote:

On Tue, Mar 26, 2019 at 9:08 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:

Changing state from check_state_check_result to
check_state_compute_result not only is unsafe but also doesn't
appear to serve a valid purpose.  A raid6 check should only be
pushing out extra writes if doing repair and a mis-match occurs.
The stripe dev management will already try and do repair writes
for failing sectors.

Could you please explain more details about the failed case before
this patch?

Thanks,
Song

Reproduced the issue with a modified scsi_debug module.  Redefined OPT_MEDIUM_ERR_ADDR to 12000 to place the I/O error where I wanted it on the disk.  I loaded the modified scsi_debug with the parameter dev_size_mb=250 to make the device bigger, and partitioned it to build a test raid6 device from:


    Device Boot      Start         End      Blocks   Id  System
/dev/sde1            2048       94207       46080   83  Linux
/dev/sde2           94208      186367       46080   83  Linux
/dev/sde3          186368      278527       46080   83  Linux
/dev/sde4          278528      511999      116736    5  Extended
/dev/sde5          280576      372735       46080   83  Linux
/dev/sde6          374784      466943       46080   83  Linux

The raid6 device was created with:

mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6

After the array is initially created and running, a check can be triggered and works without issue.  The medium error feature of scsi_debug can then be turned on with:

echo "2" >/sys/module/scsi_debug/parameters/opts


Sector 12000, which is part of sde1 and holds the q parity information for its stripe of the raid6 array, will now fail to read.  Running a new check operation crashes the system at the same BUG_ON seen by the customer.

This patch makes the raid6 check_state_check_result handling
work more like raid5's.  If there are somehow too many failures
for a check, just quit the check operation for the stripe.  When
the checks pass, don't use check_state_compute_result for a
purpose it isn't needed for and is unsafe for.  Just mark the
stripe as in sync for passing its parity checks and let the
stripe dev read/write code and the bad blocks list do their
job of handling I/O errors.
+ Dan Williams. But I am not sure whether Dan still remembers this
work. :)

Dan, please feel free to ignore this. :)

Thanks Nigel and Xiao for these details. I will process the patch.

Regards,
Song

OriginalAuthor: David Jeffery <djeffery@xxxxxxxxxx>
Tested-by: David Jeffery <djeffery@xxxxxxxxxx>
Signed-off-by: David Jeffery <djeffery@xxxxxxxxxx>
Signed-off-by: Nigel Croxon <ncroxon@xxxxxxxxxx>
---
  drivers/md/raid5.c | 19 ++++---------------
  1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c033bfcb209e..4204a972ecf2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4223,26 +4223,15 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
         case check_state_check_result:
                 sh->check_state = check_state_idle;

+               if (s->failed > 1)
+                   break;
                 /* handle a successful check operation, if parity is correct
                  * we are done.  Otherwise update the mismatch count and repair
                  * parity if !MD_RECOVERY_CHECK
                  */
                 if (sh->ops.zero_sum_result == 0) {
-                       /* both parities are correct */
-                       if (!s->failed)
-                               set_bit(STRIPE_INSYNC, &sh->state);
-                       else {
-                               /* in contrast to the raid5 case we can validate
-                                * parity, but still have a failure to write
-                                * back
-                                */
-                               sh->check_state = check_state_compute_result;
-                               /* Returning at this point means that we may go
-                                * off and bring p and/or q uptodate again so
-                                * we make sure to check zero_sum_result again
-                                * to verify if p or q need writeback
-                                */
-                       }
+                       /* Any parity checked was correct */
+                       set_bit(STRIPE_INSYNC, &sh->state);
The details of why I did it this way have left my brain, but the
changelog does not address the concern raised in the comments about
needing to write-back the computed-result.

It also seems like a hack to just jump straight to the check-state
being idle without performing any writeback. My concern is that the
stripe is brought in-sync in memory, but not written to the on-disk
stripe.

Can you clarify which BUG_ON is being triggered?


The customer's systems were using 8-disk RAID6 arrays.  The crashes occurred with one failing disk while scrubbing was running with the "check" action.  After the parity information for a stripe was checked during the scrub, the uptodate value was expected to be 7 or 8 (at most 1 disk had failed), but the uptodate value was 6, which triggered the BUG_ON(s->uptodate < disks - 1).

crash> struct stripe_head_state ffff8aba77affbf0
struct stripe_head_state {
  syncing = 1,
...
  uptodate = 6,
  to_read = 0,
  to_write = 0,
  failed = 1,
...


For a stripe read into memory with a single failed disk, the failed counter was 1 as expected, but the uptodate counter did not count all 7 healthy disks as uptodate.  Looking at the individual disk states for the stripe, 6 of the 8 devices had an r5dev flags value of 17, or R5_Insync | R5_UPTODATE.  The remaining working disk had a flags value of only 16, or R5_Insync; the R5_UPTODATE flag was not set.

crash> p ((struct stripe_head *)0xffff8a9c83273280)->dev[5]
$8 = {
...
  sector = 0,
  flags = 16,
  log_checksum = 0
}

This would appear to be why the uptodate count was lower than expected.  The last disk was the failed device, with flags R5_ReadNoMerge | R5_ReadError | R5_ReWrite.

crash> p ((struct stripe_head *)0xffff8a9c83273280)->dev[6]
$9 = {
...
  sector = 0,
  flags = 1792,
  log_checksum = 0
}
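
For anyone wanting to decode those flags values, here is a small userspace sketch.  The bit positions are my assumption based on the order of the r5dev flag names in this kernel's drivers/md/raid5.h, so verify them against the exact source tree before relying on this:

#include <stdio.h>

/* Assumed r5dev flag bit positions (check drivers/md/raid5.h for the
 * kernel in question; this ordering is what the crash-dump values above
 * decode against). */
static const char * const r5dev_flag_names[] = {
        "R5_UPTODATE",          /* bit 0 */
        "R5_LOCKED",            /* bit 1 */
        "R5_DOUBLE_LOCKED",     /* bit 2 */
        "R5_OVERWRITE",         /* bit 3 */
        "R5_Insync",            /* bit 4 */
        "R5_Wantread",          /* bit 5 */
        "R5_Wantwrite",         /* bit 6 */
        "R5_Overlap",           /* bit 7 */
        "R5_ReadNoMerge",       /* bit 8 */
        "R5_ReadError",         /* bit 9 */
        "R5_ReWrite",           /* bit 10 */
};

static void decode(unsigned long flags)
{
        unsigned int bit;

        printf("%4lu =", flags);
        for (bit = 0; bit < sizeof(r5dev_flag_names) / sizeof(r5dev_flag_names[0]); bit++)
                if (flags & (1UL << bit))
                        printf(" %s", r5dev_flag_names[bit]);
        printf("\n");
}

int main(void)
{
        decode(17);     /* healthy data disks: R5_UPTODATE | R5_Insync */
        decode(16);     /* the "p" device: R5_Insync only, no R5_UPTODATE */
        decode(1792);   /* the failed "q" device: ReadNoMerge | ReadError | ReWrite */
        return 0;
}

Running it reproduces the interpretation above: 17 is R5_UPTODATE | R5_Insync, 16 is R5_Insync alone, and 1792 is the read-error flags.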


These two disks that were not uptodate held the parity information for this RAID6 stripe.

crash> struct stripe_head ffff8a9c83273280
struct stripe_head {
...
  pd_idx = 5,
  qd_idx = 6,
...

The normal data disks of the stripe were as expected and the "q" parity device had failed, but the "p" parity device was the problem: it was not marked uptodate even though it was considered healthy.


The "p" parity disk uses the same parity format as the RAID5 code.  When the "p" disk is available but the "q" disk is failed, the RAID6 code does not use the RAID6-specific function which can check both sets of parity data.  Instead, it uses the RAID5 parity check code which also has the side effect of destroying the in-memory copy of the parity information.  This would be why the "p" parity disk was no longer marked uptodate.  The parity check destroyed its data.

handle_parity_checks6() itself cleared the R5_UPTODATE flag and decremented uptodate to account for using the parity check that destroys the parity data:


        case check_state_idle:
                /* start a new check operation if there are < 2 failures */
                if (s->failed == s->q_failed) {
                        /* The only possible failed device holds Q, so it
                         * makes sense to check P (If anything else were failed,
                         * we would have used P to recreate it).
                         */
                        sh->check_state = check_state_run;
                }
...
                if (sh->check_state == check_state_run) {
                        /* async_xor_zero_sum destroys the contents of P */
                        clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
                        s->uptodate--;
                }


When the parity check succeeds, handle_parity_checks6() fails to account for this difference in behavior.  The parity check succeeds while there is a failed device, and the raid6 code attempts to jump from the "check_state_check_result" state to the "check_state_compute_result" state without re-reading or re-computing the destroyed parity data.  This leaves the uptodate count too low for the "check_state_compute_result" state, triggering the bug.

        case check_state_check_result:
                sh->check_state = check_state_idle;

                /* handle a successful check operation, if parity is correct
                 * we are done.  Otherwise update the mismatch count and repair
                 * parity if !MD_RECOVERY_CHECK
                 */
                if (sh->ops.zero_sum_result == 0) {
                        /* both parities are correct */
                        if (!s->failed)
                                set_bit(STRIPE_INSYNC, &sh->state);
                        else {
                                /* in contrast to the raid5 case we can validate
                                 * parity, but still have a failure to write
                                 * back
                                 */
                                sh->check_state = check_state_compute_result;
                                /* Returning at this point means that we may go
                                 * off and bring p and/or q uptodate again so
                                 * we make sure to check zero_sum_result again
                                 * to verify if p or q need writeback
                                 */
                        }
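
Putting the pieces together, here is a minimal userspace model of the accounting described in this thread (my sketch, not kernel code), showing why the jump trips the assertion named above:

#include <assert.h>
#include <stdio.h>

int main(void)
{
        int disks = 8;                  /* customer's 8-disk RAID6 */
        int failed = 1;                 /* the q device has a read error */
        int uptodate = disks - failed;  /* 7 healthy, uptodate members */

        /* check_state_idle: only P can be checked, via the RAID5-style xor
         * zero-sum, which destroys the in-memory copy of P, so P is dropped
         * from the uptodate accounting before the check runs. */
        uptodate--;                     /* now 6 */

        /* check_state_check_result with zero_sum_result == 0 and failed == 1:
         * the old code jumped to check_state_compute_result without
         * re-reading or recomputing P, leaving uptodate at 6 ... */

        /* ... but check_state_compute_result expects at most Q to be
         * missing, i.e. BUG_ON(s->uptodate < disks - 1) in the kernel. */
        printf("uptodate=%d, disks-1=%d\n", uptodate, disks - 1);
        assert(uptodate >= disks - 1);  /* fires, like the BUG_ON */

        return 0;
}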


To trigger this issue, you not only need a failed disk but also have to perform a scrubbing operation.  The customer's systems both crashed early on a Sunday morning, when the raid-check script is run by default from cron.
