On 4/3/19 9:18 PM, Dan Williams wrote:
On Thu, Mar 28, 2019 at 9:57 AM Song Liu <liu.song.a23@xxxxxxxxx> wrote:
On Thu, Mar 28, 2019 at 5:16 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
On 3/26/19 7:33 PM, Song Liu wrote:
On Tue, Mar 26, 2019 at 9:08 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
Changing state from check_state_check_result to
check_state_compute_result is not only unsafe but also doesn't
appear to serve a valid purpose. A raid6 check should only be
pushing out extra writes if doing repair and a mismatch occurs.
The stripe dev management will already try to do repair writes
for failing sectors.
Could you please give more details about the failure case seen
before this patch?
Thanks,
Song
Reproduced the issue with a modified scsi_debug module. Redefined OPT_MEDIUM_ERR_ADDR to 12000 to move the I/O error to where I wanted it on the disk. I loaded the modified scsi_debug with the parameter dev_size_mb=250 to make the device bigger and partitioned it to build a test raid6 device from:
Device     Boot   Start     End  Blocks  Id  System
/dev/sde1           2048   94207   46080  83  Linux
/dev/sde2          94208  186367   46080  83  Linux
/dev/sde3         186368  278527   46080  83  Linux
/dev/sde4         278528  511999  116736   5  Extended
/dev/sde5         280576  372735   46080  83  Linux
/dev/sde6         374784  466943   46080  83  Linux
The raid6 device was created with:
mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6
With the array initially running, a check can be triggered and it works without issue. The medium error feature of scsi_debug can then be turned on with:
echo "2" >/sys/module/scsi_debug/parameters/opts
Sector 12000, which is part of sde1 and holds the q parity information for its stripe of the raid6 array, will now fail to read. Running a new check operation crashes the system at the same BUG_ON seen by the customer.
This patch makes the raid6 check_state_check_result handling
work more like raid5's. If there are somehow too many failures
for a check, just quit the check operation for the stripe. When
any checks pass, don't try to use check_state_compute_result for
a purpose it isn't needed for and where it is unsafe. Just mark
the stripe as in sync for passing its parity checks and let the
stripe dev read/write code and the bad blocks list do their job
of handling I/O errors.
+ Dan Williams. But I am not sure whether Dan still remembers this
work. :)
Dan, please feel free to ignore this. :)
Thanks Nigel and Xiao for these details. I will process the patch.
Regards,
Song
OriginalAuthor: David Jeffery <djeffery@xxxxxxxxxx>
Tested-by: David Jeffery <djeffery@xxxxxxxxxx>
Signed-off-by: David Jeffery <djeffery@xxxxxxxxxx>
Signed-off-by: Nigel Croxon <ncroxon@xxxxxxxxxx>
---
drivers/md/raid5.c | 19 ++++---------------
1 file changed, 4 insertions(+), 15 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c033bfcb209e..4204a972ecf2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4223,26 +4223,15 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
         case check_state_check_result:
                 sh->check_state = check_state_idle;
 
+                if (s->failed > 1)
+                        break;
                 /* handle a successful check operation, if parity is correct
                  * we are done. Otherwise update the mismatch count and repair
                  * parity if !MD_RECOVERY_CHECK
                  */
                 if (sh->ops.zero_sum_result == 0) {
-                        /* both parities are correct */
-                        if (!s->failed)
-                                set_bit(STRIPE_INSYNC, &sh->state);
-                        else {
-                                /* in contrast to the raid5 case we can validate
-                                 * parity, but still have a failure to write
-                                 * back
-                                 */
-                                sh->check_state = check_state_compute_result;
-                                /* Returning at this point means that we may go
-                                 * off and bring p and/or q uptodate again so
-                                 * we make sure to check zero_sum_result again
-                                 * to verify if p or q need writeback
-                                 */
-                        }
+                        /* Any parity checked was correct */
+                        set_bit(STRIPE_INSYNC, &sh->state);
The details of why I did it this way have left my brain, but the
changelog does not address the concern raised in the comments about
needing to write back the computed result.
It also seems like a hack to just jump straight to the check-state
being idle without performing any writeback. My concern is that the
stripe is brought in-sync in memory, but not written to the on-disk
stripe.
Can you clarify which BUG_ON is being triggered?
The customer's systems were using 8-disk RAID6 arrays. The crashes
occurred with one failing disk while a scrub was running with the
"check" action. After the parity information for a stripe had been
checked for the scrubbing operation, the uptodate value was expected to
be 7 or 8 (at most 1 disk had failed), but the uptodate value was 6,
which triggered the BUG_ON(s->uptodate < disks - 1).
crash> struct stripe_head_state ffff8aba77affbf0
struct stripe_head_state {
syncing = 1,
...
uptodate = 6,
to_read = 0,
to_write = 0,
failed = 1,
...
For a stripe read into memory with a single failed disk, the failed
counter was 1 as expected, but the uptodate counter didn't consider all
7 healthy disks as uptodate. Looking at the individual disk states for
the stripe, 6 of the 8 devices had an r5dev flags value of 17, or
R5_Insync | R5_UPTODATE. The remaining working disk had a flags value of
only 16, or R5_Insync; the R5_UPTODATE flag was not set.
crash> p ((struct stripe_head *)0xffff8a9c83273280)->dev[5]
$8 = {
...
sector = 0,
flags = 16,
log_checksum = 0
}
This would appear to be why the uptodate count was lower than expected.
The last disk was the failed device, with flags R5_ReadNoMerge |
R5_ReadError | R5_ReWrite.
crash> p ((struct stripe_head *)0xffff8a9c83273280)->dev[6]
$9 = {
...
sector = 0,
flags = 1792,
log_checksum = 0
}
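For reference, here is a minimal stand-alone decoder for the flags values quoted above. The bit positions used (R5_UPTODATE = bit 0, R5_Insync = bit 4, R5_ReadNoMerge/R5_ReadError/R5_ReWrite = bits 8-10) are assumed from the values reported in this thread rather than quoted from the kernel's enum r5dev_flags:

#include <stdio.h>

/* Hypothetical stand-alone decoder for the r5dev flags values seen in the
 * crash dumps above. Bit positions are assumed from the values in this
 * thread: 17 = R5_Insync | R5_UPTODATE, 1792 = R5_ReadNoMerge |
 * R5_ReadError | R5_ReWrite.
 */
static const struct { unsigned bit; const char *name; } r5dev_bits[] = {
        {  0, "R5_UPTODATE"    },       /* page contains current data */
        {  4, "R5_Insync"      },       /* member was in sync at start */
        {  8, "R5_ReadNoMerge" },
        {  9, "R5_ReadError"   },
        { 10, "R5_ReWrite"     },
};

static void decode_r5dev_flags(unsigned long flags)
{
        printf("flags = %lu:", flags);
        for (size_t i = 0; i < sizeof(r5dev_bits) / sizeof(r5dev_bits[0]); i++)
                if (flags & (1UL << r5dev_bits[i].bit))
                        printf(" %s", r5dev_bits[i].name);
        printf("\n");
}

int main(void)
{
        decode_r5dev_flags(17);   /* the healthy, uptodate members */
        decode_r5dev_flags(16);   /* the "p" device: in sync but not uptodate */
        decode_r5dev_flags(1792); /* the failed "q" device */
        return 0;
}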
These two non-uptodate disks held the parity information for this RAID6
stripe.
crash> struct stripe_head ffff8a9c83273280
struct stripe_head {
...
pd_idx = 5,
qd_idx = 6,
...
The normal data disks of the stripe were as expected and the "q" parity
device had failed, but the "p" parity device was the problem: it was not
marked uptodate even though it was considered healthy.
The "p" parity disk uses the same parity format as the RAID5 code. When
the "p" disk is available but the "q" disk is failed, the RAID6 code
does not use the RAID6-specific function which can check both sets of
parity data. Instead, it uses the RAID5 parity check code which also
has the side effect of destroying the in-memory copy of the parity
information. This would be why the "p" parity disk was no longer marked
uptodate. The parity check destroyed its data.
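As a rough illustration of why that check is destructive, here is a simplified model of the xor "zero-sum" check referred to by the kernel comment quoted below, not the actual async_tx implementation: the data blocks are XORed into the parity buffer in place and the result is tested for zero, so the original copy of P is gone afterwards.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified model of an xor "zero-sum" parity check: XOR every data
 * block into the parity buffer and report whether the result is all
 * zeroes. Side effect: the parity buffer no longer holds the original
 * P data on return, which is why the real code clears R5_UPTODATE on
 * the P device before running the check.
 */
static bool xor_zero_sum_check(uint8_t *parity, uint8_t *const data[],
                               size_t ndata, size_t len)
{
        uint8_t acc = 0;

        for (size_t i = 0; i < ndata; i++)
                for (size_t j = 0; j < len; j++)
                        parity[j] ^= data[i][j];        /* P destroyed in place */

        for (size_t j = 0; j < len; j++)
                acc |= parity[j];

        return acc == 0;        /* true => parity matched the data */
}

int main(void)
{
        uint8_t d0[4] = { 1, 2, 3, 4 };
        uint8_t d1[4] = { 5, 6, 7, 8 };
        uint8_t p[4]  = { 1 ^ 5, 2 ^ 6, 3 ^ 7, 4 ^ 8 };  /* correct parity */
        uint8_t *data[] = { d0, d1 };

        printf("parity ok: %d\n", xor_zero_sum_check(p, data, 2, 4));
        printf("p[0] after check: %u\n", p[0]);  /* 0, original P value lost */
        return 0;
}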
handle_parity_checks6() itself cleared the R5_UPTODATE flag and
decreased the uptodate count to account for running the parity check
that will destroy the parity data:
        case check_state_idle:
                /* start a new check operation if there are < 2 failures */
                if (s->failed == s->q_failed) {
                        /* The only possible failed device holds Q, so it
                         * makes sense to check P (If anything else were failed,
                         * we would have used P to recreate it).
                         */
                        sh->check_state = check_state_run;
                }
...
                if (sh->check_state == check_state_run) {
                        /* async_xor_zero_sum destroys the contents of P */
                        clear_bit(R5_UPTODATE, &sh->dev[pd_idx].flags);
                        s->uptodate--;
                }
When the parity check succeeds, handle_parity_checks6() fails to account
for this difference in behavior. The parity check succeeds while a
failed device is present, and the raid6 code attempts to jump from the
"check_state_check_result" state to the "check_state_compute_result"
state without re-reading or re-computing the destroyed parity data. This
leaves the uptodate count too low for the "check_state_compute_result"
state, triggering the BUG_ON.
        case check_state_check_result:
                sh->check_state = check_state_idle;

                /* handle a successful check operation, if parity is correct
                 * we are done. Otherwise update the mismatch count and repair
                 * parity if !MD_RECOVERY_CHECK
                 */
                if (sh->ops.zero_sum_result == 0) {
                        /* both parities are correct */
                        if (!s->failed)
                                set_bit(STRIPE_INSYNC, &sh->state);
                        else {
                                /* in contrast to the raid5 case we can validate
                                 * parity, but still have a failure to write
                                 * back
                                 */
                                sh->check_state = check_state_compute_result;
                                /* Returning at this point means that we may go
                                 * off and bring p and/or q uptodate again so
                                 * we make sure to check zero_sum_result again
                                 * to verify if p or q need writeback
                                 */
                        }
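Putting the numbers from this report together (a rough walk-through of the accounting, not a quote of the kernel code):

/* 8-member array with one failed member, as in the dumps above:
 *
 *   disks = 8, s->failed = 1             =>  s->uptodate starts at 7
 *   check_state_run clears R5_UPTODATE
 *   on P and does s->uptodate--          =>  s->uptodate is now 6
 *   check_state_compute_result expects
 *   s->uptodate >= disks - 1 = 7         =>  6 < 7, the BUG_ON fires
 */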
To trigger this issue, you need not only a failed disk but also a
scrubbing operation in progress. The customer's systems both crashed
early Sunday morning, when the raid-check script is run by default from
cron.