On 4/3/19 9:18 PM, Dan Williams wrote:
On Thu, Mar 28, 2019 at 9:57 AM Song Liu <liu.song.a23@xxxxxxxxx> wrote:
On Thu, Mar 28, 2019 at 5:16 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
On 3/26/19 7:33 PM, Song Liu wrote:
On Tue, Mar 26, 2019 at 9:08 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
Changing state from check_state_check_result to
check_state_compute_result not only is unsafe but also doesn't
appear to serve a valid purpose. A raid6 check should only be
pushing out extra writes if doing repair and a mis-match occurs.
The stripe dev management will already try and do repair writes
for failing sectors.
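(For reference, these are the check states being discussed; a sketch of enum check_states from drivers/md/raid5.h, with the comments paraphrased rather than quoted verbatim:)

        enum check_states {
                check_state_idle = 0,
                check_state_run,          /* xor (P) parity check in flight */
                check_state_run_q,        /* Q parity check in flight */
                check_state_run_pq,       /* combined P+Q check in flight */
                check_state_check_result, /* check done, inspect zero_sum_result */
                check_state_compute_run,  /* parity repair in flight */
                check_state_compute_result, /* repair done, schedule writeback */
        };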
Could you please explain more details about the failed case before
this patch?
Thanks,
Song
I reproduced the issue with a modified scsi_debug module. I redefined OPT_MEDIUM_ERR_ADDR to 12000 to place the I/O error where I wanted it on the disk, loaded the modified scsi_debug with the parameter dev_size_mb=250 to make the device bigger, and partitioned it into the partitions below to build a test raid6 device from:
   Device Boot     Start       End    Blocks  Id  System
/dev/sde1           2048     94207     46080  83  Linux
/dev/sde2          94208    186367     46080  83  Linux
/dev/sde3         186368    278527     46080  83  Linux
/dev/sde4         278528    511999    116736   5  Extended
/dev/sde5         280576    372735     46080  83  Linux
/dev/sde6         374784    466943     46080  83  Linux
The raid6 device was created with:
mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6
After the array is initially built, a check can be triggered and it works without issue. The medium error feature of scsi_debug can then be turned on with:
echo "2" >/sys/module/scsi_debug/parameters/opts
Sector 12000, which is part of sde1 and contains the q parity information for its stripe of the raid6 array, will now fail to read. Running a new check operation crashes the system at the same BUG_ON seen by the customer.
This patch makes the raid6 check_state_check_result handling
work more like raid5's. If there are somehow too many failures
for a check, just quit the check operation for the stripe. When
the checks pass, don't use check_state_compute_result for a
purpose it isn't needed for and is unsafe for. Just mark the
stripe as in sync for passing its parity checks, and let the
stripe dev read/write code and the bad blocks list do their
job of handling I/O errors.
+ Dan Williams. But I am not sure whether Dan still remembers this
work. :)
Dan, please feel free to ignore this. :)
Thanks Nigel and Xiao for these details. I will process the patch.
Regards,
Song
OriginalAuthor: David Jeffery <djeffery@xxxxxxxxxx>
Tested-by: David Jeffery <djeffery@xxxxxxxxxx>
Signed-off-by: David Jeffery <djeffery@xxxxxxxxxx>
Signed-off-by: Nigel Croxon <ncroxon@xxxxxxxxxx>
---
drivers/md/raid5.c | 19 ++++---------------
1 file changed, 4 insertions(+), 15 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c033bfcb209e..4204a972ecf2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4223,26 +4223,15 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
 	case check_state_check_result:
 		sh->check_state = check_state_idle;
 
+		if (s->failed > 1)
+			break;
 		/* handle a successful check operation, if parity is correct
 		 * we are done. Otherwise update the mismatch count and repair
 		 * parity if !MD_RECOVERY_CHECK
 		 */
 		if (sh->ops.zero_sum_result == 0) {
-			/* both parities are correct */
-			if (!s->failed)
-				set_bit(STRIPE_INSYNC, &sh->state);
-			else {
-				/* in contrast to the raid5 case we can validate
-				 * parity, but still have a failure to write
-				 * back
-				 */
-				sh->check_state = check_state_compute_result;
-				/* Returning at this point means that we may go
-				 * off and bring p and/or q uptodate again so
-				 * we make sure to check zero_sum_result again
-				 * to verify if p or q need writeback
-				 */
-			}
+			/* Any parity checked was correct */
+			set_bit(STRIPE_INSYNC, &sh->state);
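(For readability, this is roughly how the check_state_check_result branch reads with the hunk applied; reconstructed from the diff above, with the surrounding context assumed unchanged and the mismatch-handling path elided:)

        case check_state_check_result:
                sh->check_state = check_state_idle;

                if (s->failed > 1)
                        break;
                /* handle a successful check operation, if parity is correct
                 * we are done. Otherwise update the mismatch count and repair
                 * parity if !MD_RECOVERY_CHECK
                 */
                if (sh->ops.zero_sum_result == 0) {
                        /* Any parity checked was correct */
                        set_bit(STRIPE_INSYNC, &sh->state);
                } else {
                        /* ... mismatch accounting / repair path, unchanged ... */
                }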
The details of why I did it this way have left my brain, but the
changelog does not address the concern raised in the comments about
needing to write-back the computed-result.
It also seems like a hack to just jump straight to the check-state
being idle without performing any writeback. My concern is that the
stripe is brought in-sync in memory, but not written to the on-disk
stripe.
Can you clarify which BUG_ON is being triggered?
The BUG_ON(s->uptodate < disks - 1) condition was encountered in a RAID6 array
with 6 disks while one disk had failed.
crash> bt
PID: 1138 TASK: ffff88081cba4f10 CPU: 7 COMMAND: "md63_raid6"
#0 [ffff880810fbb7f0] machine_kexec at ffffffff8105c58b
#1 [ffff880810fbb850] __crash_kexec at ffffffff81106742
#2 [ffff880810fbb920] crash_kexec at ffffffff81106830
#3 [ffff880810fbb938] oops_end at ffffffff816b0aa8
#4 [ffff880810fbb960] die at ffffffff8102e87b
#5 [ffff880810fbb990] do_trap at ffffffff816b01f0
#6 [ffff880810fbb9e0] do_invalid_op at ffffffff8102b174
#7 [ffff880810fbba90] invalid_op at ffffffff816bd1ae
[exception RIP: handle_parity_checks6+783]
RIP: ffffffffc02e5f6f RSP: ffff880810fbbb40 RFLAGS: 00010297
RAX: 000000000000202b RBX: 0000000000000000 RCX: 0000000000000005
RDX: ffff880810fbbbf0 RSI: ffff8807b7b7b070 RDI: ffff88081e1c0028
RBP: ffff880810fbbca8 R8: 0000000000000001 R9: 0000000000000006
R10: 0000000000000004 R11: 0000000000000005 R12: 0000000000000000
R13: ffff88081e1c0000 R14: 0000000000000000 R15: ffff8807b7b7b070
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ffff880810fbbbc0] cfq_add_rq_rb at ffffffff8131ca8c
#9 [ffff880810fbbcb0] handle_active_stripes at ffffffffc02f29dd [raid456]
#10 [ffff880810fbbd60] raid5d at ffffffffc02f3058 [raid456]
#11 [ffff880810fbbe50] md_thread at ffffffff8150ff55
#12 [ffff880810fbbec8] kthread at ffffffff810b252f
#13 [ffff880810fbbf50] ret_from_fork at ffffffff816b8798
crash>
Kernel log buffer shows:
[ 0.000000] Linux version 3.10.0-693.11.6.el7.x86_64
(mockbuild@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) (gcc version 4.8.5 20150623
(Red Hat 4.8.5-16) (GCC) ) #1 SMP Thu Dec 28 14:23:39 EST 2017
[ 0.000000] Command line:
BOOT_IMAGE=/vmlinuz-3.10.0-693.11.6.el7.x86_64
root=/dev/mapper/rhel_jujube_home-root00 ro crashkernel=auto
rd.md.uuid=eb985b7e:2aabb72e:5b7099a1:5e65b912
rd.lvm.lv=rhel_jujube_home/root00 rhgb quiet audit=1 LANG=en_US.UTF-8
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 13.363484] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
[ 13.726701] md/raid:md63: device sdz3 operational as raid disk 5
[ 13.726714] md/raid:md63: device sdy3 operational as raid disk 4
[ 13.726715] md/raid:md63: device sdv3 operational as raid disk 3
[ 13.726717] md/raid:md63: device sdt3 operational as raid disk 1
[ 13.726718] md/raid:md63: device sdu3 operational as raid disk 2
[ 13.726719] md/raid:md63: device sds3 operational as raid disk 0
[ 13.727290] md/raid:md63: raid level 6 active with 6 out of
6 devices, algorithm 2
[ 13.738152] md63: detected capacity change from 0 to
15984257662976
[ 13.762596]
[ 13.762614] md60: detected capacity change from 0 to
47952370335744
[ 13.850857] XFS (sdw2): Mounting V5 Filesystem
[ 13.850859] XFS (sdx1): Mounting V5 Filesystem
RAID device md63 was configured from 6 disks.
[154819.837074] md: md127: data-check done.
[177837.289243] perf: interrupt took too long (3937 > 3930),
lowering kernel.perf_event_max_sample_rate to 50000
[225059.137706] ata4.00: exception Emask 0x0 SAct 0x7fffff SErr
0x0 action 0x0
[225059.137760] ata4.00: irq_stat 0x40000008
[225059.137788] ata4.00: failed command: READ FPDMA QUEUED
[225059.137820] ata4.00: cmd
60/00:00:88:7f:d6/04:00:c2:00:00/40 tag 0 ncq 524288 in
res
51/40:10:78:81:d6/00:02:c2:00:00/40 Emask 0x409 (media error) <F>
[225059.137866] ata4.00: status: { DRDY ERR }
[225059.137878] ata4.00: error: { UNC }
[225059.143346] ata4.00: configured for UDMA/133
[225059.143432] sd 5:0:0:0: [sdz] FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[225059.143444] sd 5:0:0:0: [sdz] Sense Key : Medium Error
[current] [descriptor] <-----------
[225059.143447] sd 5:0:0:0: [sdz] Add. Sense: Unrecovered read
error - auto reallocate failed <--------------
[225059.143449] sd 5:0:0:0: [sdz] CDB: Read(16) 88 00 00 00 00
00 c2 d6 7f 88 00 00 04 00 00 00
[225059.143451] blk_update_request: I/O error, dev sdz, sector
3268837752
[225059.143512] ata4: EH complete
[225074.952235] ata4.00: exception Emask 0x0 SAct 0x7f007fff
SErr 0x0 action 0x0
[225074.952286] ata4.00: irq_stat 0x40000008
[225074.952314] ata4.00: failed command: READ FPDMA QUEUED
[225074.952346] ata4.00: cmd
60/78:c0:88:d7:d6/03:00:c2:00:00/40 tag 24 ncq 454656 in
res
51/40:60:a0:d7:d6/00:03:c2:00:00/40 Emask 0x409 (media error) <F>
[225074.952393] ata4.00: status: { DRDY ERR }
[225074.952406] ata4.00: error: { UNC }
[225074.957723] ata4.00: configured for UDMA/133
[225074.957827] sd 5:0:0:0: [sdz] FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[225074.957831] sd 5:0:0:0: [sdz] Sense Key : Medium Error
[current] [descriptor]
[225074.957834] sd 5:0:0:0: [sdz] Add. Sense: Unrecovered read
error - auto reallocate failed
[225074.957837] sd 5:0:0:0: [sdz] CDB: Read(16) 88 00 00 00 00
00 c2 d6 d7 88 00 00 03 78 00 00
[225074.957839] blk_update_request: I/O error, dev sdz, sector
3268859808
[225074.957910] ata4: EH complete
A Medium Error was encountered, which caused error recovery. Further, we see
that auto reallocation on the disk failed too.
Relevant disk errors :
[229064.244466] md/raid:md63: read error NOT corrected!!
(sector 3376202512 on sdz3).
[229064.244470] md/raid:md63: read error NOT corrected!!
(sector 3376202520 on sdz3).
[229064.244472] md/raid:md63: read error NOT corrected!!
(sector 3376202528 on sdz3).
[229064.244473] md/raid:md63: read error NOT corrected!!
(sector 3376202536 on sdz3).
[229064.244474] md/raid:md63: read error NOT corrected!!
(sector 3376202544 on sdz3).
[229064.244476] md/raid:md63: read error NOT corrected!!
(sector 3376202552 on sdz3).
[229064.244477] md/raid:md63: read error NOT corrected!!
(sector 3376202560 on sdz3).
[229064.244478] md/raid:md63: read error NOT corrected!!
(sector 3376202568 on sdz3).
[229064.244480] md/raid:md63: read error NOT corrected!!
(sector 3376202576 on sdz3).
[229064.244481] md/raid:md63: read error NOT corrected!!
(sector 3376202584 on sdz3).
From the vmcore:
crash> struct stripe_head_state.uptodate,failed ffff880810fbbbf0
uptodate = 4
failed = 1
The value used to set the disks variable:
crash> rd -32 ffff880810fbbb9c
ffff880810fbbb9c: 00000006
uptodate should have been 5 and not 4.
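(Putting those numbers together, a schematic restatement of the assertion from the backtrace using the values recovered above; not a verbatim quote of raid5.c:)

        /*
         * disks       = 6   (the raid6 array was built from 6 devices)
         * s->uptodate = 4   (only 4 stripe buffers were up to date)
         *
         * 4 < 6 - 1, so the assertion fires and the kernel panics.
         */
        BUG_ON(s->uptodate < disks - 1);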