On 4/3/19 9:18 PM, Dan Williams wrote:
On Thu, Mar 28, 2019 at 9:57 AM Song Liu <liu.song.a23@xxxxxxxxx> wrote:
On Thu, Mar 28, 2019 at 5:16 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
On 3/26/19 7:33 PM, Song Liu wrote:
On Tue, Mar 26, 2019 at 9:08 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
Changing state from check_state_check_result to
check_state_compute_result not only is unsafe but also doesn't
appear to serve a valid purpose. A raid6 check should only be
pushing out extra writes if doing repair and a mis-match occurs.
The stripe dev management will already try and do repair writes
for failing sectors.
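(For reference, these are the check states being discussed; a sketch of enum check_states from drivers/md/raid5.h, with the comments paraphrased rather than quoted verbatim:)

        enum check_states {
                check_state_idle = 0,
                check_state_run,          /* xor (P) parity check in flight */
                check_state_run_q,        /* Q parity check in flight */
                check_state_run_pq,       /* combined P+Q check in flight */
                check_state_check_result, /* check done, inspect zero_sum_result */
                check_state_compute_run,  /* parity repair in flight */
                check_state_compute_result, /* repair done, schedule writeback */
        };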
Could you please explain more details about the failed case before
this patch?
Thanks,
Song
I reproduced the issue with a modified scsi_debug module. I redefined OPT_MEDIUM_ERR_ADDR to 12000 to place the I/O error where I wanted it on the disk, loaded the modified scsi_debug with the parameter dev_size_mb=250 to make the device bigger, and partitioned it into the partitions below to build a test raid6 device from:
   Device Boot     Start       End    Blocks  Id  System
/dev/sde1           2048     94207     46080  83  Linux
/dev/sde2          94208    186367     46080  83  Linux
/dev/sde3         186368    278527     46080  83  Linux
/dev/sde4         278528    511999    116736   5  Extended
/dev/sde5         280576    372735     46080  83  Linux
/dev/sde6         374784    466943     46080  83  Linux
The raid6 device was created with:
mdadm --create /dev/md127 --level=6 --raid-devices=5 /dev/sde1 /dev/sde2 /dev/sde3 /dev/sde5 /dev/sde6
After the array is initially built, a check can be triggered and it works without issue. The medium error feature of scsi_debug can then be turned on with:
echo "2" >/sys/module/scsi_debug/parameters/opts
Sector 12000, which is part of sde1 and contains the q parity information for its stripe of the raid6 array, will now fail to read. Running a new check operation crashes the system at the same BUG_ON seen by the customer.
This patch makes the raid6 check_state_check_result handling
work more like raid5's. If there are somehow too many failures
for a check, just quit the check operation for the stripe. When
the checks pass, don't use check_state_compute_result for a
purpose it isn't needed for and is unsafe for. Just mark the
stripe as in sync for passing its parity checks, and let the
stripe dev read/write code and the bad blocks list do their
job of handling I/O errors.
+ Dan Williams. But I am not sure whether Dan still remembers this
work. :)
Dan, please feel free to ignore this. :)
Thanks Nigel and Xiao for these details. I will process the patch.
Regards,
Song
OriginalAuthor: David Jeffery <djeffery@xxxxxxxxxx>
Tested-by: David Jeffery <djeffery@xxxxxxxxxx>
Signed-off-by: David Jeffery <djeffery@xxxxxxxxxx>
Signed-off-by: Nigel Croxon <ncroxon@xxxxxxxxxx>
---
drivers/md/raid5.c | 19 ++++---------------
1 file changed, 4 insertions(+), 15 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c033bfcb209e..4204a972ecf2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4223,26 +4223,15 @@ static void handle_parity_checks6(struct r5conf *conf, struct stripe_head *sh,
 	case check_state_check_result:
 		sh->check_state = check_state_idle;
 
+		if (s->failed > 1)
+			break;
 		/* handle a successful check operation, if parity is correct
 		 * we are done. Otherwise update the mismatch count and repair
 		 * parity if !MD_RECOVERY_CHECK
 		 */
 		if (sh->ops.zero_sum_result == 0) {
-			/* both parities are correct */
-			if (!s->failed)
-				set_bit(STRIPE_INSYNC, &sh->state);
-			else {
-				/* in contrast to the raid5 case we can validate
-				 * parity, but still have a failure to write
-				 * back
-				 */
-				sh->check_state = check_state_compute_result;
-				/* Returning at this point means that we may go
-				 * off and bring p and/or q uptodate again so
-				 * we make sure to check zero_sum_result again
-				 * to verify if p or q need writeback
-				 */
-			}
+			/* Any parity checked was correct */
+			set_bit(STRIPE_INSYNC, &sh->state);
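(For readability, this is roughly how the check_state_check_result branch reads with the hunk applied; reconstructed from the diff above, with the surrounding context assumed unchanged and the mismatch-handling path elided:)

        case check_state_check_result:
                sh->check_state = check_state_idle;

                if (s->failed > 1)
                        break;
                /* handle a successful check operation, if parity is correct
                 * we are done. Otherwise update the mismatch count and repair
                 * parity if !MD_RECOVERY_CHECK
                 */
                if (sh->ops.zero_sum_result == 0) {
                        /* Any parity checked was correct */
                        set_bit(STRIPE_INSYNC, &sh->state);
                } else {
                        /* ... mismatch accounting / repair path, unchanged ... */
                }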
The details of why I did it this way have left my brain, but the
changelog does not address the concern raised in the comments about
needing to write-back the computed-result.
It also seems like a hack to just jump straight to the check-state
being idle without performing any writeback. My concern is that the
stripe is brought in-sync in memory, but not written to the on-disk
stripe.
Can you clarify which BUG_ON is being triggered?
The BUG_ON(s->uptodate < disks - 1) condition was encountered in a RAID6 array
with 6 disks while one disk had failed.
crash> bt
PID: 1138 TASK: ffff88081cba4f10 CPU: 7 COMMAND: "md63_raid6"
#0 [ffff880810fbb7f0] machine_kexec at ffffffff8105c58b
#1 [ffff880810fbb850] __crash_kexec at ffffffff81106742
#2 [ffff880810fbb920] crash_kexec at ffffffff81106830
#3 [ffff880810fbb938] oops_end at ffffffff816b0aa8
#4 [ffff880810fbb960] die at ffffffff8102e87b
#5 [ffff880810fbb990] do_trap at ffffffff816b01f0
#6 [ffff880810fbb9e0] do_invalid_op at ffffffff8102b174
#7 [ffff880810fbba90] invalid_op at ffffffff816bd1ae
[exception RIP: handle_parity_checks6+783]
RIP: ffffffffc02e5f6f RSP: ffff880810fbbb40 RFLAGS: 00010297
RAX: 000000000000202b RBX: 0000000000000000 RCX: 0000000000000005
RDX: ffff880810fbbbf0 RSI: ffff8807b7b7b070 RDI: ffff88081e1c0028
RBP: ffff880810fbbca8 R8: 0000000000000001 R9: 0000000000000006
R10: 0000000000000004 R11: 0000000000000005 R12: 0000000000000000
R13: ffff88081e1c0000 R14: 0000000000000000 R15: ffff8807b7b7b070
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ffff880810fbbbc0] cfq_add_rq_rb at ffffffff8131ca8c
#9 [ffff880810fbbcb0] handle_active_stripes at ffffffffc02f29dd [raid456]
#10 [ffff880810fbbd60] raid5d at ffffffffc02f3058 [raid456]
#11 [ffff880810fbbe50] md_thread at ffffffff8150ff55
#12 [ffff880810fbbec8] kthread at ffffffff810b252f
#13 [ffff880810fbbf50] ret_from_fork at ffffffff816b8798
crash>
Kernel log buffer shows:
[ 0.000000] Linux version 3.10.0-693.11.6.el7.x86_64
(mockbuild@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx) (gcc version 4.8.5 20150623
(Red Hat 4.8.5-16) (GCC) ) #1 SMP Thu Dec 28 14:23:39 EST 2017
[ 0.000000] Command line:
BOOT_IMAGE=/vmlinuz-3.10.0-693.11.6.el7.x86_64
root=/dev/mapper/rhel_jujube_home-root00 ro crashkernel=auto
rd.md.uuid=eb985b7e:2aabb72e:5b7099a1:5e65b912
rd.lvm.lv=rhel_jujube_home/root00 rhgb quiet audit=1 LANG=en_US.UTF-8
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 13.363484] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
[ 13.726701] md/raid:md63: device sdz3 operational as raid disk 5
[ 13.726714] md/raid:md63: device sdy3 operational as raid disk 4
[ 13.726715] md/raid:md63: device sdv3 operational as raid disk 3
[ 13.726717] md/raid:md63: device sdt3 operational as raid disk 1
[ 13.726718] md/raid:md63: device sdu3 operational as raid disk 2
[ 13.726719] md/raid:md63: device sds3 operational as raid disk 0
[ 13.727290] md/raid:md63: raid level 6 active with 6 out of
6 devices, algorithm 2
[ 13.738152] md63: detected capacity change from 0 to
15984257662976
[ 13.762596]
[ 13.762614] md60: detected capacity change from 0 to
47952370335744
[ 13.850857] XFS (sdw2): Mounting V5 Filesystem
[ 13.850859] XFS (sdx1): Mounting V5 Filesystem
RAID device md63 was configured from 6 disks.
[154819.837074] md: md127: data-check done.
[177837.289243] perf: interrupt took too long (3937 > 3930),
lowering kernel.perf_event_max_sample_rate to 50000
[225059.137706] ata4.00: exception Emask 0x0 SAct 0x7fffff SErr
0x0 action 0x0
[225059.137760] ata4.00: irq_stat 0x40000008
[225059.137788] ata4.00: failed command: READ FPDMA QUEUED
[225059.137820] ata4.00: cmd
60/00:00:88:7f:d6/04:00:c2:00:00/40 tag 0 ncq 524288 in
res
51/40:10:78:81:d6/00:02:c2:00:00/40 Emask 0x409 (media error) <F>
[225059.137866] ata4.00: status: { DRDY ERR }
[225059.137878] ata4.00: error: { UNC }
[225059.143346] ata4.00: configured for UDMA/133
[225059.143432] sd 5:0:0:0: [sdz] FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[225059.143444] sd 5:0:0:0: [sdz] Sense Key : Medium Error
[current] [descriptor] <-----------
[225059.143447] sd 5:0:0:0: [sdz] Add. Sense: Unrecovered read
error - auto reallocate failed <--------------
[225059.143449] sd 5:0:0:0: [sdz] CDB: Read(16) 88 00 00 00 00
00 c2 d6 7f 88 00 00 04 00 00 00
[225059.143451] blk_update_request: I/O error, dev sdz, sector
3268837752
[225059.143512] ata4: EH complete
[225074.952235] ata4.00: exception Emask 0x0 SAct 0x7f007fff
SErr 0x0 action 0x0
[225074.952286] ata4.00: irq_stat 0x40000008
[225074.952314] ata4.00: failed command: READ FPDMA QUEUED
[225074.952346] ata4.00: cmd
60/78:c0:88:d7:d6/03:00:c2:00:00/40 tag 24 ncq 454656 in
res
51/40:60:a0:d7:d6/00:03:c2:00:00/40 Emask 0x409 (media error) <F>
[225074.952393] ata4.00: status: { DRDY ERR }
[225074.952406] ata4.00: error: { UNC }
[225074.957723] ata4.00: configured for UDMA/133
[225074.957827] sd 5:0:0:0: [sdz] FAILED Result:
hostbyte=DID_OK driverbyte=DRIVER_SENSE
[225074.957831] sd 5:0:0:0: [sdz] Sense Key : Medium Error
[current] [descriptor]
[225074.957834] sd 5:0:0:0: [sdz] Add. Sense: Unrecovered read
error - auto reallocate failed
[225074.957837] sd 5:0:0:0: [sdz] CDB: Read(16) 88 00 00 00 00
00 c2 d6 d7 88 00 00 03 78 00 00
[225074.957839] blk_update_request: I/O error, dev sdz, sector
3268859808
[225074.957910] ata4: EH complete
A Medium Error was encountered, which caused error recovery. Further, we see
that auto reallocation on the disk failed too.
Relevant disk errors :
[229064.244466] md/raid:md63: read error NOT corrected!!
(sector 3376202512 on sdz3).
[229064.244470] md/raid:md63: read error NOT corrected!!
(sector 3376202520 on sdz3).
[229064.244472] md/raid:md63: read error NOT corrected!!
(sector 3376202528 on sdz3).
[229064.244473] md/raid:md63: read error NOT corrected!!
(sector 3376202536 on sdz3).
[229064.244474] md/raid:md63: read error NOT corrected!!
(sector 3376202544 on sdz3).
[229064.244476] md/raid:md63: read error NOT corrected!!
(sector 3376202552 on sdz3).
[229064.244477] md/raid:md63: read error NOT corrected!!
(sector 3376202560 on sdz3).
[229064.244478] md/raid:md63: read error NOT corrected!!
(sector 3376202568 on sdz3).
[229064.244480] md/raid:md63: read error NOT corrected!!
(sector 3376202576 on sdz3).
[229064.244481] md/raid:md63: read error NOT corrected!!
(sector 3376202584 on sdz3).
From the vmcore:
crash> struct stripe_head_state.uptodate,failed ffff880810fbbbf0
uptodate = 4
failed = 1
The value used to set the disks variable:
crash> rd -32 ffff880810fbbb9c
ffff880810fbbb9c: 00000006
uptodate should have been 5 and not 4.
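(Putting those numbers together, a schematic restatement of the assertion from the backtrace using the values recovered above; not a verbatim quote of raid5.c:)

        /*
         * disks       = 6   (the raid6 array was built from 6 devices)
         * s->uptodate = 4   (only 4 stripe buffers were up to date)
         *
         * 4 < 6 - 1, so the assertion fires and the kernel panics.
         */
        BUG_ON(s->uptodate < disks - 1);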