Hi,
在 2024/04/20 3:49, Nigel Croxon 写道:
There is a problem with this commit, it causes a CPU#x soft lockup
commit 301867b1c16805aebbc306aafa6ecdc68b73c7e5
Author: Li Nan <linan122@xxxxxxxxxx>
Date: Mon May 15 21:48:05 2023 +0800
md/raid10: check slab-out-of-bounds in md_bitmap_get_counter
Did you found this commit by bisect?
Message from syslogd@rhel9 at Apr 19 14:14:55 ...
kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 26s!
[mdX_resync:6976]
dmesg:
[ 104.245585] CPU: 7 PID: 3588 Comm: mdX_resync Kdump: loaded Not
tainted 6.9.0-rc4-next-20240419 #1
[ 104.245588] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.2-1.fc38 04/01/2014
[ 104.245590] RIP: 0010:_raw_spin_unlock_irq+0x13/0x30
[ 104.245598] Code: 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90
90 90 90 90 90 90 0f 1f 44 00 00 c6 07 00 90 90 90 fb 65 ff 0d 95 9f 75
76 <74> 05 c3 cc cc cc cc 0f 1f 44 00 00 c3 cc cc cc cc cc cc cc cc cc
[ 104.245601] RSP: 0018:ffffb2d74a81bbf8 EFLAGS: 00000246
[ 104.245603] RAX: 0000000000000000 RBX: 0000000001000000 RCX:
000000000000000c
[ 104.245604] RDX: 0000000000000000 RSI: 0000000001000000 RDI:
ffff926160ccd200
[ 104.245606] RBP: ffffb2d74a81bcd0 R08: 0000000000000013 R09:
0000000000000000
[ 104.245607] R10: 0000000000000000 R11: ffffb2d74a81bad8 R12:
0000000000000000
[ 104.245608] R13: 0000000000000000 R14: ffff926160ccd200 R15:
ffff926151019000
[ 104.245611] FS: 0000000000000000(0000) GS:ffff9273f9580000(0000)
knlGS:0000000000000000
[ 104.245613] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 104.245614] CR2: 00007f23774d2584 CR3: 0000000104098003 CR4:
0000000000370ef0
[ 104.245616] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 104.245617] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 104.245618] Call Trace:
[ 104.245620] <IRQ>
[ 104.245623] ? watchdog_timer_fn+0x1e3/0x260
[ 104.245630] ? __pfx_watchdog_timer_fn+0x10/0x10
[ 104.245634] ? __hrtimer_run_queues+0x112/0x2a0
[ 104.245638] ? hrtimer_interrupt+0xff/0x240
[ 104.245640] ? sched_clock+0xc/0x30
[ 104.245644] ? __sysvec_apic_timer_interrupt+0x54/0x140
[ 104.245649] ? sysvec_apic_timer_interrupt+0x6c/0x90
[ 104.245652] </IRQ>
[ 104.245653] <TASK>
[ 104.245654] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 104.245659] ? _raw_spin_unlock_irq+0x13/0x30
[ 104.245661] md_bitmap_start_sync+0x6b/0xf0
Looks like you trigger the condition from md_bitmap_get_counter():
page >= bitmap->pages
by the command lvextend + lvchange --syncaction. And because
md_bitmap_get_counter() return NULL with sync_blocks set to 0,
raid10_sync_request() can't make progress and stuck in a dead loop.
There are two problems here:
1) Looks like lvextend doesn't resize the bitmap, I don't know about
lvextend but this can explain why the condition can be triggered.
2) raid10_sync_request() should handle this case, by:
a) keeping syncing ranges beyond bitmap;
b) skip syncing reanges beyond bitmap;
Following is a patch to fix this problem by 2-b, which is the same
before 301867b1c16805aebbc306aafa6ecdc68b73c7e5. However, 1) still need
to be fixed, otherwise, data beyond bitmap ranges will never sync.
Nigel, can you give this patch a test?
Thanks,
Kuai
diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 9672f75c3050..26e40991369a 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -1424,15 +1424,17 @@ __acquires(bitmap->lock)
sector_t chunk = offset >> bitmap->chunkshift;
unsigned long page = chunk >> PAGE_COUNTER_SHIFT;
unsigned long pageoff = (chunk & PAGE_COUNTER_MASK) <<
COUNTER_BYTE_SHIFT;
- sector_t csize;
+ sector_t csize = ((sector_t)1) << bitmap->chunkshift;
int err;
+
if (page >= bitmap->pages) {
/*
* This can happen if bitmap_start_sync goes beyond
* End-of-device while looking for a whole page or
* user set a huge number to sysfs bitmap_set_bits.
*/
+ *blocks = csize - (offset & (csize - 1));
return NULL;
}
err = md_bitmap_checkpage(bitmap, page, create, 0);
@@ -1441,8 +1443,7 @@ __acquires(bitmap->lock)
bitmap->bp[page].map == NULL)
csize = ((sector_t)1) << (bitmap->chunkshift +
PAGE_COUNTER_SHIFT);
- else
- csize = ((sector_t)1) << bitmap->chunkshift;
+
*blocks = csize - (offset & (csize - 1));
if (err < 0)
[ 104.245668] raid10_sync_request+0x25c/0x1b40 [raid10]
[ 104.245676] ? is_mddev_idle+0x132/0x150
[ 104.245680] md_do_sync+0x64b/0x1020
[ 104.245683] ? __pfx_autoremove_wake_function+0x10/0x10
[ 104.245690] md_thread+0xa7/0x170
[ 104.245693] ? __pfx_md_thread+0x10/0x10
[ 104.245696] kthread+0xcf/0x100
[ 104.245700] ? __pfx_kthread+0x10/0x10
[ 104.245704] ret_from_fork+0x30/0x50
[ 104.245707] ? __pfx_kthread+0x10/0x10
[ 104.245710] ret_from_fork_asm+0x1a/0x30
[ 104.245714] </TASK>
When you run the reproducer script below...
#!/bin/sh
vg=t
lv=t
devs="/dev/sd[c-j]"
sz=3G
isz=2G
path=/dev/$vg/$lv
mnt=/mnt/$lv
vgcreate -y $vg $devs
lvcreate --yes --nosync --type raid10 -i 2 -n $lv -L $sz $vg
mkfs.xfs $path
mkdir -p $mnt
mount $path $mnt
df -h
for i in {1..10}
do
lvextend -y -L +$isz -r $path
lvs
done
lvs -a -o +devices
lvchange --syncaction check $path
#lvs -ovgname,lvname,copypercent t/t <-- this cmd to watch
.