Re: regression: CPU soft lockup with raid10: check slab-out-of-bounds in md_bitmap_get_counter

Nigel Croxon <ncroxon@xxxxxxxxxx> · Sun, 21 Apr 2024 08:30:34 -0400

On 4/20/24 2:09 AM, Yu Kuai wrote:
Hi,

On 2024/04/20 3:49, Nigel Croxon wrote:
There is a problem with this commit: it causes a CPU#x soft lockup.

commit 301867b1c16805aebbc306aafa6ecdc68b73c7e5
Author: Li Nan <linan122@xxxxxxxxxx>
Date:   Mon May 15 21:48:05 2023 +0800
md/raid10: check slab-out-of-bounds in md_bitmap_get_counter


Did you find this commit by bisecting?

Yes, found this issue by bisecting...

Message from syslogd@rhel9 at Apr 19 14:14:55 ...
  kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [mdX_resync:6976]

dmesg:

[  104.245585] CPU: 7 PID: 3588 Comm: mdX_resync Kdump: loaded Not tainted 6.9.0-rc4-next-20240419 #1
[  104.245588] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
[  104.245590] RIP: 0010:_raw_spin_unlock_irq+0x13/0x30
[  104.245598] Code: 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 c6 07 00 90 90 90 fb 65 ff 0d 95 9f 75 76 <74> 05 c3 cc cc cc cc 0f 1f 44 00 00 c3 cc cc cc cc cc cc cc cc cc
[  104.245601] RSP: 0018:ffffb2d74a81bbf8 EFLAGS: 00000246
[  104.245603] RAX: 0000000000000000 RBX: 0000000001000000 RCX: 000000000000000c
[  104.245604] RDX: 0000000000000000 RSI: 0000000001000000 RDI: ffff926160ccd200
[  104.245606] RBP: ffffb2d74a81bcd0 R08: 0000000000000013 R09: 0000000000000000
[  104.245607] R10: 0000000000000000 R11: ffffb2d74a81bad8 R12: 0000000000000000
[  104.245608] R13: 0000000000000000 R14: ffff926160ccd200 R15: ffff926151019000
[  104.245611] FS:  0000000000000000(0000) GS:ffff9273f9580000(0000) knlGS:0000000000000000
[  104.245613] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  104.245614] CR2: 00007f23774d2584 CR3: 0000000104098003 CR4: 0000000000370ef0
[  104.245616] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  104.245617] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  104.245618] Call Trace:
[  104.245620]  <IRQ>
[  104.245623]  ? watchdog_timer_fn+0x1e3/0x260
[  104.245630]  ? __pfx_watchdog_timer_fn+0x10/0x10
[  104.245634]  ? __hrtimer_run_queues+0x112/0x2a0
[  104.245638]  ? hrtimer_interrupt+0xff/0x240
[  104.245640]  ? sched_clock+0xc/0x30
[  104.245644]  ? __sysvec_apic_timer_interrupt+0x54/0x140
[  104.245649]  ? sysvec_apic_timer_interrupt+0x6c/0x90
[  104.245652]  </IRQ>
[  104.245653]  <TASK>
[  104.245654]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[  104.245659]  ? _raw_spin_unlock_irq+0x13/0x30
[  104.245661]  md_bitmap_start_sync+0x6b/0xf0

It looks like you triggered this condition in md_bitmap_get_counter():

page >= bitmap->pages

with the commands lvextend + lvchange --syncaction. Because
md_bitmap_get_counter() returns NULL with sync_blocks set to 0,
raid10_sync_request() can't make progress and gets stuck in an endless loop.

There are two problems here:

1) It looks like lvextend doesn't resize the bitmap. I don't know much about
lvextend, but this would explain why the condition can be triggered.

2) raid10_sync_request() should handle this case, by either:
 a) continuing to sync ranges beyond the bitmap; or
 b) skipping ranges beyond the bitmap.

The following patch fixes the problem with 2-b, which restores the behavior
from before 301867b1c16805aebbc306aafa6ecdc68b73c7e5. However, 1) still needs
to be fixed; otherwise, data beyond the bitmap range will never be synced.

Nigel, can you give this patch a test?

Hello Kuai,

I tested your patch under this failing environment and it works.

-Nigel

Thanks,
Kuai

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 9672f75c3050..26e40991369a 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -1424,15 +1424,17 @@ __acquires(bitmap->lock)
        sector_t chunk = offset >> bitmap->chunkshift;
        unsigned long page = chunk >> PAGE_COUNTER_SHIFT;
        unsigned long pageoff = (chunk & PAGE_COUNTER_MASK) << COUNTER_BYTE_SHIFT;
-       sector_t csize;
+       sector_t csize = ((sector_t)1) << bitmap->chunkshift;
        int err;

+
        if (page >= bitmap->pages) {
                /*
                 * This can happen if bitmap_start_sync goes beyond
                 * End-of-device while looking for a whole page or
                 * user set a huge number to sysfs bitmap_set_bits.
                 */
+               *blocks = csize - (offset & (csize - 1));
                return NULL;
        }
        err = md_bitmap_checkpage(bitmap, page, create, 0);
@@ -1441,8 +1443,7 @@ __acquires(bitmap->lock)
            bitmap->bp[page].map == NULL)
                csize = ((sector_t)1) << (bitmap->chunkshift +
                                          PAGE_COUNTER_SHIFT);
-       else
-               csize = ((sector_t)1) << bitmap->chunkshift;
+
        *blocks = csize - (offset & (csize - 1));

        if (err < 0)

[  104.245668] raid10_sync_request+0x25c/0x1b40 [raid10]
[  104.245676]  ? is_mddev_idle+0x132/0x150
[  104.245680]  md_do_sync+0x64b/0x1020
[  104.245683]  ? __pfx_autoremove_wake_function+0x10/0x10
[  104.245690]  md_thread+0xa7/0x170
[  104.245693]  ? __pfx_md_thread+0x10/0x10
[  104.245696]  kthread+0xcf/0x100
[  104.245700]  ? __pfx_kthread+0x10/0x10
[  104.245704]  ret_from_fork+0x30/0x50
[  104.245707]  ? __pfx_kthread+0x10/0x10
[  104.245710]  ret_from_fork_asm+0x1a/0x30
[  104.245714]  </TASK>

When you run the reproducer script below...

#!/bin/sh
vg=t
lv=t
devs="/dev/sd[c-j]"
sz=3G
isz=2G
path=/dev/$vg/$lv
mnt=/mnt/$lv

vgcreate -y $vg $devs
lvcreate --yes --nosync --type raid10 -i 2 -n $lv -L $sz $vg

mkfs.xfs $path
mkdir -p $mnt
mount $path $mnt
df -h

for i in $(seq 1 10)
do
     lvextend -y -L +$isz -r $path
     lvs
done

lvs -a -o +devices
lvchange --syncaction check $path
#lvs -ovgname,lvname,copypercent t/t         <-- this cmd to watch

