Hi,
On 2024/04/30 19:07, Nigel Croxon wrote:
On 4/25/24 12:52 PM, Song Liu wrote:
On Thu, Apr 25, 2024 at 5:10 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
On 4/24/24 2:57 AM, Yu Kuai wrote:
Hi, Nigel
On 2024/04/21 20:30, Nigel Croxon wrote:
On 4/20/24 2:09 AM, Yu Kuai wrote:
Hi,
On 2024/04/20 3:49, Nigel Croxon wrote:
There is a problem with this commit: it causes a CPU#x soft lockup.
commit 301867b1c16805aebbc306aafa6ecdc68b73c7e5
Author: Li Nan <linan122@xxxxxxxxxx>
Date: Mon May 15 21:48:05 2023 +0800
md/raid10: check slab-out-of-bounds in md_bitmap_get_counter
Did you find this commit by bisecting?
Yes, found this issue by bisecting...
Message from syslogd@rhel9 at Apr 19 14:14:55 ...
kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [mdX_resync:6976]
dmesg:
[ 104.245585] CPU: 7 PID: 3588 Comm: mdX_resync Kdump: loaded Not tainted 6.9.0-rc4-next-20240419 #1
[ 104.245588] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
[ 104.245590] RIP: 0010:_raw_spin_unlock_irq+0x13/0x30
[ 104.245598] Code: 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 c6 07 00 90 90 90 fb 65 ff 0d 95 9f 75 76 <74> 05 c3 cc cc cc cc 0f 1f 44 00 00 c3 cc cc cc cc cc cc cc cc cc
[ 104.245601] RSP: 0018:ffffb2d74a81bbf8 EFLAGS: 00000246
[ 104.245603] RAX: 0000000000000000 RBX: 0000000001000000 RCX: 000000000000000c
[ 104.245604] RDX: 0000000000000000 RSI: 0000000001000000 RDI: ffff926160ccd200
[ 104.245606] RBP: ffffb2d74a81bcd0 R08: 0000000000000013 R09: 0000000000000000
[ 104.245607] R10: 0000000000000000 R11: ffffb2d74a81bad8 R12: 0000000000000000
[ 104.245608] R13: 0000000000000000 R14: ffff926160ccd200 R15: ffff926151019000
[ 104.245611] FS: 0000000000000000(0000) GS:ffff9273f9580000(0000) knlGS:0000000000000000
[ 104.245613] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 104.245614] CR2: 00007f23774d2584 CR3: 0000000104098003 CR4: 0000000000370ef0
[ 104.245616] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 104.245617] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 104.245618] Call Trace:
[ 104.245620] <IRQ>
[ 104.245623] ? watchdog_timer_fn+0x1e3/0x260
[ 104.245630] ? __pfx_watchdog_timer_fn+0x10/0x10
[ 104.245634] ? __hrtimer_run_queues+0x112/0x2a0
[ 104.245638] ? hrtimer_interrupt+0xff/0x240
[ 104.245640] ? sched_clock+0xc/0x30
[ 104.245644] ? __sysvec_apic_timer_interrupt+0x54/0x140
[ 104.245649] ? sysvec_apic_timer_interrupt+0x6c/0x90
[ 104.245652] </IRQ>
[ 104.245653] <TASK>
[ 104.245654] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[ 104.245659] ? _raw_spin_unlock_irq+0x13/0x30
[ 104.245661] md_bitmap_start_sync+0x6b/0xf0
Can you test the following patch as well? I believe this is the root
cause of page > bitmap->pages: dm-raid is using the wrong bitmap size.
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index abe88d1e6735..d9c65ef9c9fb 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -4052,7 +4052,8 @@ static int raid_preresume(struct dm_target *ti)
 	       mddev->bitmap_info.chunksize != to_bytes(rs->requested_bitmap_chunk_sectors)))) {
 		int chunksize = to_bytes(rs->requested_bitmap_chunk_sectors) ?: mddev->bitmap_info.chunksize;
 
-		r = md_bitmap_resize(mddev->bitmap, mddev->dev_sectors, chunksize, 0);
+		r = md_bitmap_resize(mddev->bitmap, mddev->resync_max_sectors,
+				     chunksize, 0);
 		if (r)
 			DMERR("Failed to resize bitmap");
 	}
Thanks,
Kuai
Hello Kuai,
Tested and found no issues. Good to go..
-Nigel
Thanks for the fixes and the tests.
For the next step, do we need both patches or just one of them?
Song
Each patch fixes the problem on its own, without the other.
Sorry, I forgot to reply here; we discussed this on Slack.
For md/raid, we have already applied the first patch to fix the soft
lockup problem. For dm-raid, besides the second patch that fixes the
wrong bitmap size, more changes are still needed, because some fields in
mddev differ between dm-raid10 and dm-raid5, while dm-raid doesn't
distinguish them. I'm working on that, but I'm not that familiar with
dm-raid and I need more time. :)
Thanks,
Kuai
-Nigel
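
To illustrate the "page > bitmap->pages" condition discussed above, here is a
minimal userspace sketch in C. It is not the kernel code: the helper names, the
2048-counters-per-page constant (4 KiB pages with 16-bit counters) and the
example sizes are assumptions chosen for illustration. It only shows how a
bitmap sized from a smaller sector count (such as dev_sectors) can be indexed
past its last page when resync walks a larger range (such as
resync_max_sectors).

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages holding 16-bit counters, i.e. 2048 per page. */
#define COUNTERS_PER_PAGE 2048u

/* Pages needed to hold one counter per chunk for `sectors` sectors. */
static inline uint64_t bitmap_pages(uint64_t sectors, uint64_t chunk_sectors)
{
	assert(chunk_sectors != 0);
	uint64_t chunks = (sectors + chunk_sectors - 1) / chunk_sectors;
	return (chunks + COUNTERS_PER_PAGE - 1) / COUNTERS_PER_PAGE;
}

/* Index of the page that holds the counter for sector `offset`. */
static inline uint64_t bitmap_page_index(uint64_t offset, uint64_t chunk_sectors)
{
	assert(chunk_sectors != 0);
	return (offset / chunk_sectors) / COUNTERS_PER_PAGE;
}

/*
 * Returns 1 if a resync over [0, resync_sectors) would look up a page
 * beyond a bitmap that was sized for bitmap_sectors, 0 otherwise.
 */
static inline int resync_overruns_bitmap(uint64_t bitmap_sectors,
					 uint64_t resync_sectors,
					 uint64_t chunk_sectors)
{
	if (resync_sectors == 0)
		return 0;
	uint64_t pages = bitmap_pages(bitmap_sectors, chunk_sectors);
	uint64_t last = bitmap_page_index(resync_sectors - 1, chunk_sectors);
	return last >= pages;
}
```

With hypothetical values of 2097152 sectors per device, a resync range twice
that size and 8-sector chunks, resync_overruns_bitmap(2097152, 4194304, 8)
reports an overrun, while sizing the bitmap from the resync range,
resync_overruns_bitmap(4194304, 4194304, 8), does not.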