On Thu, Apr 25, 2024 at 5:10 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
>
>
> On 4/24/24 2:57 AM, Yu Kuai wrote:
> > Hi, Nigel
> >
> > On 2024/04/21 20:30, Nigel Croxon wrote:
> >>
> >> On 4/20/24 2:09 AM, Yu Kuai wrote:
> >>> Hi,
> >>>
> >>> On 2024/04/20 3:49, Nigel Croxon wrote:
> >>>> There is a problem with this commit, it causes a CPU#x soft lockup
> >>>>
> >>>> commit 301867b1c16805aebbc306aafa6ecdc68b73c7e5
> >>>> Author: Li Nan <linan122@xxxxxxxxxx>
> >>>> Date: Mon May 15 21:48:05 2023 +0800
> >>>> md/raid10: check slab-out-of-bounds in md_bitmap_get_counter
> >>>>
> >>>
> >>> Did you find this commit by bisect?
> >>>
> >> Yes, found this issue by bisecting...
> >>
> >>>> Message from syslogd@rhel9 at Apr 19 14:14:55 ...
> >>>> kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 26s!
> >>>> [mdX_resync:6976]
> >>>>
> >>>> dmesg:
> >>>>
> >>>> [ 104.245585] CPU: 7 PID: 3588 Comm: mdX_resync Kdump: loaded Not
> >>>> tainted 6.9.0-rc4-next-20240419 #1
> >>>> [ 104.245588] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> >>>> BIOS 1.16.2-1.fc38 04/01/2014
> >>>> [ 104.245590] RIP: 0010:_raw_spin_unlock_irq+0x13/0x30
> >>>> [ 104.245598] Code: 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90
> >>>> 90 90 90 90 90 90 90 90 0f 1f 44 00 00 c6 07 00 90 90 90 fb 65 ff
> >>>> 0d 95 9f 75 76 <74> 05 c3 cc cc cc cc 0f 1f 44 00 00 c3 cc cc cc cc
> >>>> cc cc cc cc cc
> >>>> [ 104.245601] RSP: 0018:ffffb2d74a81bbf8 EFLAGS: 00000246
> >>>> [ 104.245603] RAX: 0000000000000000 RBX: 0000000001000000 RCX:
> >>>> 000000000000000c
> >>>> [ 104.245604] RDX: 0000000000000000 RSI: 0000000001000000 RDI:
> >>>> ffff926160ccd200
> >>>> [ 104.245606] RBP: ffffb2d74a81bcd0 R08: 0000000000000013 R09:
> >>>> 0000000000000000
> >>>> [ 104.245607] R10: 0000000000000000 R11: ffffb2d74a81bad8 R12:
> >>>> 0000000000000000
> >>>> [ 104.245608] R13: 0000000000000000 R14: ffff926160ccd200 R15:
> >>>> ffff926151019000
> >>>> [ 104.245611] FS: 0000000000000000(0000)
> >>>> GS:ffff9273f9580000(0000) knlGS:0000000000000000
> >>>> [ 104.245613] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>> [ 104.245614] CR2: 00007f23774d2584 CR3: 0000000104098003 CR4:
> >>>> 0000000000370ef0
> >>>> [ 104.245616] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >>>> 0000000000000000
> >>>> [ 104.245617] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> >>>> 0000000000000400
> >>>> [ 104.245618] Call Trace:
> >>>> [ 104.245620] <IRQ>
> >>>> [ 104.245623] ? watchdog_timer_fn+0x1e3/0x260
> >>>> [ 104.245630] ? __pfx_watchdog_timer_fn+0x10/0x10
> >>>> [ 104.245634] ? __hrtimer_run_queues+0x112/0x2a0
> >>>> [ 104.245638] ? hrtimer_interrupt+0xff/0x240
> >>>> [ 104.245640] ? sched_clock+0xc/0x30
> >>>> [ 104.245644] ? __sysvec_apic_timer_interrupt+0x54/0x140
> >>>> [ 104.245649] ? sysvec_apic_timer_interrupt+0x6c/0x90
> >>>> [ 104.245652] </IRQ>
> >>>> [ 104.245653] <TASK>
> >>>> [ 104.245654] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> >>>> [ 104.245659] ? _raw_spin_unlock_irq+0x13/0x30
> >>>> [ 104.245661] md_bitmap_start_sync+0x6b/0xf0
> >
> > Can you give the following patch a test as well? I believe this is
> > the root cause why page > bitmap->pages, dm-raid is using the wrong
> > bitmap size.
> >
> > diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> > index abe88d1e6735..d9c65ef9c9fb 100644
> > --- a/drivers/md/dm-raid.c
> > +++ b/drivers/md/dm-raid.c
> > @@ -4052,7 +4052,8 @@ static int raid_preresume(struct dm_target *ti)
> > mddev->bitmap_info.chunksize !=
> > to_bytes(rs->requested_bitmap_chunk_sectors)))) {
> > int chunksize =
> > to_bytes(rs->requested_bitmap_chunk_sectors) ?:
> > mddev->bitmap_info.chunksize;
> >
> > - r = md_bitmap_resize(mddev->bitmap,
> > mddev->dev_sectors, chunksize, 0);
> > + r = md_bitmap_resize(mddev->bitmap,
> > mddev->resync_max_sectors,
> > + chunksize, 0);
> > if (r)
> > DMERR("Failed to resize bitmap");
> > }
> >
> > Thanks,
> > Kuai
>
> Hello Kuai,
>
> Tested and found no issues. Good to go..
>
> -Nigel

Thanks for the fixes and the tests. For the next step, do we need both
patches or just one of them?

Song
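
For anyone following the "page > bitmap->pages" remark in the quoted patch, here is a minimal sketch of the arithmetic, not the kernel code. It assumes 2048 counters per bitmap page (4 KiB pages, 16-bit counters); the helper names and the sector counts used with them are hypothetical. It only shows that a bitmap sized from a smaller sector count cannot cover a resync that runs over a larger one.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages holding 16-bit counters, so 2048 counters per page. */
#define COUNTERS_PER_PAGE 2048u

/* Integer division rounding up. */
uint64_t div_round_up(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/* Pages needed when the bitmap is sized for `sectors` (hypothetical helper). */
uint64_t bitmap_pages(uint64_t sectors, uint64_t chunk_sectors)
{
	uint64_t chunks = div_round_up(sectors, chunk_sectors);

	return div_round_up(chunks, COUNTERS_PER_PAGE);
}

/* Index of the bitmap page holding the counter for `sector`. */
uint64_t page_of_sector(uint64_t sector, uint64_t chunk_sectors)
{
	return (sector / chunk_sectors) / COUNTERS_PER_PAGE;
}

/* Nonzero if `sector` falls inside a bitmap sized for `sized_for_sectors`. */
int sector_is_covered(uint64_t sector, uint64_t sized_for_sectors,
		      uint64_t chunk_sectors)
{
	return page_of_sector(sector, chunk_sectors) <
	       bitmap_pages(sized_for_sectors, chunk_sectors);
}

/* Self-check with made-up sizes: bitmap sized for 1000000 sectors,
 * resync running up to sector 1999999, 8-sector chunks. */
void self_check(void)
{
	assert(!sector_is_covered(1999999, 1000000, 8));
	assert(sector_is_covered(1999999, 2000000, 8));
}
```

With these made-up numbers, sizing for 1000000 sectors gives 62 pages while sector 1999999 maps to page 122, so the lookup falls outside the bitmap. Sizing for 2000000 sectors gives 123 pages and the same lookup is in range.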