Re: regression: CPU soft lockup with raid10: check slab-out-of-bounds in md_bitmap_get_counter

Song Liu <song@xxxxxxxxxx> · Thu, 25 Apr 2024 09:52:10 -0700

On Thu, Apr 25, 2024 at 5:10 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
>
>
> On 4/24/24 2:57 AM, Yu Kuai wrote:
> > Hi, Nigel
> >
> > On 2024/04/21 20:30, Nigel Croxon wrote:
> >>
> >> On 4/20/24 2:09 AM, Yu Kuai wrote:
> >>> Hi,
> >>>
> >>> On 2024/04/20 3:49, Nigel Croxon wrote:
> >>>> There is a problem with this commit, it causes a CPU#x soft lockup
> >>>>
> >>>> commit 301867b1c16805aebbc306aafa6ecdc68b73c7e5
> >>>> Author: Li Nan <linan122@xxxxxxxxxx>
> >>>> Date:   Mon May 15 21:48:05 2023 +0800
> >>>> md/raid10: check slab-out-of-bounds in md_bitmap_get_counter
> >>>>
> >>>
> >>> Did you find this commit by bisect?
> >>>
> >> Yes, found this issue by bisecting...
> >>
> >>>> Message from syslogd@rhel9 at Apr 19 14:14:55 ...
> >>>>   kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 26s!
> >>>> [mdX_resync:6976]
> >>>>
> >>>> dmesg:
> >>>>
> >>>> [  104.245585] CPU: 7 PID: 3588 Comm: mdX_resync Kdump: loaded Not
> >>>> tainted 6.9.0-rc4-next-20240419 #1
> >>>> [  104.245588] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> >>>> BIOS 1.16.2-1.fc38 04/01/2014
> >>>> [  104.245590] RIP: 0010:_raw_spin_unlock_irq+0x13/0x30
> >>>> [  104.245598] Code: 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90
> >>>> 90 90 90 90 90 90 90 90 0f 1f 44 00 00 c6 07 00 90 90 90 fb 65 ff
> >>>> 0d 95 9f 75 76 <74> 05 c3 cc cc cc cc 0f 1f 44 00 00 c3 cc cc cc cc
> >>>> cc cc cc cc cc
> >>>> [  104.245601] RSP: 0018:ffffb2d74a81bbf8 EFLAGS: 00000246
> >>>> [  104.245603] RAX: 0000000000000000 RBX: 0000000001000000 RCX:
> >>>> 000000000000000c
> >>>> [  104.245604] RDX: 0000000000000000 RSI: 0000000001000000 RDI:
> >>>> ffff926160ccd200
> >>>> [  104.245606] RBP: ffffb2d74a81bcd0 R08: 0000000000000013 R09:
> >>>> 0000000000000000
> >>>> [  104.245607] R10: 0000000000000000 R11: ffffb2d74a81bad8 R12:
> >>>> 0000000000000000
> >>>> [  104.245608] R13: 0000000000000000 R14: ffff926160ccd200 R15:
> >>>> ffff926151019000
> >>>> [  104.245611] FS:  0000000000000000(0000)
> >>>> GS:ffff9273f9580000(0000) knlGS:0000000000000000
> >>>> [  104.245613] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>> [  104.245614] CR2: 00007f23774d2584 CR3: 0000000104098003 CR4:
> >>>> 0000000000370ef0
> >>>> [  104.245616] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> >>>> 0000000000000000
> >>>> [  104.245617] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> >>>> 0000000000000400
> >>>> [  104.245618] Call Trace:
> >>>> [  104.245620]  <IRQ>
> >>>> [  104.245623]  ? watchdog_timer_fn+0x1e3/0x260
> >>>> [  104.245630]  ? __pfx_watchdog_timer_fn+0x10/0x10
> >>>> [  104.245634]  ? __hrtimer_run_queues+0x112/0x2a0
> >>>> [  104.245638]  ? hrtimer_interrupt+0xff/0x240
> >>>> [  104.245640]  ? sched_clock+0xc/0x30
> >>>> [  104.245644]  ? __sysvec_apic_timer_interrupt+0x54/0x140
> >>>> [  104.245649]  ? sysvec_apic_timer_interrupt+0x6c/0x90
> >>>> [  104.245652]  </IRQ>
> >>>> [  104.245653]  <TASK>
> >>>> [  104.245654]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
> >>>> [  104.245659]  ? _raw_spin_unlock_irq+0x13/0x30
> >>>> [  104.245661]  md_bitmap_start_sync+0x6b/0xf0
> >
> > Can you give the following patch a test as well? I believe this is
> > the root cause of page > bitmap->pages: dm-raid is using the wrong
> > bitmap size.
> >
> > diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
> > index abe88d1e6735..d9c65ef9c9fb 100644
> > --- a/drivers/md/dm-raid.c
> > +++ b/drivers/md/dm-raid.c
> > @@ -4052,7 +4052,8 @@ static int raid_preresume(struct dm_target *ti)
> >                mddev->bitmap_info.chunksize !=
> > to_bytes(rs->requested_bitmap_chunk_sectors)))) {
> >                 int chunksize =
> > to_bytes(rs->requested_bitmap_chunk_sectors) ?:
> > mddev->bitmap_info.chunksize;
> >
> > -               r = md_bitmap_resize(mddev->bitmap,
> > mddev->dev_sectors, chunksize, 0);
> > +               r = md_bitmap_resize(mddev->bitmap,
> > mddev->resync_max_sectors,
> > +                                    chunksize, 0);
> >                 if (r)
> >                         DMERR("Failed to resize bitmap");
> >         }
> >
> > Thanks,
> > Kuai
>
> Hello Kuai,
>
> Tested and found no issues. Good to go..
>
> -Nigel

Thanks for the fixes and the tests.

For the next step, do we need both patches or just one of them?

Song