Hi,
On 2024/05/06 20:44, Heinz Mauelshagen wrote:
Hi,
what fields are you referring to?
For this problem, the field is dev_sectors: for raid10 it's the rdev size,
while for raid456 it's the array size. And dm-raid is using it as the
bitmap size.
And while reviewing the related code, I found the following quite strange as well:
mddev->resync_max_sectors = mddev->dev_sectors;
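
To make the size mismatch concrete, below is a rough userspace sketch of the
chunk/page arithmetic; the constants, helper name and sample sizes are made up
for illustration and are not the exact md-bitmap code. The point is just that
if the bitmap is sized from the per-rdev dev_sectors while sync offsets run up
to the array-wide resync_max_sectors, the counter-page index for the last
chunks lands past the allocated pages, i.e. the page > bitmap->pages case from
the report below.

/*
 * Illustrative only: rough bitmap chunk/page arithmetic with made-up
 * constants and sizes, not the exact md-bitmap code. It models a raid10
 * set where dev_sectors is the per-rdev size and resync_max_sectors is
 * the array size.
 */
#include <stdio.h>

#define SECTORS_PER_CHUNK   (64 * 1024 * 2)  /* assume 64MiB bitmap chunks  */
#define COUNTERS_PER_PAGE   2048             /* assume 2048 counters/page   */

static unsigned long pages_for(unsigned long long sectors)
{
        unsigned long long chunks =
                (sectors + SECTORS_PER_CHUNK - 1) / SECTORS_PER_CHUNK;

        return (chunks + COUNTERS_PER_PAGE - 1) / COUNTERS_PER_PAGE;
}

int main(void)
{
        unsigned long long dev_sectors = 1ULL << 31;         /* per-rdev size */
        unsigned long long resync_max_sectors = 1ULL << 33;  /* array size    */

        /* dm-raid sized the bitmap from dev_sectors ... */
        unsigned long pages = pages_for(dev_sectors);

        /*
         * ... but sync offsets run up to resync_max_sectors, so the last
         * chunk maps to a counter page beyond what was allocated.
         */
        unsigned long long last_chunk =
                (resync_max_sectors - 1) / SECTORS_PER_CHUNK;
        unsigned long long page = last_chunk / COUNTERS_PER_PAGE;

        printf("allocated pages: %lu, page index for last sync chunk: %llu\n",
               pages, page);
        return 0;
}

With these made-up sizes the bitmap gets 8 counter pages while the last sync
chunk maps to page index 31.
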
I'm still checking the following fields now, for both md/raid and dm-raid,
with respect to how the sync_thread should work:
dev_sectors
resync_max_sectors
array_sectors
recovery_cp
recovery_offset
reshape_position
Thanks,
Kuai
Thanks,
Heinz
On Mon, May 6, 2024 at 8:19 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
Hi,
On 2024/04/30 19:07, Nigel Croxon wrote:
>
> On 4/25/24 12:52 PM, Song Liu wrote:
>> On Thu, Apr 25, 2024 at 5:10 AM Nigel Croxon <ncroxon@xxxxxxxxxx> wrote:
>>>
>>> On 4/24/24 2:57 AM, Yu Kuai wrote:
>>>> Hi, Nigel
>>>>
>>>>> On 2024/04/21 20:30, Nigel Croxon wrote:
>>>>> On 4/20/24 2:09 AM, Yu Kuai wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 2024/04/20 3:49, Nigel Croxon wrote:
>>>>>>> There is a problem with this commit; it causes a CPU#x soft lockup:
>>>>>>>
>>>>>>> commit 301867b1c16805aebbc306aafa6ecdc68b73c7e5
>>>>>>> Author: Li Nan <linan122@xxxxxxxxxx>
>>>>>>> Date: Mon May 15 21:48:05 2023 +0800
>>>>>>> md/raid10: check slab-out-of-bounds in md_bitmap_get_counter
>>>>>>>
>>>>>> Did you find this commit by bisect?
>>>>>>
>>>>> Yes, found this issue by bisecting...
>>>>>
>>>>>>> Message from syslogd@rhel9 at Apr 19 14:14:55 ...
>>>>>>> kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 26s!
>>>>>>> [mdX_resync:6976]
>>>>>>>
>>>>>>> dmesg:
>>>>>>>
>>>>>>> [ 104.245585] CPU: 7 PID: 3588 Comm: mdX_resync Kdump: loaded Not tainted 6.9.0-rc4-next-20240419 #1
>>>>>>> [ 104.245588] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-1.fc38 04/01/2014
>>>>>>> [ 104.245590] RIP: 0010:_raw_spin_unlock_irq+0x13/0x30
>>>>>>> [ 104.245598] Code: 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 c6 07 00 90 90 90 fb 65 ff 0d 95 9f 75 76 <74> 05 c3 cc cc cc cc 0f 1f 44 00 00 c3 cc cc cc cc cc cc cc cc cc
>>>>>>> [ 104.245601] RSP: 0018:ffffb2d74a81bbf8 EFLAGS: 00000246
>>>>>>> [ 104.245603] RAX: 0000000000000000 RBX: 0000000001000000 RCX: 000000000000000c
>>>>>>> [ 104.245604] RDX: 0000000000000000 RSI: 0000000001000000 RDI: ffff926160ccd200
>>>>>>> [ 104.245606] RBP: ffffb2d74a81bcd0 R08: 0000000000000013 R09: 0000000000000000
>>>>>>> [ 104.245607] R10: 0000000000000000 R11: ffffb2d74a81bad8 R12: 0000000000000000
>>>>>>> [ 104.245608] R13: 0000000000000000 R14: ffff926160ccd200 R15: ffff926151019000
>>>>>>> [ 104.245611] FS: 0000000000000000(0000) GS:ffff9273f9580000(0000) knlGS:0000000000000000
>>>>>>> [ 104.245613] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>>>>> [ 104.245614] CR2: 00007f23774d2584 CR3: 0000000104098003 CR4: 0000000000370ef0
>>>>>>> [ 104.245616] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>>>>> [ 104.245617] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>>>>> [ 104.245618] Call Trace:
>>>>>>> [ 104.245620] <IRQ>
>>>>>>> [ 104.245623] ? watchdog_timer_fn+0x1e3/0x260
>>>>>>> [ 104.245630] ? __pfx_watchdog_timer_fn+0x10/0x10
>>>>>>> [ 104.245634] ? __hrtimer_run_queues+0x112/0x2a0
>>>>>>> [ 104.245638] ? hrtimer_interrupt+0xff/0x240
>>>>>>> [ 104.245640] ? sched_clock+0xc/0x30
>>>>>>> [ 104.245644] ? __sysvec_apic_timer_interrupt+0x54/0x140
>>>>>>> [ 104.245649] ? sysvec_apic_timer_interrupt+0x6c/0x90
>>>>>>> [ 104.245652] </IRQ>
>>>>>>> [ 104.245653] <TASK>
>>>>>>> [ 104.245654] ? asm_sysvec_apic_timer_interrupt+0x16/0x20
>>>>>>> [ 104.245659] ? _raw_spin_unlock_irq+0x13/0x30
>>>>>>> [ 104.245661] md_bitmap_start_sync+0x6b/0xf0
>>>> Can you give the following patch a test as well? I believe this is the
>>>> root cause of page > bitmap->pages: dm-raid is using the wrong bitmap size.
>>>>
>>>> diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
>>>> index abe88d1e6735..d9c65ef9c9fb 100644
>>>> --- a/drivers/md/dm-raid.c
>>>> +++ b/drivers/md/dm-raid.c
>>>> @@ -4052,7 +4052,8 @@ static int raid_preresume(struct dm_target *ti)
>>>>              mddev->bitmap_info.chunksize != to_bytes(rs->requested_bitmap_chunk_sectors)))) {
>>>>                 int chunksize = to_bytes(rs->requested_bitmap_chunk_sectors) ?: mddev->bitmap_info.chunksize;
>>>>
>>>> -               r = md_bitmap_resize(mddev->bitmap, mddev->dev_sectors, chunksize, 0);
>>>> +               r = md_bitmap_resize(mddev->bitmap, mddev->resync_max_sectors,
>>>> +                                    chunksize, 0);
>>>>                 if (r)
>>>>                         DMERR("Failed to resize bitmap");
>>>>                 }
>>>>
>>>> Thanks,
>>>> Kuai
>>> Hello Kuai,
>>>
>>> Tested and found no issues. Good to go..
>>>
>>> -Nigel
>> Thanks for the fixes and the tests.
>>
>> For the next step, do we need both patches or just one of them?
>>
>> Song
>>
> They each fix the problem independently, without needing the other.
Sorry that I forgot to reply here; we discussed this on Slack...
For md/raid, we already applied the first patch to fix the soft lockup
problem. For dm-raid, besides the second patch to fix the wrong bitmap
size, we still need more changes, because some fields in mddev mean
different things for dm-raid10 and dm-raid5, while dm-raid doesn't
distinguish them. I'm working on that; however, I'm not that familiar
with dm-raid and I need more time. :)
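
As a rough illustration of the direction, not the actual change (the helper
name below is made up):

/*
 * Hypothetical sketch: pick the sector range that the sync thread (and
 * therefore the bitmap) has to cover, instead of passing
 * mddev->dev_sectors unconditionally. As far as I understand,
 * resync_max_sectors is what md_do_sync() iterates over for resync, so
 * it is the safer bound here; dev_sectors only happens to match it for
 * some levels.
 */
static sector_t rs_sync_sectors(struct mddev *mddev)
{
        return mddev->resync_max_sectors;
}

raid_preresume() would then pass rs_sync_sectors(mddev) to md_bitmap_resize(),
which is effectively what the patch above already does; the remaining dm-raid
work is auditing the other mddev fields the same way.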
Thanks,
Kuai
>
> -Nigel