Re: General Protection Fault in md raid10

Hi,

On 2024/04/29 10:18, Colgate Minuette wrote:
On Sunday, April 28, 2024 6:02:30 PM PDT Yu Kuai wrote:
Hi,

On 2024/04/29 3:41, Colgate Minuette wrote:
Hello all,

I am trying to set up an md raid-10 array spanning 8 disks using the
following command:

mdadm --create /dev/md64 --level=10 --layout=o2 -n 8 /dev/sd[efghijkl]1

The raid is created successfully, but the moment that the newly created
raid starts initial sync, a general protection fault is issued. This
fault happens on kernels 6.1.85, 6.6.26, and 6.8.5 using mdadm version
4.3. The raid is then completely unusable. After the fault, if I try to
stop the raid using
mdadm --stop /dev/md64

mdadm hangs indefinitely.

I have tried raid levels 0 and 6, and both work as expected without any
errors on these same 8 drives. I also have a working md raid-10 on the
system already with 4 disks (not related to this 8-disk array).

Other things I have tried include creating/syncing the raid from a Debian
live environment and using the near/far/offset layouts, but every attempt
hit the same protection fault. I also ran a memory test on the computer,
which showed no errors after 10 passes.

Below is the output from the general protection fault. Let me know if
there is anything else I should try, or any log information that would
help diagnose this.

[   10.965542] md64: detected capacity change from 0 to 120021483520
[   10.965593] md: resync of RAID array md64
[   10.999289] general protection fault, probably for non-canonical address 0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
[   11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted 6.1.85-1-MANJARO #1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
[   11.001192] Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS WIFI, BIOS 1618 05/18/2023
[   11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
[   11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1 e1 0c 48 c1 e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f 82 b0 fe ff ff <48> 8b 06 48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c 01 f8 48 8d 79 08
[   11.002045] RSP: 0018:ffffa838124ffd28 EFLAGS: 00010216
[   11.002336] RAX: ffffca0a84195a80 RBX: 0000000000000000 RCX: ffff89be8656a000
[   11.002628] RDX: 0000000000000642 RSI: 000d071e7fff89be RDI: ffff89beb4039df8
[   11.002922] RBP: ffff89bd80000000 R08: ffffa838124ffd74 R09: ffffa838124ffd60
[   11.003217] R10: 00000000000009be R11: 0000000000002000 R12: ffff89be8bbff400
[   11.003522] R13: ffff89beb4039a00 R14: ffffca0a80000000 R15: 0000000000001000
[   11.003825] FS:  0000000000000000(0000) GS:ffff89c5b8700000(0000) knlGS:0000000000000000
[   11.004126] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.004429] CR2: 0000563308baac38 CR3: 000000012e900000 CR4: 0000000000750ee0
[   11.004737] PKRU: 55555554
[   11.005040] Call Trace:
[   11.005342]  <TASK>
[   11.005645]  ? __die_body.cold+0x1a/0x1f
[   11.005951]  ? die_addr+0x3c/0x60
[   11.006256]  ? exc_general_protection+0x1c1/0x380
[   11.006562]  ? asm_exc_general_protection+0x26/0x30
[   11.006865]  ? bio_copy_data_iter+0x187/0x260
[   11.007169]  bio_copy_data+0x5c/0x80
[   11.007474]  raid10d+0xcad/0x1c00 [raid10 1721e6c9d579361bf112b0ce400eec9240452da1]

Can you try to use addr2line or gdb to locate which line of code this
corresponds to?

I have never seen a problem like this before... And it would be great if
you could bisect this, since you can reproduce the problem easily.

Thanks,
Kuai


Can you provide guidance on how to do this? I haven't ever debugged kernel
code before. I'm assuming this would be in the raid10.ko module, but I
don't know where to go from there.

For addr2line, you can open raid10.ko in gdb, then:

list *(raid10d+0xcad)

and do the same with gdb on vmlinux:

list *(bio_copy_data_iter+0x187)
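
For reference, a rough sketch of the whole session (the module path below
is only an assumption for Manjaro, where modules are usually zstd-compressed;
this also only works if the module and vmlinux were built with debug info):

cp /usr/lib/modules/$(uname -r)/kernel/drivers/md/raid10.ko.zst /tmp   # assumed path, adjust for your distro
unzstd /tmp/raid10.ko.zst                                              # skip if the module is not compressed

gdb /tmp/raid10.ko
(gdb) list *(raid10d+0xcad)

gdb vmlinux        # a vmlinux with debug info matching the running kernel
(gdb) list *(bio_copy_data_iter+0x187)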

For git bisect, you must find a good kernel version, then:

git bisect start
git bisect bad v6.1
git bisect good xxx

Then git will show you how many steps are needed and check out a commit
for you; after you compile and test that kernel:

git bisect good/bad

Then git will continue the bisection based on your test results, and at
the end you will get a blamed commit.
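
For reference, the whole sequence looks roughly like this (v6.0 below is
only a placeholder for whichever kernel you have actually verified to be
good):

git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
git bisect start
git bisect bad v6.1      # a version known to hit the fault
git bisect good v6.0     # placeholder: a version you have tested and know works
# build, install and boot the kernel git checked out, try to create/sync
# the array, then report the result:
git bisect good          # if the resync completes without the GPF
git bisect bad           # if the general protection fault reproduces
# repeat build/test/report until git prints the first bad commit, then:
git bisect reset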

Thanks,
Kuai

-Colgate

