On Sunday, April 28, 2024 8:12:01 PM PDT Yu Kuai wrote:
> Hi,
>
> On 2024/04/29 10:18, Colgate Minuette wrote:
> > On Sunday, April 28, 2024 6:02:30 PM PDT Yu Kuai wrote:
> >> Hi,
> >>
> >> On 2024/04/29 3:41, Colgate Minuette wrote:
> >>> Hello all,
> >>>
> >>> I am trying to set up an md raid-10 array spanning 8 disks using the
> >>> following command:
> >>>
> >>>> mdadm --create /dev/md64 --level=10 --layout=o2 -n 8 /dev/sd[efghijkl]1
> >>>
> >>> The raid is created successfully, but the moment the newly created
> >>> raid starts its initial sync, a general protection fault is issued. This
> >>> fault happens on kernels 6.1.85, 6.6.26, and 6.8.5 using mdadm version
> >>> 4.3. The raid is then completely unusable. After the fault, if I try to
> >>> stop the raid using
> >>>
> >>>> mdadm --stop /dev/md64
> >>>
> >>> mdadm hangs indefinitely.
> >>>
> >>> I have tried raid levels 0 and 6, and both work as expected without any
> >>> errors on these same 8 drives. I also have a working md raid-10 on the
> >>> system already with 4 disks (not related to this 8-disk array).
> >>>
> >>> Other things I have tried include creating/syncing the raid from a
> >>> Debian live environment and using the near/far/offset layouts, but both
> >>> approaches hit the same protection fault. I also ran a memory test on
> >>> the computer, with no errors after 10 passes.
> >>>
> >>> Below is the output from the general protection fault. Let me know of
> >>> anything else to try or log information that would be helpful to
> >>> diagnose.
> >>>
> >>> [   10.965542] md64: detected capacity change from 0 to 120021483520
> >>> [   10.965593] md: resync of RAID array md64
> >>> [   10.999289] general protection fault, probably for non-canonical address 0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
> >>> [   11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted 6.1.85-1-MANJARO #1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
> >>> [   11.001192] Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS WIFI, BIOS 1618 05/18/2023
> >>> [   11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
> >>> [   11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1 e1 0c 48 c1 e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f 82 b0 fe ff ff <48> 8b 06 48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c 01 f8 48 8d 79 08
> >>> [   11.002045] RSP: 0018:ffffa838124ffd28 EFLAGS: 00010216
> >>> [   11.002336] RAX: ffffca0a84195a80 RBX: 0000000000000000 RCX: ffff89be8656a000
> >>> [   11.002628] RDX: 0000000000000642 RSI: 000d071e7fff89be RDI: ffff89beb4039df8
> >>> [   11.002922] RBP: ffff89bd80000000 R08: ffffa838124ffd74 R09: ffffa838124ffd60
> >>> [   11.003217] R10: 00000000000009be R11: 0000000000002000 R12: ffff89be8bbff400
> >>> [   11.003522] R13: ffff89beb4039a00 R14: ffffca0a80000000 R15: 0000000000001000
> >>> [   11.003825] FS:  0000000000000000(0000) GS:ffff89c5b8700000(0000) knlGS:0000000000000000
> >>> [   11.004126] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> [   11.004429] CR2: 0000563308baac38 CR3: 000000012e900000 CR4: 0000000000750ee0
> >>> [   11.004737] PKRU: 55555554
> >>> [   11.005040] Call Trace:
> >>> [   11.005342]  <TASK>
> >>> [   11.005645]  ? __die_body.cold+0x1a/0x1f
> >>> [   11.005951]  ? die_addr+0x3c/0x60
> >>> [   11.006256]  ? exc_general_protection+0x1c1/0x380
> >>> [   11.006562]  ? asm_exc_general_protection+0x26/0x30
> >>> [   11.006865]  ? bio_copy_data_iter+0x187/0x260
> >>> [   11.007169]  bio_copy_data+0x5c/0x80
> >>> [   11.007474]  raid10d+0xcad/0x1c00 [raid10 1721e6c9d579361bf112b0ce400eec9240452da1]
> >>
> >> Can you try to use addr2line or gdb to locate which line of code this
> >> corresponds to?
> >>
> >> I have never seen a problem like this before... And it'll be great if
> >> you can bisect this, since you can reproduce the problem easily.
> >>
> >> Thanks,
> >> Kuai
> >
> > Can you provide guidance on how to do this? I haven't ever debugged kernel
> > code before. I'm assuming this would be in the raid10.ko module, but don't
> > know where to go from there.
>
> For addr2line, you can gdb raid10.ko, then:
>
> list *(raid10d+0xcad)
>
> and gdb vmlinux:
>
> list *(bio_copy_data_iter+0x187)
>
> For git bisect, you must find a good kernel version, then:
>
> git bisect start
> git bisect bad v6.1
> git bisect good xxx
>
> Then git will show you how many steps are needed and check out a commit
> for you; after you compile and test that kernel, run:
>
> git bisect good/bad
>
> Then git will continue the bisection based on your test result, and at
> the end you will get a blamed commit.
>
> Thanks,
> Kuai

I don't know of any kernel that works for this; every setup I've tried has
hit the same issue.

(gdb) list *(raid10d+0xa52)
0x6692 is in raid10d (drivers/md/raid10.c:2480).
2475    in drivers/md/raid10.c
(gdb) list *(bio_copy_data_iter+0x187)
0xffffffff814c3a77 is in bio_copy_data_iter (block/bio.c:1357).
1352    in block/bio.c

uname -a
Linux debian 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux

-Colgate
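
P.S. In case it helps anyone following along, here is a rough sketch of the
bisect-and-rebuild loop described above. It assumes an upstream kernel tree,
a Debian-style /boot/config-* file, and uses v5.15 only as a stand-in for a
known-good version (which I still have to find), so treat the exact versions
and paths as placeholders:

  git bisect start
  git bisect bad v6.1                   # as in Kuai's example; use a version that shows the fault
  git bisect good v5.15                 # placeholder; any version that syncs cleanly
  cp /boot/config-"$(uname -r)" .config
  make olddefconfig                     # reuse the running kernel's config
  make -j"$(nproc)"
  sudo make modules_install install
  # reboot into the freshly built kernel, then retry:
  #   mdadm --create /dev/md64 --level=10 --layout=o2 -n 8 /dev/sd[efghijkl]1
  git bisect good                       # if the resync completes without a fault
  git bisect bad                        # if the general protection fault returns
  # repeat the build/boot/test steps until git reports the first bad commit
  git bisect reset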