On Sunday, April 28, 2024 11:06:51 PM PDT Yu Kuai wrote:
> Hi,
>
> On 2024/04/29 12:30, Colgate Minuette wrote:
> > On Sunday, April 28, 2024 8:12:01 PM PDT Yu Kuai wrote:
> >> Hi,
> >>
> >> On 2024/04/29 10:18, Colgate Minuette wrote:
> >>> On Sunday, April 28, 2024 6:02:30 PM PDT Yu Kuai wrote:
> >>>> Hi,
> >>>>
> >>>> On 2024/04/29 3:41, Colgate Minuette wrote:
> >>>>> Hello all,
> >>>>>
> >>>>> I am trying to set up an md raid-10 array spanning 8 disks using the
> >>>>> following command
> >>>>>
> >>>>>> mdadm --create /dev/md64 --level=10 --layout=o2 -n 8 /dev/sd[efghijkl]1
> >>>>>
> >>>>> The raid is created successfully, but the moment that the newly
> >>>>> created raid starts initial sync, a general protection fault is
> >>>>> issued. This fault happens on kernels 6.1.85, 6.6.26, and 6.8.5
> >>>>> using mdadm version 4.3. The raid is then completely unusable.
> >>>>> After the fault, if I try to stop the raid using
> >>>>>
> >>>>>> mdadm --stop /dev/md64
> >>>>>
> >>>>> mdadm hangs indefinitely.
> >>>>>
> >>>>> I have tried raid levels 0 and 6, and both work as expected without
> >>>>> any errors on these same 8 drives. I also have a working md raid-10
> >>>>> on the system already with 4 disks (not related to this 8 disk
> >>>>> array).
> >>>>>
> >>>>> Other things I have tried include trying to create/sync the raid
> >>>>> from a debian live environment, and using near/far/offset layouts,
> >>>>> but both methods came back with the same protection fault. Also ran
> >>>>> a memory test on the computer, but did not have any errors after 10
> >>>>> passes.
> >>>>>
> >>>>> Below is the output from the general protection fault. Let me know
> >>>>> of anything else to try or log information that would be helpful to
> >>>>> diagnose.
> >>>>>
> >>>>> [ 10.965542] md64: detected capacity change from 0 to 120021483520
> >>>>> [ 10.965593] md: resync of RAID array md64
> >>>>> [ 10.999289] general protection fault, probably for non-canonical address 0xd071e7fff89be: 0000 [#1] PREEMPT SMP NOPTI
> >>>>> [ 11.000842] CPU: 4 PID: 912 Comm: md64_raid10 Not tainted 6.1.85-1-MANJARO #1 44ae6c380f5656fa036749a28fdade8f34f2f9ce
> >>>>> [ 11.001192] Hardware name: ASUS System Product Name/TUF GAMING X670E-PLUS WIFI, BIOS 1618 05/18/2023
> >>>>> [ 11.001482] RIP: 0010:bio_copy_data_iter+0x187/0x260
> >>>>> [ 11.001756] Code: 29 f1 4c 29 f6 48 c1 f9 06 48 c1 fe 06 48 c1 e1 0c 48 c1 e6 0c 48 01 e9 48 01 ee 48 01 d9 4c 01 d6 83 fa 08 0f 82 b0 fe ff ff <48> 8b 06 48 89 01 89 d0 48 8b 7c 06 f8 48 89 7c 01 f8 48 8d 79 08
> >>>>> [ 11.002045] RSP: 0018:ffffa838124ffd28 EFLAGS: 00010216
> >>>>> [ 11.002336] RAX: ffffca0a84195a80 RBX: 0000000000000000 RCX: ffff89be8656a000
> >>>>> [ 11.002628] RDX: 0000000000000642 RSI: 000d071e7fff89be RDI: ffff89beb4039df8
> >>>>> [ 11.002922] RBP: ffff89bd80000000 R08: ffffa838124ffd74 R09: ffffa838124ffd60
> >>>>> [ 11.003217] R10: 00000000000009be R11: 0000000000002000 R12: ffff89be8bbff400
> >>>>> [ 11.003522] R13: ffff89beb4039a00 R14: ffffca0a80000000 R15: 0000000000001000
> >>>>> [ 11.003825] FS: 0000000000000000(0000) GS:ffff89c5b8700000(0000) knlGS:0000000000000000
> >>>>> [ 11.004126] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>>> [ 11.004429] CR2: 0000563308baac38 CR3: 000000012e900000 CR4: 0000000000750ee0
> >>>>> [ 11.004737] PKRU: 55555554
> >>>>> [ 11.005040] Call Trace:
> >>>>> [ 11.005342] <TASK>
> >>>>> [ 11.005645] ? __die_body.cold+0x1a/0x1f
> >>>>> [ 11.005951] ? die_addr+0x3c/0x60
> >>>>> [ 11.006256] ? exc_general_protection+0x1c1/0x380
> >>>>> [ 11.006562] ? asm_exc_general_protection+0x26/0x30
> >>>>> [ 11.006865] ? bio_copy_data_iter+0x187/0x260
> >>>>> [ 11.007169] bio_copy_data+0x5c/0x80
> >>>>> [ 11.007474] raid10d+0xcad/0x1c00 [raid10 1721e6c9d579361bf112b0ce400eec9240452da1]
> >>>>
> >>>> Can you use addr2line or gdb to locate which line of code this
> >>>> corresponds to?
> >>>>
> >>>> I have never seen a problem like this before... And it would be great
> >>>> if you could bisect this, since you can reproduce the problem easily.
> >>>>
> >>>> Thanks,
> >>>> Kuai
> >>>
> >>> Can you provide guidance on how to do this? I haven't ever debugged
> >>> kernel code before. I'm assuming this would be in the raid10.ko
> >>> module, but don't know where to go from there.
> >>
> >> For addr2line, you can gdb raid10.ko, then:
> >>
> >> list *(raid10d+0xcad)
> >>
> >> and gdb vmlinux:
> >>
> >> list *(bio_copy_data_iter+0x187)
> >>
> >> For git bisect, you must find a good kernel version, then:
> >>
> >> git bisect start
> >> git bisect bad v6.1
> >> git bisect good xxx
> >>
> >> Then git will show you how many steps are needed and choose a commit
> >> for you. After compiling and testing that kernel:
> >>
> >> git bisect good/bad
> >>
> >> Then git will do the bisection based on your test result, and at last
> >> you will get a blamed commit.
> >>
> >> Thanks,
> >> Kuai
> >
> > I don't know of any kernel that is working for this, every setup I've
> > tried has had the same issue.
>
> This is really weird. Is this the first time you have ever used raid10?
> Did you try an older kernel like v5.10 or v4.19?
>

I have been using md raid10 on this system for about 10 years with a
different set of disks with no issues. The other raid10 that is in place
is SATA drives, but I have created and tested a raid10 with different SAS
drives on this system, and had no issues with that test. These Samsung
SSDs are a new addition to the system.

I'll try the raid10 on 4.19.307 and 5.10.211 as well, since those are in
my distro's repos.
-Colgate

> > (gdb) list *(raid10d+0xa52)
> > 0x6692 is in raid10d (drivers/md/raid10.c:2480).
> > 2475 in drivers/md/raid10.c
> >
> > (gdb) list *(bio_copy_data_iter+0x187)
> > 0xffffffff814c3a77 is in bio_copy_data_iter (block/bio.c:1357).
> > 1352 in block/bio.c
>
> Thanks for this, I'll try to take a look at related code.
>
> Kuai
>
> > uname -a
> > Linux debian 6.1.0-18-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.76-1 (2024-02-01) x86_64 GNU/Linux
> >
> > -Colgate
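
For anyone following the gdb steps quoted above: call-trace entries have the form symbol+offset/size, while the gdb expression used in this thread takes only symbol+offset. Below is a minimal sketch of that conversion. The helper name trace_to_gdb is hypothetical and not part of any tool; it only rewrites the string and does not run gdb.

```shell
#!/bin/sh
# Hypothetical helper (not an existing tool): convert a call-trace entry
# such as "raid10d+0xcad/0x1c00" into the gdb command used in this thread.
trace_to_gdb() {
    entry=$1
    # Strip the trailing "/0x..." part, which is the function size and is
    # not part of the address expression.
    sym_off=${entry%%/*}
    printf 'list *(%s)\n' "$sym_off"
}

# The two entries of interest from the trace in this thread.
trace_to_gdb 'raid10d+0xcad/0x1c00'
trace_to_gdb 'bio_copy_data_iter+0x187/0x260'
```

The printed commands can then be typed into a gdb session opened on raid10.ko or vmlinux respectively. This assumes the object files carry debug info and come from the same kernel build that produced the trace, since offsets differ between builds.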