Re: Kernel 5.8 and 5.9 fail to boot on C8000

On 10/21/20 5:52 PM, James Bottomley wrote:
> On Tue, 2020-10-20 at 15:45 +0200, Helge Deller wrote:
>> Latest Linux kernels v5.8 and v5.9 fail to boot for me on the C8000
>> machines with this error:
>>  mptspi: probe of 0000:40:01.0 failed with error -12
>>  mptbase: ioc1: ERROR - Insufficient memory to add adapter!
>>  mptspi: probe of 0000:40:01.1 failed with error -12
>
> I think you've already figured out that this is an allocation issue.
> However, it does seem fishy, the code is
>
> 	ioc = kzalloc(sizeof(MPT_ADAPTER), GFP_KERNEL);
> 	if (ioc == NULL) {
> 		printk(KERN_ERR MYNAM ": ERROR - Insufficient memory to add adapter!\n");
> 		return -ENOMEM;
> 	}
>
> And MPT_ADAPTER should be just under a page which looks like a very odd
> allocation to fail so early in boot.  The memory subsystem should have
> also printed out a trace explaining why it failed the allocation.

I think there are a few separate issues here.
First, the allocation failure shown above is from a current git head,
where memory allocation seems to be somewhat broken. For now I would ignore it
until git head stabilizes...

Then, my machine has two U320 drives, one "SEAGATE ST373307LW" and one
"HP 73.4GMAW3073NP". It seems both drives are starting to fail, because
even in the firmware, when running "search for boot devices", they sometimes
fail to be detected.

The good thing about bad drives is that they make it possible to
debug error code paths in the drivers. In my case the last syslog output
looks like this (I'm currently testing with Linus' plain v5.9 kernel):

+[ 1126.041880] ioc0: LSI53C1030 B2: Capabilities={Initiator,Target}
+Begin: Waiting for root file system ...
+[ 1127.069515] scsi host2: error handler thread failed to spawn, error = -4
+[ 1127.069515] mptspi: ioc0: WARNING - Unable to register controller with SCSI subsystem
+<Cpu1> 78000c6201e00000  a0e008c01100b009  CC_PAT_ENCODED_FIELD_WARNING
+<Cpu1> 76000c6801e00000  0000000000000520  CC_PAT_DATA_FIELD_WARNING
<XXX: here is something missing - serial port is often not fast enough....>
+[ 1127.069515] Backtrace:
+[ 1127.069515]  [<000000001045b7cc>] mptspi_probe+0x248/0x3d0 [mptspi]
+[ 1127.069515]  [<0000000040946470>] pci_device_probe+0x1ac/0x2d8
+[ 1127.069515]  [<0000000040add668>] really_probe+0x1bc/0x988
+[ 1127.069515]  [<0000000040ade704>] driver_probe_device+0x160/0x218
+[ 1127.069515]  [<0000000040adee24>] device_driver_attach+0x160/0x188
+[ 1127.069515]  [<0000000040adef90>] __driver_attach+0x144/0x320
+[ 1127.069515]  [<0000000040ad7c78>] bus_for_each_dev+0xd4/0x158
+[ 1127.069515]  [<0000000040adc138>] driver_attach+0x4c/0x80
+[ 1127.069515]  [<0000000040adb3ec>] bus_add_driver+0x3e0/0x498
+[ 1127.069515]  [<0000000040ae0130>] driver_register+0xf4/0x298
+[ 1127.069515]  [<00000000409450c4>] __pci_register_driver+0x78/0xa8
+[ 1127.069515]  [<000000000007d248>] mptspi_init+0x18c/0x1c4 [mptspi]
+[ 1127.069515]  [<0000000040200f18>] do_one_initcall+0x74/0x314
+[ 1127.069515]  [<00000000403528c0>] do_init_module+0xb4/0x640
+[ 1127.069515]  [<0000000040356a24>] load_module+0x3a48/0x493c
+[ 1127.069515]  [<0000000040357d58>] __do_sys_finit_module+0x120/0x1bc
+[ 1127.069515]  [<0000000040357e84>] sys_finit_module+0x30/0xa0
+[ 1127.069515]  [<0000000040210054>] syscall_exit+0x0/0x14
+[ 1127.069515]
+[ 1127.069515] Kernel Fault: Code=26 (Data memory access rights trap) at addr 00000000000007d0
+[ 1127.069515] CPU: 1 PID: 94 Comm: systemd-udevd Tainted: G            E     5.9.0-1-parisc64 #1 Debian 5.9.1-1
+[ 1127.069515] Hardware name: 9000/785/C8000
+[ 1127.069515]
+[ 1127.069515]      YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
+[ 1127.069515] PSW: 00001000000011101111111000001111 Tainted: G            E
+[ 1127.069515] r00-03  000000ff080efe0f 000000413a6a4d60 000000000c1f8be8 000000413a6a4e00
+[ 1127.069515] r04-07  000000000c1f7000 0000004087ce3000 000000007f41e000 0000000000000000
+[ 1127.069515] r08-11  0000004087ce3000 000000001045e500 000000001045e6f8 000000004158ea68
+[ 1127.069515] r12-15  0000000000000002 0000000000000000 000000413a6a44a0 0000000040f92680
+[ 1127.069515] r16-19  0000000000000cc0 0000000000000002 000000001045eaa0 0000000005c47000
+[ 1127.069515] r20-23  000000000800000e 000000004c2ce5ae 0000000000000384 0000000000000000
+[ 1127.069515] r24-27  0000000000000143 000000000800000e 0000000000000000 000000000c1f7000
+[ 1127.069515] r28-31  00000000000005c8 000000413a6a4e70 000000413a6a4ea0 0000000041430aa0
+[ 1127.069515] sr00-03  0000000000002800 0000000000000000 0000000000000000 0000000000019000
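
For reading the numeric codes in these logs: "error -12" and "error = -4" are
negated errno values, and on Linux 12 is ENOMEM and 4 is EINTR. A small
user-space helper can translate them. This is only a convenience sketch;
describe_kernel_error() is a hypothetical name and not part of any driver.

```c
#include <errno.h>
#include <string.h>

/*
 * Translate a kernel-style return value (a negated errno such as -12)
 * into the C library's description for that errno.
 * Hypothetical helper for reading logs, not kernel code.
 */
const char *describe_kernel_error(int ret)
{
	/* Kernel functions return -ERRNO; flip the sign before the lookup. */
	return strerror(ret < 0 ? -ret : ret);
}
```

Calling describe_kernel_error(-12) gives the text for ENOMEM, and
describe_kernel_error(-4) the text for EINTR.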

The string "WARNING - Unable to register controller with SCSI subsystem" is
from drivers/message/fusion/mptspi.c: mptspi_probe():
        sh = scsi_host_alloc(&mptspi_driver_template, sizeof(MPT_SCSI_HOST));
        if (!sh) {
                printk(MYIOC_s_WARN_FMT
                        "Unable to register controller with SCSI subsystem\n",
                        ioc->name);
                error = -1;
                goto out_mptspi_probe;
        }

So the kernel jumps to:
out_mptspi_probe:
        mptscsih_remove(pdev);
        return error;

Somewhere inside mptscsih_remove() the kernel crashes with a "Data memory access rights trap".
At first I assumed ioc->sh had an invalid value, but debugging showed that it's 0UL (NULL).
Do you have an idea what's going wrong in mptscsih_remove()?
I'd expect the kernel to free all memory, ignore those drives and continue booting (and then fail
later in the boot process because the root drive can't be found).
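
One pattern that would be consistent with a trap at a small address such as
0x7d0 while ioc->sh is NULL is a cleanup routine that reads a field at some
offset from the NULL host pointer. The user-space sketch below only
illustrates that pattern and a guarded teardown. All names (fake_host,
fake_adapter, fake_remove, fake_probe) are hypothetical stand-ins, and the
0x7d0 padding is an assumption chosen to mirror the fault address in the
trace, not the real structure layout.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a SCSI host; NOT the real struct Scsi_Host. */
struct fake_host {
	char pad[0x7d0];	/* assumption: places hostdata at offset 0x7d0 */
	void *hostdata;		/* driver-private data hanging off the host */
};

/* Hypothetical stand-in for the adapter structure. */
struct fake_adapter {
	struct fake_host *sh;	/* stays NULL when host allocation fails */
	int resources_freed;	/* set once the teardown has completed */
};

/* Address a read of sh->hostdata would touch if sh were NULL. */
size_t fake_fault_offset(void)
{
	return offsetof(struct fake_host, hostdata);
}

/*
 * Guarded teardown: look at the host's private data only if a host was
 * actually allocated, then release everything else unconditionally.
 * Returns 1 if private data was found, 0 otherwise.
 */
int fake_remove(struct fake_adapter *ioc)
{
	void *hd = NULL;

	if (ioc->sh != NULL)
		hd = ioc->sh->hostdata;

	/* ... free adapter resources that do not depend on the host ... */
	ioc->resources_freed = 1;
	return hd != NULL;
}

/* Probe with the same goto-based error path as the snippet above. */
int fake_probe(struct fake_adapter *ioc, int host_alloc_fails)
{
	int error = 0;

	ioc->sh = host_alloc_fails ? NULL : calloc(1, sizeof(*ioc->sh));
	if (!ioc->sh) {
		error = -1;
		goto out_probe;
	}
	return 0;

out_probe:
	fake_remove(ioc);	/* must cope with ioc->sh == NULL */
	return error;
}
```

With the NULL check in fake_remove(), fake_probe(&ioc, 1) returns -1 and
still marks the resources as freed. Without the check, the read of
ioc->sh->hostdata would touch address 0x7d0 in this sketch.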

Any idea what I could test?

Helge



