* Helge Deller <deller@xxxxxx>:
> On 10/21/20 5:52 PM, James Bottomley wrote:
> > On Tue, 2020-10-20 at 15:45 +0200, Helge Deller wrote:
> >> Latest Linux kernels v5.8 and v5.9 fail to boot for me on the C8000
> >> machines with this error:
> >>  mptspi: probe of 0000:40:01.0 failed with error -12
> >>  mptbase: ioc1: ERROR - Insufficient memory to add adapter!
> >>  mptspi: probe of 0000:40:01.1 failed with error -12
> >
> > I think you've already figured out that this is an allocation issue.
> > However, it does seem fishy, the code is
> >
> > 	ioc = kzalloc(sizeof(MPT_ADAPTER), GFP_KERNEL);
> > 	if (ioc == NULL) {
> > 		printk(KERN_ERR MYNAM ": ERROR - Insufficient memory to add adapter!\n");
> > 		return -ENOMEM;
> > 	}
> >
> > And MPT_ADAPTER should be just under a page which looks like a very odd
> > allocation to fail so early in boot. The memory subsystem should have
> > also printed out a trace explaining why it failed the allocation.
>
> I think there are a few issues here.
> First, the allocation issue as seen above is from a current git head,
> where it seems memory allocation is somewhat broken. For now I would ignore it
> until git head stabilizes...
>
> Then, in my machine I have two U320 drives, one "SEAGATE ST373307LW", and one
> "HP 73.4GMAW3073NP". It seems both drives are starting to fail, because
> even in the firmware, when running "search for boot devices", they sometimes
> fail to be detected.
>
> The good thing about bad drives is that with those it's now possible to
> debug error code paths in the drivers. In my case the last syslog
> looks like this (I'm currently testing with Linus' plain v5.9 kernel).
>
> +[ 1126.041880] ioc0: LSI53C1030 B2: Capabilities={Initiator,Target}
> +Begin: Waiting for root file system ...
> +[ 1127.069515] scsi host2: error handler thread failed to spawn, error = -4
> +[ 1127.069515] mptspi: ioc0: WARNING - Unable to register controller with SCSI subsystem
> +<Cpu1> 78000c6201e00000 a0e008c01100b009 CC_PAT_ENCODED_FIELD_WARNING
> +<Cpu1> 76000c6801e00000 0000000000000520 CC_PAT_DATA_FIELD_WARNING
> <XXX: here is something missing - serial port is often not fast enough....>
> +[ 1127.069515] Backtrace:
> +[ 1127.069515] [<000000001045b7cc>] mptspi_probe+0x248/0x3d0 [mptspi]
> +[ 1127.069515] [<0000000040946470>] pci_device_probe+0x1ac/0x2d8
> +[ 1127.069515] [<0000000040add668>] really_probe+0x1bc/0x988
> +[ 1127.069515] [<0000000040ade704>] driver_probe_device+0x160/0x218
> +[ 1127.069515] [<0000000040adee24>] device_driver_attach+0x160/0x188
> +[ 1127.069515] [<0000000040adef90>] __driver_attach+0x144/0x320
> +[ 1127.069515] [<0000000040ad7c78>] bus_for_each_dev+0xd4/0x158
> +[ 1127.069515] [<0000000040adc138>] driver_attach+0x4c/0x80
> +[ 1127.069515] [<0000000040adb3ec>] bus_add_driver+0x3e0/0x498
> +[ 1127.069515] [<0000000040ae0130>] driver_register+0xf4/0x298
> +[ 1127.069515] [<00000000409450c4>] __pci_register_driver+0x78/0xa8
> +[ 1127.069515] [<000000000007d248>] mptspi_init+0x18c/0x1c4 [mptspi]
> +[ 1127.069515] [<0000000040200f18>] do_one_initcall+0x74/0x314
> +[ 1127.069515] [<00000000403528c0>] do_init_module+0xb4/0x640
> +[ 1127.069515] [<0000000040356a24>] load_module+0x3a48/0x493c
> +[ 1127.069515] [<0000000040357d58>] __do_sys_finit_module+0x120/0x1bc
> +[ 1127.069515] [<0000000040357e84>] sys_finit_module+0x30/0xa0
> +[ 1127.069515] [<0000000040210054>] syscall_exit+0x0/0x14
> +[ 1127.069515]
> +[ 1127.069515] Kernel Fault: Code=26 (Data memory access rights trap) at addr 00000000000007d0
> +[ 1127.069515] CPU: 1 PID: 94 Comm: systemd-udevd Tainted: G E 5.9.0-1-parisc64 #1 Debian 5.9.1-1
> +[ 1127.069515] Hardware name: 9000/785/C8000
> +[ 1127.069515]
> +[ 1127.069515] YZrvWESTHLNXBCVMcbcbcbcbOGFRQPDI
> +[ 1127.069515] PSW: 00001000000011101111111000001111 Tainted: G E
> +[ 1127.069515] r00-03 000000ff080efe0f 000000413a6a4d60 000000000c1f8be8 000000413a6a4e00
> +[ 1127.069515] r04-07 000000000c1f7000 0000004087ce3000 000000007f41e000 0000000000000000
> +[ 1127.069515] r08-11 0000004087ce3000 000000001045e500 000000001045e6f8 000000004158ea68
> +[ 1127.069515] r12-15 0000000000000002 0000000000000000 000000413a6a44a0 0000000040f92680
> +[ 1127.069515] r16-19 0000000000000cc0 0000000000000002 000000001045eaa0 0000000005c47000
> +[ 1127.069515] r20-23 000000000800000e 000000004c2ce5ae 0000000000000384 0000000000000000
> +[ 1127.069515] r24-27 0000000000000143 000000000800000e 0000000000000000 000000000c1f7000
> +[ 1127.069515] r28-31 00000000000005c8 000000413a6a4e70 000000413a6a4ea0 0000000041430aa0
> +[ 1127.069515] sr00-03 0000000000002800 0000000000000000 0000000000000000 0000000000019000
>
> The string "WARNING - Unable to register controller with SCSI subsystem" is
> from drivers/message/fusion/mptspi.c: mptspi_probe():
> 	sh = scsi_host_alloc(&mptspi_driver_template, sizeof(MPT_SCSI_HOST));
> 	if (!sh) {
> 		printk(MYIOC_s_WARN_FMT
> 			"Unable to register controller with SCSI subsystem\n",
> 			ioc->name);
> 		error = -1;
> 		goto out_mptspi_probe;
> 	}
>
> so, the kernel jumps to:
> out_mptspi_probe:
> 	mptscsih_remove(pdev);
> 	return error;
>
> Somewhere inside mptscsih_remove() the kernel crashes with a "Data memory access rights trap".
> At first I assumed ioc->sh had an invalid value, but debugging showed that it's 0UL.
> Do you have an idea what's going wrong in mptscsih_remove()?
> I'd expect the kernel to free all memory, ignore those drives and continue booting (and fail
> later in the boot process because the root drive isn't found then).
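
A possible explanation for why a NULL ioc->sh still ends in a trap:
mptscsih_remove() starts with "if((hd = shost_priv(host)) == NULL) return;",
and as far as I can tell shost_priv() just returns the address of the
hostdata[] array at the end of struct Scsi_Host. For host == NULL that is a
small non-zero offset, not NULL, so the early return is never taken. The
sketch below only models this; struct fake_host, its field size and both
helper functions are made up for illustration and are not kernel code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct Scsi_Host: fixed fields followed by a
 * trailing array that holds the driver's private data. The size 1024 is a
 * placeholder and says nothing about the real layout. */
struct fake_host {
	char fixed_fields[1024];
	unsigned long hostdata[];	/* driver-private area starts here */
};

/* Value a "return host->hostdata" style accessor would produce for a NULL
 * host: NULL plus the offset of the trailing array. */
size_t priv_offset_for_null_host(void)
{
	size_t off = offsetof(struct fake_host, hostdata);

	assert(off > 0);	/* holds whenever a field precedes hostdata */
	return off;
}

/* Models the old check for host == NULL.
 * Returns 1 if the early return would be taken, 0 otherwise. */
int old_check_returns_early(void)
{
	return priv_offset_for_null_host() == 0;
}
```

In this model old_check_returns_early() yields 0, i.e. the function carries
on with a bogus hd pointer and the later hd->info_kbuf access can fault.
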
Anyone can trigger the fault (on any architecture) with this patch:

diff --git a/drivers/message/fusion/mptspi.c b/drivers/message/fusion/mptspi.c
index eabc4de5816c..1f26ecea4c95 100644
--- a/drivers/message/fusion/mptspi.c
+++ b/drivers/message/fusion/mptspi.c
@@ -1404,6 +1404,7 @@ mptspi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	}
 
 	sh = scsi_host_alloc(&mptspi_driver_template, sizeof(MPT_SCSI_HOST));
+	sh = NULL;
 	if (!sh) {
 		printk(MYIOC_s_WARN_FMT

With the patch below the driver now exits cleanly:

[ 1119.508147] Fusion MPT base driver 3.04.20
[ 1119.508147] Copyright (c) 1999-2008 LSI Corporation
[ 1119.508147] Fusion MPT SPI Host driver 3.04.20
[ 1119.508147] mptbase: ioc0: Initiating bringup
[ 1119.508147] sr 1:0:0:0: [sr0] scsi3-mmc drive: 40x/40x cd/rw xa/form2 cdda tray
[ 1119.508147] cdrom: Uniform CD-ROM driver Revision: 3.20
[ 1119.508147] ioc0: LSI53C1030 B2: Capabilities={Initiator,Target}
[ 1121.512619] mptspi: ioc0: WARNING - Unable to register controller with SCSI subsystem
[ 1121.512619] mptspi: probe of 0000:40:01.0 failed with error -1
[ 1121.512619] mptbase: ioc1: Initiating bringup
[ 1122.508645] ioc1: LSI53C1030 B2: Capabilities={Initiator,Target}
[ 1122.508645] mptspi: ioc1: WARNING - Unable to register controller with SCSI subsystem
[ 1123.417139] mptspi: probe of 0000:40:01.1 failed with error -1
[ 1123.487494] Fusion MPT FC Host driver 3.04.20
[ 1123.487494] Fusion MPT SAS Host driver 3.04.20
[ 1123.487494] Fusion MPT misc device (ioctl) driver 3.04.20
[ 1123.487494] mptctl: Registered with Fusion MPT base driver
[ 1123.487494] mptctl: /dev/mptctl @ (major,minor=10,220)

I'll send this patch to the scsi mailing list shortly:

[PATCH] scsi: mptfusion: Fix error paths in mptscsih_remove()

Signed-off-by: Helge Deller <deller@xxxxxx>

diff --git a/drivers/message/fusion/mptscsih.c b/drivers/message/fusion/mptscsih.c
index 8543f0324d5a..0d1b2b0eb843 100644
--- a/drivers/message/fusion/mptscsih.c
+++ b/drivers/message/fusion/mptscsih.c
@@ -1176,8 +1176,10 @@ mptscsih_remove(struct pci_dev *pdev)
 	MPT_SCSI_HOST *hd;
 	int sz1;
 
-	if((hd = shost_priv(host)) == NULL)
-		return;
+	if (host == NULL)
+		hd = NULL;
+	else
+		hd = shost_priv(host);
 
 	mptscsih_shutdown(pdev);
 
@@ -1193,14 +1195,15 @@ mptscsih_remove(struct pci_dev *pdev)
 	    "Free'd ScsiLookup (%d) memory\n",
 	    ioc->name, sz1));
 
-	kfree(hd->info_kbuf);
+	if (hd)
+		kfree(hd->info_kbuf);
 
 	/* NULL the Scsi_Host pointer
 	 */
 	ioc->sh = NULL;
 
-	scsi_host_put(host);
-
+	if (host)
+		scsi_host_put(host);
 	mpt_detach(pdev);
 
 }
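
For anyone who wants to play with the idea outside the kernel, here is a
rough userspace sketch of the same teardown pattern: adapter-owned resources
are always released, host-owned ones only when a host was actually allocated.
All names (model_host, model_adapter, model_remove) are hypothetical stand-ins,
the private data is folded into the host struct for simplicity, and free()
and the counters merely stand in for kfree(), scsi_host_put() and mpt_detach():

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver structures; illustrative only. */
struct model_host {
	char *info_kbuf;	/* buffer owned by the host's private data */
	int refs;		/* stands in for the host reference count */
};

struct model_adapter {
	struct model_host *sh;	/* NULL if host allocation failed */
	void **lookup;		/* adapter-owned table */
	int detached;		/* set once the adapter is detached */
};

/* Teardown that tolerates an adapter whose host was never allocated. */
void model_remove(struct model_adapter *ioc)
{
	struct model_host *host;

	assert(ioc != NULL);
	host = ioc->sh;

	/* Adapter-owned resources are released unconditionally. */
	free(ioc->lookup);
	ioc->lookup = NULL;

	/* Host-owned resources are touched only if a host exists. */
	if (host) {
		free(host->info_kbuf);
		host->info_kbuf = NULL;
		host->refs--;	/* stands in for scsi_host_put() */
	}

	ioc->sh = NULL;
	ioc->detached = 1;	/* stands in for mpt_detach() */
}
```

Calling model_remove() on an adapter with sh == NULL frees the lookup table
and marks the adapter detached without touching any host state, which is the
behaviour the probe error path needs.
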