Re: megaraid_sas problem for scsi_add_host() fail

Sumit Saxena <sumit.saxena@xxxxxxxxxxxx> · Mon, 2 Mar 2020 23:28:54 +0530



On Mon, Mar 2, 2020 at 5:43 PM John Garry <john.garry@xxxxxxxxxx> wrote:
>
>
>
> Hi Sumit,
>
> > It is megaraid_sas driver bug. Driver does not freeup resources properly, when
> > scsi_add_host() fails. Please try attached patch.
>
> Yeah, that looks to work. The driver gracefully failed to bind.
Ok, I will send this patch to upstream.
>
> However we might have lots of memory leaks:
Thanks for pointing it out John. I will look into these memory leaks and send
patches with fix in the next few days.

Thanks,
Sumit

>
>
> root@(none)$ echo scan > /sys/kernel/debug/kmemleak
> root@(none)$ [  140.585484] kmemleak: 259 new suspected memory leaks
> (see /sys/kernel/debug/kmemleak)
> [  140.585484] kmemleak: 259 new suspected memory leaks (see
> /sys/kernel/debug/kmemleak)
>
> root@(none)$
> root@(none)$ more /sys/kernel/debug/kmemleak
> unreferenced object 0xffff0026b9184c00 (size 512):
>    comm "kworker/0:0", pid 5, jiffies 4294903201 (age 95.768s)
>    hex dump (first 32 bytes):
>      60 00 00 00 61 00 00 00 62 00 00 00 63 00 00 00  `...a...b...c...
>      64 00 00 00 65 00 00 00 66 00 00 00 67 00 00 00  d...e...f...g...
>    backtrace:
>      [<(____ptrval____)>] slab_post_alloc_hook+0x6c/0xa0
>      [<(____ptrval____)>] __kmalloc+0x174/0x280
>      [<(____ptrval____)>] megasas_probe_one+0x798/0x2878
>      [<(____ptrval____)>] local_pci_probe+0x74/0xf0
>      [<(____ptrval____)>] work_for_cpu_fn+0x2c/0x48
>      [<(____ptrval____)>] process_one_work+0x488/0xc08
>      [<(____ptrval____)>] worker_thread+0x330/0x5d0
>      [<(____ptrval____)>] kthread+0x1c8/0x1d0
>      [<(____ptrval____)>] ret_from_fork+0x10/0x18
> unreferenced object 0xffff0026b922c000 (size 4096):
>    comm "kworker/0:0", pid 5, jiffies 4294903201 (age 95.768s)
>    hex dump (first 32 bytes):
>      00 00 21 b7 26 00 ff ff 00 00 9f ff 00 00 00 00  ..!.&...........
>      00 10 22 10 00 a0 ff ff 00 00 00 00 00 00 00 00  ..".............
>    backtrace:
>      [<(____ptrval____)>] slab_post_alloc_hook+0x6c/0xa0
>      [<(____ptrval____)>] kmem_cache_alloc_trace+0x140/0x228
>      [<(____ptrval____)>] megasas_alloc_fusion_context+0x30/0x1b0
>      [<(____ptrval____)>] megasas_probe_one+0x7d8/0x2878
>      [<(____ptrval____)>] local_pci_probe+0x74/0xf0
>      [<(____ptrval____)>] work_for_cpu_fn+0x2c/0x48
>      [<(____ptrval____)>] process_one_work+0x488/0xc08
>      [<(____ptrval____)>] worker_thread+0x330/0x5d0
>      [<(____ptrval____)>] kthread+0x1c8/0x1d0
>      [<(____ptrval____)>] ret_from_fork+0x10/0x18
> unreferenced object 0xffff0026b7013000 (size 2048):
>    comm "kworker/0:0", pid 5, jiffies 4294903512 (age 94.540s)
>    hex dump (first 32 bytes):
>      00 58 18 b9 26 00 ff ff 00 5c 18 b9 26 00 ff ff  .X..&....\..&...
>      00 60 18 b9 26 00 ff ff 00 64 18 b9 26 00 ff ff  .`..&....d..&...
>    backtrace:
>      [<(____ptrval____)>] slab_post_alloc_hook+0x6c/0xa0
>      [<(____ptrval____)>] kmem_cache_alloc_trace+0x140/0x228
> root@(none)$
>
>
> Thanks,
> John
>
> >
> > Thanks,
> > Sumit
> >>
> >> [   62.516871] megasas: 07.713.01.00-rc1
> >> [   62.526189] megaraid_sas 0000:08:00.0: Adding to iommu group 1
> >> [   62.571790] megaraid_sas 0000:08:00.0: BAR:0x0  BAR's
> >> base_addr(phys):0x0000080010000000  mapped virt_addr:0x(____ptrval____)
> >> [   62.571802] megaraid_sas 0000:08:00.0: FW now in Ready state
> >> [   62.583811] megaraid_sas 0000:08:00.0: 63 bit DMA mask and 63 bit
> >> consistent mask
> >> [   62.602143] megaraid_sas 0000:08:00.0: firmware supports msix : (128)
> >> [   62.780250] megaraid_sas 0000:08:00.0: requested/available msix 128/128
> >> [   62.794292] megaraid_sas 0000:08:00.0: current msix/online cpus :
> >> (128/128)
> >> [   62.809011] megaraid_sas 0000:08:00.0: RDPQ mode : (enabled)
> >> [   62.820968] megaraid_sas 0000:08:00.0: Current firmware supports
> >> maximum commands: 4077 LDIO threshold: 0
> >> [   62.937043] megaraid_sas 0000:08:00.0: Configured max firmware
> >> commands: 4076
> >> [   63.509185] megaraid_sas 0000:08:00.0: Performance mode :Latency
> >> [   63.521906] megaraid_sas 0000:08:00.0: FW supports sync cache : Yes
> >> [   63.535148] megaraid_sas 0000:08:00.0: megasas_disable_intr_fusion is
> >> called outbound_intr_mask:0x40000009
> >> [   63.610607] megaraid_sas 0000:08:00.0: FW provided supportMaxExtLDs:
> >> 1 max_lds: 64
> >> [   63.626618] megaraid_sas 0000:08:00.0: controller type : MR(2048MB)
> >> [   63.639870] megaraid_sas 0000:08:00.0: Online Controller Reset(OCR) :
> >> Enabled
> >> [   63.654945] megaraid_sas 0000:08:00.0: Secure JBOD support : Yes
> >> [   63.667661] megaraid_sas 0000:08:00.0: NVMe passthru support : Yes
> >> [   63.667672] megaraid_sas 0000:08:00.0: FW provided TM TaskAbort/Reset
> >> timeout : 6 secs/60 secs
> >> [   63.698922] megaraid_sas 0000:08:00.0: JBOD sequence map support : Yes
> >> [   63.712715] megaraid_sas 0000:08:00.0: PCI Lane Margining support : No
> >> [   63.754764] megaraid_sas 0000:08:00.0: NVME page size : (4096)
> >> [   63.787258] megaraid_sas 0000:08:00.0: megasas_enable_intr_fusion is
> >> called outbound_intr_mask:0x40000000
> >> [   63.807485] megaraid_sas 0000:08:00.0: INIT adapter done
> >> [   63.822235] megaraid_sas 0000:08:00.0: pci id :
> >> (0x1000)/(0x0016)/(0x19e5)/(0xd215)
> >> [   63.838652] megaraid_sas 0000:08:00.0: unevenspan support : no
> >> [   63.850980] megaraid_sas 0000:08:00.0: firmware crash dump : no
> >> [   63.863499] megaraid_sas 0000:08:00.0: JBOD sequence map : enabled
> >> [   63.877352] scsi host0: Avago SAS based MegaRAID driver
> >> [   63.890398] megaraid_sas 0000:08:00.0: Failed to add host from
> >> megasas_io_attach 6802
> >> [   63.906999] megaraid_sas 0000:08:00.0: megasas_disable_intr_fusion is
> >> called outbound_intr_mask:0x40000009
> >> [   64.591755] nvme 0000:81:00.0: Adding to iommu group 2
> >> [   64.636476] nvme nvme0: pci function 0000:81:00.0
> >> [   64.669635] libphy: Fixed MDIO Bus: probed
> >> [   64.680255] tun: Universal TUN/TAP device driver, 1.6
> >> [   64.694422] thunder_xcv, ver 1.0
> >> [   64.702042] thunder_bgx, ver 1.0
> >> [   64.709277] nicpf, ver 1.0
> >> [   64.718144] e1000e: Intel(R) PRO/1000 Network Driver - 3.2.6-k
> >> [   64.730402] e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
> >> [   64.743337] igb: Intel(R) Gigabit Ethernet Network Driver - version
> >> 5.6.0-k
> >> [   64.754981] nvme nvme0: Removing after probe failure status: -12
> >> [   64.757953] igb: Copyright (c) 2007-2014 Intel Corporation.
> >> [   64.782805] igbvf: Intel(R) Gigabit Virtual Function Network Driver -
> >> version 2.4.0-k
> >> [   64.799423] igbvf: Copyright (c) 2009 - 2012 Intel Corporation.
> >> [   64.813848] sky2: driver version 1.30
> >> [   64.825564] VFIO - User Level meta-driver version: 0.3
> >> [   64.848089] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
> >> [   64.862029] ehci-pci: EHCI PCI platform driver
> >> [   64.873445] ehci-pci 0000:7a:01.0: Adding to iommu group 3
> >> [   64.886700]
> >> ==================================================================
> >> [   64.901999] BUG: KASAN: slab-out-of-bounds in
> >> run_timer_softirq+0x6f4/0xae0
> >> [   64.916663] Write of size 8 at addr ffff0026b931aae0 by task swapper/0/0
> >>
> >> [   64.933914] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
> >> 5.6.0-rc3-00005-g17ceebe3a05c-dirty #1775
> >> [   64.952240] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD,
> >> BIOS 2280-V2 CS V3.B160.01 02/24/2020
> >> [   64.972575] Call trace:
> >> [   64.977729]  dump_backtrace+0x0/0x298
> >> [   64.985439]  show_stack+0x14/0x20
> >> [   64.992418]  dump_stack+0x118/0x190
> >> [   64.999762]  print_address_description.isra.9+0x6c/0x3b8
> >> [   65.010953]  __kasan_report+0x134/0x23c
> >> [   65.019029]  kasan_report+0xc/0x18
> >> [   65.026188]  __asan_store8+0x94/0xb8
> >> [   65.033720]  run_timer_softirq+0x6f4/0xae0
> >> [   65.042343]  efi_header_end+0x16c/0x840
> >> [   65.050420]  irq_exit+0x19c/0x1a8
> >> [   65.057396]  __handle_domain_irq+0x7c/0xe0
> >> [   65.066022]  gic_handle_irq+0x64/0x168
> >> [   65.073917]  el1_irq+0xbc/0x180
> >> [   65.080528]  arch_cpu_idle+0x3c/0x320
> >> [   65.088239]  default_idle_call+0x28/0x4c
> >> [   65.096502]  do_idle+0x278/0x348
> >> [   65.103295]  cpu_startup_entry+0x24/0x40
> >> [   65.111554]  rest_init+0x1c4/0x298
> >> [   65.118718]  arch_call_rest_init+0xc/0x14
> >> [   65.127159]  start_kernel+0x848/0x888
> >>
> >> [   65.138006] Allocated by task 0:
> >> [   65.144802] (stack is not available)
> >>
> >> [   65.155465] Freed by task 0:
> >> [   65.161530] (stack is not available)
> >>
> >> [   65.172193] The buggy address belongs to the object at ffff0026b931aa00
> >>    which belongs to the cache pool_workqueue of size 256
> >> [   65.199113] The buggy address is located 224 bytes inside of
> >>    256-byte region [ffff0026b931aa00, ffff0026b931ab00)
> >> [   65.223840] The buggy address belongs to the page:
> >> [   65.233931] page:fffffe009ac4c600 refcount:1 mapcount:0
> >> mapping:ffff0026dd81c880 index:0xffff0026b931fe00 compound_mapcount: 0
> >> [   65.257923] flags: 0x6ffff00000010200(slab|head)
> >> [   65.267649] raw: 6ffff00000010200 fffffe009b20b208 fffffe009ac07608
> >> ffff0026dd81c880
> >> [   65.283959] raw: ffff0026b931fe00 0000000000400002 00000001ffffffff
> >> 0000000000000000
> >> [   65.300270] page dumped because: kasan: bad access detected
> >>
> >> [   65.315139] Memory state around the buggy address:
> >> [   65.325231]  ffff0026b931a980: fc fc fc fc fc fc fc fc fc fc fc fc fc
> >> fc fc fc
> >> [   65.340445]  ffff0026b931aa00: fc fc fc fc fc fc fc fc fc fc fc fc fc
> >> fc fc fc
> >> [   65.355660] >ffff0026b931aa80: fc fc fc fc fc fc fc fc fc fc fc fc fc
> >> fc fc fc
> >> [   65.370870]                                                        ^
> >> [   65.384256]  ffff0026b931ab00: fc fc fc fc fc fc fc fc fc fc fc fc fc
> >> fc fc fc
> >> [   65.399467]  ffff0026b931ab80: fc fc fc fc fc fc fc fc fc fc fc fc fc
> >> fc fc fc
> >> [   65.414675]
> >> ==================================================================
> >> [   65.429885] Disabling lock debugging due to kernel taint
> >> [   65.441431] Unable to handle kernel paging request at virtual address
> >> ffffa0001013c0b0
> >> [   65.441695] ehci-pci 0000:7a:01.0: EHCI Host Controller
> >> [   65.458088] Mem abort info:
> >> [   65.469183] ehci-pci 0000:7a:01.0: new USB bus registered, assigned
> >> bus number 1
> >> [   65.474927]   ESR = 0x96000007
> >> [   65.491201] ehci-pci 0000:7a:01.0: irq 65, io mem 0x20c101000
> >> [   65.496913]   EC = 0x25: DABT (current EL), IL = 32 bits
> >> [   65.496918]   SET = 0, FnV = 0
> >> [   65.496922]   EA = 0, S1PTW = 0
> >> [   65.522586] ehci-pci 0000:7a:01.0: USB 0.0 started, EHCI 1.00
> >> [   65.526575] Data abort info:
> >> [   65.526580]   ISV = 0, ISS = 0x00000007
> >> [   65.535948] hub 1-0:1.0: USB hub found
> >> [   65.545245]   CM = 0, WnR = 0
> >> [   65.545251] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000052530000
> >> [   65.545256] [ffffa0001013c0b0] pgd=00002027fffff003,
> >> pud=00002027ffffe003, pmd=00000026dda5b003, pte=0000000000000000
> >> [   65.551519] hub 1-0:1.0: 2 ports detected
> >> [   65.559375] Internal error: Oops: 96000007 [#1] PREEMPT SMP
> >> [   65.559379] Modules linked in:
> >> [   65.569534] ehci-platform: EHCI generic platform driver
> >> [   65.573475] CPU: 34 PID: 8 Comm: kworker/u256:0 Tainted: G    B
> >>         5.6.0-rc3-00005-g17ceebe3a05c-dirty #1775
> >> [   65.573477] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDD,
> >> BIOS 2280-V2 CS V3.B160.01 02/24/2020
> >> [   65.573487] Workqueue: poll_megasas0_status megasas_fault_detect_work
> >> [   65.573492] pstate: 80c00009 (Nzcv daif +PAN +UAO)
> >> [   65.588048] ehci-orion: EHCI orion driver
> >> [   65.609756] pc : megasas_readl+0x60/0x80
> >> [   65.609759] lr : megasas_readl+0x1c/0x80
> >> [   65.609761] sp : ffff0026d97bfc00
> >> [   65.609763] x29: ffff0026d97bfc00 x28: ffff0026d97a9890
> >> [   65.609767] x27: ffff0026d97a0618 x26: ffff0026d97a9880
> >> [   65.609771] x25: ffff0026d9758808 x24: ffff0026b931aa28
> >> [   65.609775] x23: ffff0026b931aa98 x22: ffffa0002931e000
> >> [   65.609779] x21: ffff0026dd898800 x20: ffff0026b931dcd8
> >> [   65.618543] ehci-exynos: EHCI Exynos driver
> >> [   65.629840] x19: ffffa0001013c0b0 x18: 0000000000000000
> >> [   65.629843] x17: 0000000000001d50 x16: ffffffffffffe240
> >> [   65.629847] x15: 00000000000013a8 x14: 0000000000000000
> >> [   65.629850] x13: 00000000000013a0 x12: 1fffe004db2f7f7c
> >> [   65.629854] x11: ffff8004db2f7f78 x10: dfffa00000000000
> >> [   65.629857] x9 : ffffa00028f679e8 x8 : ffffa0002a483a48
> >> [   65.629861] x7 : ffffa00026d5ed94 x6 : 0000000000000000
> >> [   65.629864] x5 : ffffa0002a483a48 x4 : 0000000000000000
> >> [   65.629868] x3 : ffffa000279df03c x2 : 0000000000000000
> >> [   65.636662] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
> >> [   65.647207] x1 : ef244e124d671400 x0 : 0000000000000004
> >> [   65.647210] Call trace:
> >> [   65.647214]  megasas_readl+0x60/0x80
> >> [   65.647218]  megasas_read_fw_status_reg_fusion+0x2c/0x38
> >> [   65.647221]  megasas_fault_detect_work+0x44/0x520
> >> [   65.647226]  process_one_work+0x488/0xc08
> >> [   65.647228]  worker_thread+0x68/0x5d0
> >> [   65.647233]  kthread+0x1c8/0x1d0
> >> [   65.669535] ohci-pci: OHCI PCI platform driver
> >> [   65.689683]  ret_from_fork+0x10/0x18
> >> [   65.689689] Code: 54ffff09 a94153f3 a8c27bfd d65f03c0 (b9400260)
> >> [   65.689695] ---[ end trace 3632c7efc4f2d69c ]---
> >>
> >>
> >> That's 5.6-rc3 .
> >>
> >> Please have a look,
> >>
> >> John
> >>
> >>
> >>
>