Fwd: Infiniband mthca driver crash on linux kernel 5.11 and higher

To whom it may concern,

I hope this is the right place to submit Infiniband driver bug reports.
As I understand it, cards that rely on the mthca driver are relatively
old and I would understand if they are no longer supported.
I've attached files detailing my tests and conditions and relevant
dmesg snippets. If there is any other information I can provide, or
somewhere else I should send this, please let me know.

Thanks in advance!
Andrew

Infiniband MTHCA module crash

Working: Ubuntu 20.04, kernel 5.4
Crashing: Alpine kernel 5.15.1, Xanmod kernel 5.14.17, Ubuntu 21.04 kernel 5.11
Rootfs is Alpine Linux Edge (NFS-shared rootfs).
All nodes run the same rootfs regardless of kernel.
All cards have the latest firmware. The switch is an IS5022.

OpenSM versions tested:
Distribution version (Alpine Testing)
OFED 3.3.20 (with Alpine musl fixes patch)
linux-rdma (also with Alpine musl fixes patch)
Ubuntu 20.04 distribution (run from chroot)

Crash conditions as tested:
Run opensm (any version listed above).
opensm sits at "Entering DISCOVERY state".
dmesg shows the entries attached below.
The ib_mthca module can no longer be unloaded.

Working conditions (with Ubuntu 20.04 kernel 5.4):
Run opensm the same way as above.
IPoIB interfaces can be brought up and can ping other nodes that run the same kernel (5.4).

Note:
Nodes on the same switch that are not running kernel 5.4 will have their kernel modules crash when opensm is run on a node running kernel 5.4.

Infiniband MTHCA module crash - 5.15.1

[   42.545456] BUG: unable to handle page fault for address: 0000000000040028
[   42.545464] #PF: supervisor read access in kernel mode
[   42.545467] #PF: error_code(0x0000) - not-present page
[   42.545469] PGD 0 P4D 0 
[   42.545471] Oops: 0000 [#1] SMP NOPTI
[   42.545474] CPU: 16 PID: 509 Comm: kworker/u65:0 Tainted: P           O      5.15.1-3-lts #4-Alpine
[   42.545478] Hardware name: Micro-Star International Co., Ltd. MS-7A34/B350 PC MATE (MS-7A34), BIOS A.LR 07/02/2020
[   42.545480] Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
[   42.545499] RIP: 0010:mthca_poll_cq+0x1e1/0x860 [ib_mthca]
[   42.545505] Code: 8d 84 24 28 02 00 00 0f c9 41 2b 8c 24 64 02 00 00 89 ce 41 8b 8c 24 4c 02 00 00 d3 ee 89 f1 41 03 b4 24 f4 01 00 00 48 63 f6 <48> 8b 34 f7 49 89 37 48 85 c0 74 1b 44 8b 48 0c 8b 78 14 41 39 c9
[   42.545509] RSP: 0018:ffffaef8814a7ce0 EFLAGS: 00010006
[   42.545511] RAX: ffff9336081da728 RBX: ffff933637134000 RCX: 0000000000008005
[   42.545513] RDX: 0000000000000080 RSI: 0000000000008005 RDI: 0000000000000000
[   42.545515] RBP: ffff9336061c8400 R08: 000000000000000a R09: ffff933605af28b4
[   42.545517] R10: 0000000000000246 R11: 0000000000000000 R12: ffff9336081da500
[   42.545519] R13: ffff9336061c84e0 R14: 0000000000000000 R15: ffff9336108d4800
[   42.545521] FS:  0000000000000000(0000) GS:ffff9344cec00000(0000) knlGS:0000000000000000
[   42.545523] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   42.545525] CR2: 0000000000040028 CR3: 0000000105cac000 CR4: 0000000000350ee0
[   42.545527] Call Trace:
[   42.545530]  ? update_load_avg+0x78/0x5a0
[   42.545535]  ? newidle_balance+0x123/0x3f0
[   42.545538]  ? __switch_to_asm+0x42/0x70
[   42.545541]  ? finish_task_switch.isra.0+0xa7/0x280
[   42.545545]  __ib_process_cq+0x57/0x150 [ib_core]
[   42.545558]  ib_cq_poll_work+0x26/0x80 [ib_core]
[   42.545570]  process_one_work+0x1ec/0x390
[   42.545573]  worker_thread+0x53/0x3c0
[   42.545575]  ? process_one_work+0x390/0x390
[   42.545577]  kthread+0x127/0x150
[   42.545580]  ? set_kthread_struct+0x40/0x40
[   42.545583]  ret_from_fork+0x22/0x30
[   42.545586] Modules linked in: bonding xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter ip_tables x_tables bridge stp llc nfsd auth_rpcgss lockd grace sunrpc nls_utf8 nls_cp437 vfat fat ftdi_sio usbserial ib_ipoib ib_umad ib_cm af_packet r8153_ecm cdc_ether usbnet r8152 mii pcspkr efi_pstore ib_mthca ib_uverbs ib_core ipv6 sp5100_tco i2c_piix4 k10temp input_leds mousedev intel_rapl_msr joydev intel_rapl_common kvm_amd ccp rng_core kvm irqbypass crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd wmi_bmof rapl wmi parport_pc parport evdev button acpi_cpufreq efivarfs hid_generic usbhid hid crc32_pclmul r8169 realtek mdio_devres libphy nvme nvme_core hwmon ahci libahci libata xhci_pci xhci_pci_renesas xhci_hcd simpledrm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt
[   42.545632]  fb_sys_fops cfbcopyarea cec drm i2c_core drm_panel_orientation_quirks agpgart loop zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) ext4 crc32c_generic crc32c_intel crc16 mbcache jbd2 usb_storage usbcore usb_common sd_mod t10_pi scsi_mod
[   42.545662] CR2: 0000000000040028
[   42.545664] ---[ end trace 4634d65b0351fcb8 ]---
[   42.545665] RIP: 0010:mthca_poll_cq+0x1e1/0x860 [ib_mthca]
[   42.545670] Code: 8d 84 24 28 02 00 00 0f c9 41 2b 8c 24 64 02 00 00 89 ce 41 8b 8c 24 4c 02 00 00 d3 ee 89 f1 41 03 b4 24 f4 01 00 00 48 63 f6 <48> 8b 34 f7 49 89 37 48 85 c0 74 1b 44 8b 48 0c 8b 78 14 41 39 c9
[   42.545674] RSP: 0018:ffffaef8814a7ce0 EFLAGS: 00010006
[   42.545676] RAX: ffff9336081da728 RBX: ffff933637134000 RCX: 0000000000008005
[   42.545677] RDX: 0000000000000080 RSI: 0000000000008005 RDI: 0000000000000000
[   42.545679] RBP: ffff9336061c8400 R08: 000000000000000a R09: ffff933605af28b4
[   42.545681] R10: 0000000000000246 R11: 0000000000000000 R12: ffff9336081da500
[   42.545682] R13: ffff9336061c84e0 R14: 0000000000000000 R15: ffff9336108d4800
[   42.545684] FS:  0000000000000000(0000) GS:ffff9344cec00000(0000) knlGS:0000000000000000
[   42.545686] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   42.545688] CR2: 0000000000040028 CR3: 0000000105cac000 CR4: 0000000000350ee0

Infiniband MTHCA module crash - 5.14.17

[  169.267974] BUG: unable to handle page fault for address: 0000000000040028
[  169.267980] #PF: supervisor read access in kernel mode
[  169.267982] #PF: error_code(0x0000) - not-present page
[  169.267984] PGD 0 P4D 0 
[  169.267986] Oops: 0000 [#1] SMP NOPTI
[  169.267989] CPU: 11 PID: 891 Comm: kworker/u65:2 Not tainted 5.14.17-xanmod1 #0~git20211106.2bf32bb
[  169.267992] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./A520M-HDV, BIOS P1.60 03/18/2021
[  169.267994] Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
[  169.268005] RIP: 0010:mthca_poll_cq+0x1db/0x830 [ib_mthca]
[  169.268010] Code: 02 00 00 49 8d 85 28 02 00 00 0f c9 41 2b 8d 64 02 00 00 89 ce 41 8b 8d 4c 02 00 00 d3 ee 89 f1 41 03 b5 f4 01 00 00 48 63 f6 <48> 8b 34 f7 49 89 37 48 85 c0 74 1c 44 8b 48 0c 8b 78 14 41 39 c9
[  169.268013] RSP: 0018:ffffb70b810cfcd0 EFLAGS: 00010006
[  169.268014] RAX: ffff9a7e87878f28 RBX: ffff9a7e878d5000 RCX: 0000000000008005
[  169.268016] RDX: 0000000000000080 RSI: 0000000000008005 RDI: 0000000000000000
[  169.268017] RBP: ffffb70b810cfe18 R08: 000000000000000a R09: ffff9a7e8a13aa2c
[  169.268019] R10: 0000000000000282 R11: 0000000000000000 R12: ffff9a7e8db08c00
[  169.268020] R13: ffff9a7e87878d00 R14: 0000000000000000 R15: ffff9a7e8ded9000
[  169.268022] FS:  0000000000000000(0000) GS:ffff9a81becc0000(0000) knlGS:0000000000000000
[  169.268024] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  169.268025] CR2: 0000000000040028 CR3: 0000000106efc000 CR4: 0000000000750ee0
[  169.268027] PKRU: 55555554
[  169.268028] Call Trace:
[  169.268031]  ? release_sock+0xa/0x90
[  169.268035]  ? __cond_resched+0x11/0x40
[  169.268038]  ? update_load_avg+0x7a/0x530
[  169.268041]  ? newidle_balance+0x11b/0x3f0
[  169.268043]  ? dequeue_entity+0xc1/0x3f0
[  169.268045]  ? __switch_to_asm+0x42/0x70
[  169.268048]  ? finish_task_switch.isra.0+0xa2/0x280
[  169.268050]  __ib_process_cq+0x49/0xd0 [ib_core]
[  169.268058]  ib_cq_poll_work+0x21/0x80 [ib_core]
[  169.268065]  process_one_work+0x1f5/0x350
[  169.268068]  worker_thread+0x4b/0x400
[  169.268069]  ? process_one_work+0x350/0x350
[  169.268071]  kthread+0x122/0x140
[  169.268073]  ? set_kthread_struct+0x30/0x30
[  169.268076]  ret_from_fork+0x22/0x30
[  169.268079] Modules linked in: overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter ip_tables x_tables rpcsec_gss_krb5 bridge stp llc auth_rpcgss nfsv4 ib_qib rdmavt dca ib_ipoib ib_umad ib_cm wmi_bmof pcspkr efi_pstore nvme nvme_core ib_mthca ib_uverbs ib_core sp5100_tco ahci libahci k10temp intel_rapl_msr mac_hid intel_rapl_common edac_mce_amd kvm_amd ccp kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd rapl nfsv3 nfs_acl nfs lockd grace sunrpc fscache netfs ftdi_sio usbserial r8169 crc32_pclmul realtek mdio_devres i2c_piix4 libphy xhci_pci xhci_pci_renesas wmi gpio_amdpt gpio_generic
[  169.268114] CR2: 0000000000040028
[  169.268115] ---[ end trace 0b7e3d6ee2a7b04a ]---
[  169.307003] RIP: 0010:mthca_poll_cq+0x1db/0x830 [ib_mthca]
[  169.307009] Code: 02 00 00 49 8d 85 28 02 00 00 0f c9 41 2b 8d 64 02 00 00 89 ce 41 8b 8d 4c 02 00 00 d3 ee 89 f1 41 03 b5 f4 01 00 00 48 63 f6 <48> 8b 34 f7 49 89 37 48 85 c0 74 1c 44 8b 48 0c 8b 78 14 41 39 c9
[  169.307011] RSP: 0018:ffffb70b810cfcd0 EFLAGS: 00010006
[  169.307013] RAX: ffff9a7e87878f28 RBX: ffff9a7e878d5000 RCX: 0000000000008005
[  169.307014] RDX: 0000000000000080 RSI: 0000000000008005 RDI: 0000000000000000
[  169.307015] RBP: ffffb70b810cfe18 R08: 000000000000000a R09: ffff9a7e8a13aa2c
[  169.307017] R10: 0000000000000282 R11: 0000000000000000 R12: ffff9a7e8db08c00
[  169.307018] R13: ffff9a7e87878d00 R14: 0000000000000000 R15: ffff9a7e8ded9000
[  169.307019] FS:  0000000000000000(0000) GS:ffff9a81becc0000(0000) knlGS:0000000000000000
[  169.307020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  169.307021] CR2: 0000000000040028 CR3: 0000000106efc000 CR4: 0000000000750ee0
[  169.307023] PKRU: 55555554
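
Both oopses fault inside mthca_poll_cq at the same byte sequence (the bytes from the "<48>" marker in the Code: line) and report the same CR2, RSI and RDI. As a sanity check, the following is a minimal sketch that decodes those four bytes and recomputes the faulting address from the dumped registers. It assumes the standard x86-64 ModRM/SIB encoding and handles only the one addressing form seen here; the helper names (parse_registers, decode_mov_load) are made up for illustration and are not part of any kernel tooling.

```python
import re

# Register lines copied from the 5.15.1 oops (timestamps dropped).
# The 5.14.17 oops shows the same RSI, RDI and CR2 values.
OOPS_REGISTERS = """\
RDX: 0000000000000080 RSI: 0000000000008005 RDI: 0000000000000000
CR2: 0000000000040028 CR3: 0000000105cac000 CR4: 0000000000350ee0
"""

# Bytes at the faulting RIP, i.e. starting at the "<48>" marker.
FAULTING_BYTES = bytes.fromhex("488b34f7")

# 64-bit register numbering used by ModRM/SIB when no REX.R/X/B bit is set.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def parse_registers(text):
    """Map each 'NAME: <16 hex digits>' pair in an oops dump to an int."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,3}): ([0-9a-f]{16})\b", text)
    return {name.lower(): int(value, 16) for name, value in pairs}


def decode_mov_load(code):
    """Decode 'REX.W 8B /r' with a SIB operand and no displacement.

    Only the addressing form seen in this oops is handled.
    Returns (dest, base, index, scale).
    """
    rex, opcode, modrm, sib = code[:4]
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("not a plain 64-bit MOV r64, r/m64")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    scale, index, base = 1 << (sib >> 6), (sib >> 3) & 7, sib & 7
    if mod != 0 or rm != 4 or index == 4 or base == 5:
        raise ValueError("addressing form not covered by this sketch")
    return REG64[reg], REG64[base], REG64[index], scale


regs = parse_registers(OOPS_REGISTERS)
dest, base, index, scale = decode_mov_load(FAULTING_BYTES)
fault_addr = regs[base] + regs[index] * scale

print(f"mov {dest}, [{base} + {index}*{scale}]")
print(f"computed {fault_addr:#x}, CR2 {regs['cr2']:#x}")
```

The decoded instruction is a load from [rdi + rsi*8]; with RDI = 0 and RSI = 0x8005 the computed address is 0x40028, which matches CR2 in both traces. That is consistent with a null base pointer being indexed, but it does not by itself say which pointer in mthca_poll_cq is null.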
