On Fri, Nov 12, 2021 at 11:34:12AM -0800, Drew Fitch wrote: > To whom it may concern, > > I hope this is the right place to submit Infiniband driver bug reports. > As I understand it, cards that rely on the mthca driver are relatively > old and I would understand if they are no longer supported. > I've attached files detailing my tests and conditions and relevant > dmesg snippets. If there is any other information I can provide, or > somewhere else I should send this, please let me know. > > Thanks in advance! > Andrew > Infiniband MTHCA module crash > > Working on Ubuntu 20.04 kernel 5.4 > Crashing on Alpine kernel 5.15.1, Xanmod kernel 5.14.17, Ubuntu 21.04 kernel 5.11 > Rootfs is Alpine Linux Edge (nfs shared rootfs) > all nodes run the same rootfs regardless of kernel > all cards have the latest firmware. Switch is an IS5022. I don't know if anyone from the active developers in this ML have such card. Can you try to find which field in mthca_poll_cq() causes to this crash? Thanks > > OpenSM versions tested: > Distribution version(Alpine Testing); OFED 3.3.20 (with alpine musl fixes patch); linux-rdma (also with alpine musl fixes patch); Ubuntu 20.04 distribution (run from chroot). > > Crash conditions as tested: > Run opensm(any version listed above) > Opensm sits at "Entering DISCOVERY state" > dmesg entries as attached > module ib_mthca can no longer be unloaded > > Working conditions (with Ubuntu 20.04 kernel 5.4) > Run up opensm same as above > > ipoib interfaces can be brought up and ping other nodes that share the same kernel (5.4) > > Note: > Nodes that are not running kernel 5.4 on the same switch will have their kernel modules crash when opensm is run on a node running kernel 5.4 > Infiniband MTHCA module crash - 5.15.1 > > [ 42.545456] BUG: unable to handle page fault for address: 0000000000040028 > [ 42.545464] #PF: supervisor read access in kernel mode > [ 42.545467] #PF: error_code(0x0000) - not-present page > [ 42.545469] PGD 0 P4D 0 > [ 42.545471] Oops: 0000 [#1] SMP NOPTI > [ 42.545474] CPU: 16 PID: 509 Comm: kworker/u65:0 Tainted: P O 5.15.1-3-lts #4-Alpine > [ 42.545478] Hardware name: Micro-Star International Co., Ltd. MS-7A34/B350 PC MATE (MS-7A34), BIOS A.LR 07/02/2020 > [ 42.545480] Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core] > [ 42.545499] RIP: 0010:mthca_poll_cq+0x1e1/0x860 [ib_mthca] > [ 42.545505] Code: 8d 84 24 28 02 00 00 0f c9 41 2b 8c 24 64 02 00 00 89 ce 41 8b 8c 24 4c 02 00 00 d3 ee 89 f1 41 03 b4 24 f4 01 00 00 48 63 f6 <48> 8b 34 f7 49 89 37 48 85 c0 74 1b 44 8b 48 0c 8b 78 14 41 39 c9 > [ 42.545509] RSP: 0018:ffffaef8814a7ce0 EFLAGS: 00010006 > [ 42.545511] RAX: ffff9336081da728 RBX: ffff933637134000 RCX: 0000000000008005 > [ 42.545513] RDX: 0000000000000080 RSI: 0000000000008005 RDI: 0000000000000000 > [ 42.545515] RBP: ffff9336061c8400 R08: 000000000000000a R09: ffff933605af28b4 > [ 42.545517] R10: 0000000000000246 R11: 0000000000000000 R12: ffff9336081da500 > [ 42.545519] R13: ffff9336061c84e0 R14: 0000000000000000 R15: ffff9336108d4800 > [ 42.545521] FS: 0000000000000000(0000) GS:ffff9344cec00000(0000) knlGS:0000000000000000 > [ 42.545523] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 42.545525] CR2: 0000000000040028 CR3: 0000000105cac000 CR4: 0000000000350ee0 > [ 42.545527] Call Trace: > [ 42.545530] ? update_load_avg+0x78/0x5a0 > [ 42.545535] ? newidle_balance+0x123/0x3f0 > [ 42.545538] ? __switch_to_asm+0x42/0x70 > [ 42.545541] ? finish_task_switch.isra.0+0xa7/0x280 > [ 42.545545] __ib_process_cq+0x57/0x150 [ib_core] > [ 42.545558] ib_cq_poll_work+0x26/0x80 [ib_core] > [ 42.545570] process_one_work+0x1ec/0x390 > [ 42.545573] worker_thread+0x53/0x3c0 > [ 42.545575] ? process_one_work+0x390/0x390 > [ 42.545577] kthread+0x127/0x150 > [ 42.545580] ? set_kthread_struct+0x40/0x40 > [ 42.545583] ret_from_fork+0x22/0x30 > [ 42.545586] Modules linked in: bonding xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter ip_tables x_tables bridge stp llc nfsd auth_rpcgss lockd grace sunrpc nls_utf8 nls_cp437 vfat fat ftdi_sio usbserial ib_ipoib ib_umad ib_cm af_packet r8153_ecm cdc_ether usbnet r8152 mii pcspkr efi_pstore ib_mthca ib_uverbs ib_core ipv6 sp5100_tco i2c_piix4 k10temp input_leds mousedev intel_rapl_msr joydev intel_rapl_common kvm_amd ccp rng_core kvm irqbypass crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd wmi_bmof rapl wmi parport_pc parport evdev button acpi_cpufreq efivarfs hid_generic usbhid hid crc32_pclmul r8169 realtek mdio_devres libphy nvme nvme_core hwmon ahci libahci libata xhci_pci xhci_pci_renesas xhci_hcd simpledrm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt > [ 42.545632] fb_sys_fops cfbcopyarea cec drm i2c_core drm_panel_orientation_quirks agpgart loop zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) ext4 crc32c_generic crc32c_intel crc16 mbcache jbd2 usb_storage usbcore usb_common sd_mod t10_pi scsi_mod > [ 42.545662] CR2: 0000000000040028 > [ 42.545664] ---[ end trace 4634d65b0351fcb8 ]--- > [ 42.545665] RIP: 0010:mthca_poll_cq+0x1e1/0x860 [ib_mthca] > [ 42.545670] Code: 8d 84 24 28 02 00 00 0f c9 41 2b 8c 24 64 02 00 00 89 ce 41 8b 8c 24 4c 02 00 00 d3 ee 89 f1 41 03 b4 24 f4 01 00 00 48 63 f6 <48> 8b 34 f7 49 89 37 48 85 c0 74 1b 44 8b 48 0c 8b 78 14 41 39 c9 > [ 42.545674] RSP: 0018:ffffaef8814a7ce0 EFLAGS: 00010006 > [ 42.545676] RAX: ffff9336081da728 RBX: ffff933637134000 RCX: 0000000000008005 > [ 42.545677] RDX: 0000000000000080 RSI: 0000000000008005 RDI: 0000000000000000 > [ 42.545679] RBP: ffff9336061c8400 R08: 000000000000000a R09: ffff933605af28b4 > [ 42.545681] R10: 0000000000000246 R11: 0000000000000000 R12: ffff9336081da500 > [ 42.545682] R13: ffff9336061c84e0 R14: 0000000000000000 R15: ffff9336108d4800 > [ 42.545684] FS: 0000000000000000(0000) GS:ffff9344cec00000(0000) knlGS:0000000000000000 > [ 42.545686] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 42.545688] CR2: 0000000000040028 CR3: 0000000105cac000 CR4: 0000000000350ee0 > Infiniband MTHCA module crash - 5.14.17 > > [ 169.267974] BUG: unable to handle page fault for address: 0000000000040028 > [ 169.267980] #PF: supervisor read access in kernel mode > [ 169.267982] #PF: error_code(0x0000) - not-present page > [ 169.267984] PGD 0 P4D 0 > [ 169.267986] Oops: 0000 [#1] SMP NOPTI > [ 169.267989] CPU: 11 PID: 891 Comm: kworker/u65:2 Not tainted 5.14.17-xanmod1 #0~git20211106.2bf32bb > [ 169.267992] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./A520M-HDV, BIOS P1.60 03/18/2021 > [ 169.267994] Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core] > [ 169.268005] RIP: 0010:mthca_poll_cq+0x1db/0x830 [ib_mthca] > [ 169.268010] Code: 02 00 00 49 8d 85 28 02 00 00 0f c9 41 2b 8d 64 02 00 00 89 ce 41 8b 8d 4c 02 00 00 d3 ee 89 f1 41 03 b5 f4 01 00 00 48 63 f6 <48> 8b 34 f7 49 89 37 48 85 c0 74 1c 44 8b 48 0c 8b 78 14 41 39 c9 > [ 169.268013] RSP: 0018:ffffb70b810cfcd0 EFLAGS: 00010006 > [ 169.268014] RAX: ffff9a7e87878f28 RBX: ffff9a7e878d5000 RCX: 0000000000008005 > [ 169.268016] RDX: 0000000000000080 RSI: 0000000000008005 RDI: 0000000000000000 > [ 169.268017] RBP: ffffb70b810cfe18 R08: 000000000000000a R09: ffff9a7e8a13aa2c > [ 169.268019] R10: 0000000000000282 R11: 0000000000000000 R12: ffff9a7e8db08c00 > [ 169.268020] R13: ffff9a7e87878d00 R14: 0000000000000000 R15: ffff9a7e8ded9000 > [ 169.268022] FS: 0000000000000000(0000) GS:ffff9a81becc0000(0000) knlGS:0000000000000000 > [ 169.268024] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 169.268025] CR2: 0000000000040028 CR3: 0000000106efc000 CR4: 0000000000750ee0 > [ 169.268027] PKRU: 55555554 > [ 169.268028] Call Trace: > [ 169.268031] ? release_sock+0xa/0x90 > [ 169.268035] ? __cond_resched+0x11/0x40 > [ 169.268038] ? update_load_avg+0x7a/0x530 > [ 169.268041] ? newidle_balance+0x11b/0x3f0 > [ 169.268043] ? dequeue_entity+0xc1/0x3f0 > [ 169.268045] ? __switch_to_asm+0x42/0x70 > [ 169.268048] ? finish_task_switch.isra.0+0xa2/0x280 > [ 169.268050] __ib_process_cq+0x49/0xd0 [ib_core] > [ 169.268058] ib_cq_poll_work+0x21/0x80 [ib_core] > [ 169.268065] process_one_work+0x1f5/0x350 > [ 169.268068] worker_thread+0x4b/0x400 > [ 169.268069] ? process_one_work+0x350/0x350 > [ 169.268071] kthread+0x122/0x140 > [ 169.268073] ? set_kthread_struct+0x30/0x30 > [ 169.268076] ret_from_fork+0x22/0x30 > [ 169.268079] Modules linked in: overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter ip_tables x_tables rpcsec_gss_krb5 bridge stp llc auth_rpcgss nfsv4 ib_qib rdmavt dca ib_ipoib ib_umad ib_cm wmi_bmof pcspkr efi_pstore nvme nvme_core ib_mthca ib_uverbs ib_core sp5100_tco ahci libahci k10temp intel_rapl_msr mac_hid intel_rapl_common edac_mce_amd kvm_amd ccp kvm crct10dif_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd rapl nfsv3 nfs_acl nfs lockd grace sunrpc fscache netfs ftdi_sio usbserial r8169 crc32_pclmul realtek mdio_devres i2c_piix4 libphy xhci_pci xhci_pci_renesas wmi gpio_amdpt gpio_generic > [ 169.268114] CR2: 0000000000040028 > [ 169.268115] ---[ end trace 0b7e3d6ee2a7b04a ]--- > [ 169.307003] RIP: 0010:mthca_poll_cq+0x1db/0x830 [ib_mthca] > [ 169.307009] Code: 02 00 00 49 8d 85 28 02 00 00 0f c9 41 2b 8d 64 02 00 00 89 ce 41 8b 8d 4c 02 00 00 d3 ee 89 f1 41 03 b5 f4 01 00 00 48 63 f6 <48> 8b 34 f7 49 89 37 48 85 c0 74 1c 44 8b 48 0c 8b 78 14 41 39 c9 > [ 169.307011] RSP: 0018:ffffb70b810cfcd0 EFLAGS: 00010006 > [ 169.307013] RAX: ffff9a7e87878f28 RBX: ffff9a7e878d5000 RCX: 0000000000008005 > [ 169.307014] RDX: 0000000000000080 RSI: 0000000000008005 RDI: 0000000000000000 > [ 169.307015] RBP: ffffb70b810cfe18 R08: 000000000000000a R09: ffff9a7e8a13aa2c > [ 169.307017] R10: 0000000000000282 R11: 0000000000000000 R12: ffff9a7e8db08c00 > [ 169.307018] R13: ffff9a7e87878d00 R14: 0000000000000000 R15: ffff9a7e8ded9000 > [ 169.307019] FS: 0000000000000000(0000) GS:ffff9a81becc0000(0000) knlGS:0000000000000000 > [ 169.307020] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 169.307021] CR2: 0000000000040028 CR3: 0000000106efc000 CR4: 0000000000750ee0 > [ 169.307023] PKRU: 55555554