Hi Alexander,

> -----Original Message-----
> From: linux-rdma-owner@xxxxxxxxxxxxxxx <linux-rdma-owner@xxxxxxxxxxxxxxx> On Behalf Of Alexander Murashkin
> Sent: Monday, December 31, 2018 3:26 AM
> To: linux-rdma@xxxxxxxxxxxxxxx
> Subject: ib_ipoib: general protection fault in ib_destroy_qp ->
> rdma_put_gid_attr+0x9/0x30 [ib_core]
>
> ipoib crashes in rdma_put_gid_attr. It happens every time, often during boot
> process or soon after it, occasionally after few hours since a reboot.
>
> After the crash, IPoIB stops working for new connections. Interesting fact is
> that TCP sessions created before the crash continue to work.
>
> The problem occurs on four (4) servers. The servers are running Fedora 29
> with kernel 4.19.10-300.fc29.x86_64. Note that 4.19.8-300.fc29.x86_64 has
> the same problem.
>
> The servers use the same Infiniband controller model, OS, kernel, and drivers
>
> Device: 02:00.0 InfiniBand: Mellanox Technologies MT25208 [InfiniHost III
> Ex] (rev a0)
> Firmware: 5.3.0
> Driver: ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
>
> More details at https://bugzilla.redhat.com/show_bug.cgi?id=1661864
>
> -------------------------------------------------------------------------
> Additional info:
> reporter: libreport-2.9.7
> general protection fault: 0000 [#1] SMP NOPTI
> CPU: 3 PID: 74 Comm: kworker/u16:1 Not tainted 4.19.10-300.fc29.x86_64 #1
> Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A99X EVO,
> BIOS 0402 05/16/2011
> Workqueue: ipoib_wq ipoib_cm_tx_reap [ib_ipoib]
> RIP: 0010:rdma_put_gid_attr+0x9/0x30 [ib_core]
> Code: 96 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 7b 30 e8 cc 0d c6 f1 48 89 df
> e8 c4 0d c6 f1 eb c3 c3 90 0f 1f 44 00 00 48 8d 57 d8 <f0> ff 4f d8 0f 88 78 65
> 01 00 74 01 c3 48 8b 35 2b d0 02 00 48 83
> RSP: 0018:ffffb7ad819dbde8 EFLAGS: 00010202
> RAX: 0000000000000000 RBX: ffff8d1bdf5a2e00 RCX: 0000000000002699
> RDX: 206c656e72656af8 RSI: ffff8d1bf7ae6160 RDI: 206c656e72656b20
> RBP: 0000000000000000 R08: 0000000000026160 R09: ffffffffc06b45bf
> R10: ffffe849887da000 R11: 0000000000000002 R12: ffff8d1be30cb400
> R13: ffff8d1bdf681800 R14: ffff8d1be2272400 R15: ffff8d1be30ca000
> FS: 0000000000000000(0000) GS:ffff8d1bf7ac0000(0000)
> knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f4f99d5dc80 CR3: 000000021878e000 CR4: 00000000000006e0
> Call Trace:
> ib_destroy_qp+0xc9/0x240 [ib_core]
> ipoib_cm_tx_reap+0x1f9/0x4e0 [ib_ipoib]
> process_one_work+0x1a1/0x3a0
> worker_thread+0x30/0x380
> ? pwq_unbound_release_workfn+0xd0/0xd0
> kthread+0x112/0x130
> ? kthread_create_worker_on_cpu+0x70/0x70
> ret_from_fork+0x22/0x40
> Modules linked in: nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_multiport
> 8021q garp mrp stp llc ip6t_REJECT nf_reject_ipv6 xt_state xt_conntrack
> nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c it87 hwmon_vid
> ip6table_filter ip6_tables ib_isert iscsi_target_mod ib_srpt target_core_mod
> ib_srp scsi_transport_srp rpcrdma rdma_ucm ib_uverbs ib_iser ib_umad
> rdma_cm iw_cm ib_ipoib libiscsi scsi_transport_iscsi ib_cm eeepc_wmi
> amd64_edac_mod asus_wmi edac_mce_amd sparse_keymap rfkill kvm_amd
> video wmi_bmof mxm_wmi kvm irqbypass k10temp snd_hda_codec_realtek
> snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel
> snd_hda_codec snd_hda_core ib_mthca sp5100_tco snd_seq snd_hwdep
> snd_seq_device i2c_piix4 snd_pcm ib_core snd_timer snd soundcore wmi
> pcc_cpufreq acpi_cpufreq nfsd binfmt_misc nfs_acl
> lockd grace auth_rpcgss sunrpc dm_crypt raid1 ata_generic i2c_algo_bit uas
> drm_kms_helper pata_acpi ttm usb_storage pata_marvell drm firewire_ohci
> firewire_core crc_itu_t r8169 ecryptfs
>
> ---------------------------------------------------
>
> # lspci | grep Mellanox
> 02:00.0 InfiniBand: Mellanox Technologies MT25208 [InfiniHost III Ex] (rev a0)
>
> # ibv_devinfo
> hca_id: mthca0
>     transport: InfiniBand (0)
>     fw_ver: 5.3.0
>     node_guid: 0002:c902:0022:1228
>     sys_image_guid: 0005:ad00:0100:d050
>     vendor_id: 0x02c9
>     vendor_part_id: 25218
>     hw_ver: 0xA0
>     board_id: MT_0150000001
>     phys_port_cnt: 2
>     port: 1
>         state: PORT_ACTIVE (4)
>         max_mtu: 2048 (4)
>         active_mtu: 2048 (4)
>         sm_lid: 2
>         port_lid: 5
>         port_lmc: 0x00
>         link_layer: InfiniBand
>
>     port: 2
>         state: PORT_ACTIVE (4)
>         max_mtu: 2048 (4)
>         active_mtu: 2048 (4)
>         sm_lid: 2
>         port_lid: 6
>         port_lmc: 0x00
>         link_layer: InfiniBand
>

It seems that the qp allocated by the mthca driver is not zero-initialized at creation time, so alt_sgid_attr may hold a garbage pointer.

Could you apply the change below and check whether it gets past the crash? I will send a proper fix if this turns out to be the likely root cause.

diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
index bfd741c..9f6c748 100644
--- a/drivers/infiniband/hw/mthca/mthca_provider.c
+++ b/drivers/infiniband/hw/mthca/mthca_provider.c
@@ -533,7 +533,7 @@ static struct ib_qp *mthca_create_qp(struct ib_pd *pd,
 	{
 		struct mthca_ucontext *context;
-		qp = kmalloc(sizeof *qp, GFP_KERNEL);
+		qp = kzalloc(sizeof *qp, GFP_KERNEL);
 		if (!qp)
 			return ERR_PTR(-ENOMEM);