> -----Original Message-----
> From: Alexander Murashkin <AlexanderMurashkin@xxxxxxx>
> Sent: Wednesday, January 2, 2019 10:47 PM
> To: Parav Pandit <parav@xxxxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx
> Subject: Re: ib_ipoib: general protection fault in ib_destroy_qp ->
> rdma_put_gid_attr+0x9/0x30 [ib_core]
>
> Hi Parav,
>
> I have built and installed on 3 servers the same kernel version with your
> patch applied. So far, so good - IPoIB is working, no kernel errors in the logs.
>
Great.

> Please let me know when you have a "proper fix"; we need to push it to
> Fedora.
>
> BTW There is another qp = kmalloc(...) in the code. Does it need to be
> changed?
Yes. I will send the patch shortly to cover both cases.

>
> $ grep -n -C6 'qp = kmalloc' drivers/infiniband/hw/mthca/mthca_provider.c
> 596-    case IB_QPT_GSI:
> 597-    {
> 598-            /* Don't allow userspace to create special QPs */
> 599-            if (pd->uobject)
> 600-                    return ERR_PTR(-EINVAL);
> 601-
> 602:            qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL);
> 603-            if (!qp)
> 604-                    return ERR_PTR(-ENOMEM);
> 605-
> 606-            qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0 : 1;
> 607-
> 608-            err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd),
>
> Best regards,
>
> Alex Murashkin
>
> On 12/30/18 8:21 PM, Parav Pandit wrote:
> > Hi Alexander,
> >
> >> -----Original Message-----
> >> From: linux-rdma-owner@xxxxxxxxxxxxxxx <linux-rdma-owner@xxxxxxxxxxxxxxx>
> >> On Behalf Of Alexander Murashkin
> >> Sent: Monday, December 31, 2018 3:26 AM
> >> To: linux-rdma@xxxxxxxxxxxxxxx
> >> Subject: ib_ipoib: general protection fault in ib_destroy_qp ->
> >> rdma_put_gid_attr+0x9/0x30 [ib_core]
> >>
> >> ipoib crashes in rdma_put_gid_attr. It happens every time, often
> >> during the boot process or soon after it, occasionally a few hours
> >> after a reboot.
> >>
> >> After the crash, IPoIB stops working for new connections. An interesting
> >> fact is that TCP sessions created before the crash continue to work.
> >>
> >> The problem occurs on four (4) servers. The servers are running
> >> Fedora 29 with kernel 4.19.10-300.fc29.x86_64. Note that
> >> 4.19.8-300.fc29.x86_64 has the same problem.
> >>
> >> The servers use the same Infiniband controller model, OS, kernel, and
> >> drivers
> >>
> >> Device: 02:00.0 InfiniBand: Mellanox Technologies MT25208 [InfiniHost III Ex] (rev a0)
> >> Firmware: 5.3.0
> >> Driver: ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
> >>
> >> More details at
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1661864
> >>
> >> -------------------------------------------------------------------------
> >> Additional info:
> >> reporter: libreport-2.9.7
> >> general protection fault: 0000 [#1] SMP NOPTI
> >> CPU: 3 PID: 74 Comm: kworker/u16:1 Not tainted 4.19.10-300.fc29.x86_64 #1
> >> Hardware name: To be filled by O.E.M.
To be filled by O.E.M./M5A99X EVO, BIOS 0402 05/16/2011
> >> Workqueue: ipoib_wq ipoib_cm_tx_reap [ib_ipoib]
> >> RIP: 0010:rdma_put_gid_attr+0x9/0x30 [ib_core]
> >> Code: 96 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 7b 30 e8 cc 0d c6 f1 48 89 df
> >> e8 c4 0d c6 f1 eb c3 c3 90 0f 1f 44 00 00 48 8d 57 d8 <f0> ff 4f d8 0f 88 78 65
> >> 01 00 74 01 c3 48 8b 35 2b d0 02 00 48 83
> >> RSP: 0018:ffffb7ad819dbde8 EFLAGS: 00010202
> >> RAX: 0000000000000000 RBX: ffff8d1bdf5a2e00 RCX: 0000000000002699
> >> RDX: 206c656e72656af8 RSI: ffff8d1bf7ae6160 RDI: 206c656e72656b20
> >> RBP: 0000000000000000 R08: 0000000000026160 R09: ffffffffc06b45bf
> >> R10: ffffe849887da000 R11: 0000000000000002 R12: ffff8d1be30cb400
> >> R13: ffff8d1bdf681800 R14: ffff8d1be2272400 R15: ffff8d1be30ca000
> >> FS: 0000000000000000(0000) GS:ffff8d1bf7ac0000(0000) knlGS:0000000000000000
> >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> CR2: 00007f4f99d5dc80 CR3: 000000021878e000 CR4: 00000000000006e0
> >> Call Trace:
> >>  ib_destroy_qp+0xc9/0x240 [ib_core]
> >>  ipoib_cm_tx_reap+0x1f9/0x4e0 [ib_ipoib]
> >>  process_one_work+0x1a1/0x3a0
> >>  worker_thread+0x30/0x380
> >>  ? pwq_unbound_release_workfn+0xd0/0xd0
> >>  kthread+0x112/0x130
> >>  ?
kthread_create_worker_on_cpu+0x70/0x70
> >>  ret_from_fork+0x22/0x40
> >> Modules linked in: nf_log_ipv4 nf_log_common xt_LOG xt_limit
> >> xt_multiport 8021q garp mrp stp llc ip6t_REJECT nf_reject_ipv6
> >> xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> >> libcrc32c it87 hwmon_vid ip6table_filter ip6_tables ib_isert
> >> iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp
> >> rpcrdma rdma_ucm ib_uverbs ib_iser ib_umad rdma_cm iw_cm ib_ipoib
> >> libiscsi scsi_transport_iscsi ib_cm eeepc_wmi amd64_edac_mod asus_wmi
> >> edac_mce_amd sparse_keymap rfkill kvm_amd video wmi_bmof mxm_wmi kvm
> >> irqbypass k10temp snd_hda_codec_realtek snd_hda_codec_hdmi
> >> snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core
> >> ib_mthca sp5100_tco snd_seq snd_hwdep snd_seq_device i2c_piix4
> >> snd_pcm ib_core snd_timer snd soundcore wmi pcc_cpufreq acpi_cpufreq
> >> nfsd binfmt_misc nfs_acl
> >> lockd grace auth_rpcgss sunrpc dm_crypt raid1 ata_generic
> >> i2c_algo_bit uas drm_kms_helper pata_acpi ttm usb_storage
> >> pata_marvell drm firewire_ohci firewire_core crc_itu_t r8169 ecryptfs
> >>
> >> ---------------------------------------------------
> >>
> >> # lspci | grep Mellanox
> >> 02:00.0 InfiniBand: Mellanox Technologies MT25208 [InfiniHost III Ex] (rev a0)
> >>
> >> # ibv_devinfo
> >> hca_id: mthca0
> >>     transport:          InfiniBand (0)
> >>     fw_ver:             5.3.0
> >>     node_guid:          0002:c902:0022:1228
> >>     sys_image_guid:     0005:ad00:0100:d050
> >>     vendor_id:          0x02c9
> >>     vendor_part_id:     25218
> >>     hw_ver:             0xA0
> >>     board_id:           MT_0150000001
> >>     phys_port_cnt:      2
> >>         port: 1
> >>             state:          PORT_ACTIVE (4)
> >>             max_mtu:        2048 (4)
> >>             active_mtu:     2048 (4)
> >>             sm_lid:         2
> >>             port_lid:       5
> >>             port_lmc:       0x00
> >>             link_layer:     InfiniBand
> >>
> >>         port: 2
> >>             state:          PORT_ACTIVE (4)
> >>             max_mtu:        2048 (4)
> >>             active_mtu:     2048 (4)
> >>             sm_lid:         2
> >>             port_lid:       6
> >>             port_lmc:       0x00
> >>             link_layer:     InfiniBand
> >>
> > It seems that qp by mthca
driver is not zero-initialized during creation time.
> > Because of that, alt_sgid_attr might hold a garbage pointer.
> >
> > Is it possible to apply the change below and see if it helps?
> > I will generate a proper fix if this is the likely root cause.
> >
> > diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c
> > b/drivers/infiniband/hw/mthca/mthca_provider.c
> > index bfd741c..9f6c748 100644
> > --- a/drivers/infiniband/hw/mthca/mthca_provider.c
> > +++ b/drivers/infiniband/hw/mthca/mthca_provider.c
> > @@ -533,7 +533,7 @@ static struct ib_qp *mthca_create_qp(struct ib_pd *pd,
> >  {
> >          struct mthca_ucontext *context;
> >
> > -        qp = kmalloc(sizeof *qp, GFP_KERNEL);
> > +        qp = kzalloc(sizeof *qp, GFP_KERNEL);
> >          if (!qp)
> >                  return ERR_PTR(-ENOMEM);
> >
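
For readers following the diagnosis, here is a minimal userspace sketch of why swapping kmalloc for kzalloc could make the crash go away. It is not kernel code: struct fake_qp, struct gid_attr, and all three functions are hypothetical stand-ins, malloc/calloc stand in for kmalloc/kzalloc, and the 0x6b fill only simulates stale heap contents so the demo is deterministic. It also assumes the teardown path releases the attribute only when the pointer is non-NULL, and that an all-zero bit pattern is a null pointer, as on mainstream platforms.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins, not the real ib_core/mthca structures. */
struct gid_attr {
    int refcount;
};

struct fake_qp {
    unsigned int qp_num;
    const struct gid_attr *alt_sgid_attr; /* set only on some code paths */
};

/*
 * Stand-in for a kmalloc()-style allocation: the memory is returned
 * without being cleared. The fill pattern simulates leftover bytes.
 */
struct fake_qp *alloc_qp_dirty(void)
{
    struct fake_qp *qp = malloc(sizeof *qp);

    if (qp)
        memset(qp, 0x6b, sizeof *qp);
    return qp;
}

/* Stand-in for a kzalloc()-style allocation: every byte starts as zero. */
struct fake_qp *alloc_qp_zeroed(void)
{
    return calloc(1, sizeof(struct fake_qp));
}

/*
 * Models the decision a destroy path makes: it returns 1 when it would
 * go on to dereference alt_sgid_attr, and 0 when it would skip it.
 */
int destroy_would_touch_attr(const struct fake_qp *qp)
{
    assert(qp != NULL);
    return qp->alt_sgid_attr != NULL;
}
```

With the dirty allocation, destroy_would_touch_attr() reports that the teardown would follow a pointer nobody ever assigned, which is the kind of access that faults in rdma_put_gid_attr. With the zeroed allocation the field reads as NULL and the release is skipped.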