RE: ib_ipoib: general protection fault in ib_destroy_qp -> rdma_put_gid_attr+0x9/0x30 [ib_core]

> -----Original Message-----
> From: Alexander Murashkin <AlexanderMurashkin@xxxxxxx>
> Sent: Wednesday, January 2, 2019 10:47 PM
> To: Parav Pandit <parav@xxxxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx
> Subject: Re: ib_ipoib: general protection fault in ib_destroy_qp ->
> rdma_put_gid_attr+0x9/0x30 [ib_core]
> 
> Hi Parav,
> 
> I have built and installed on 3 servers the same kernel version with your
> patch applied. So far, so good - IPoIB is working, no kernel errors in the logs.
> 
Great.

> Please let me know when you have "proper fix", we need to push it to
> Fedora.
> 
> BTW There is another qp = kmalloc(...) in the code. Does it need to be
> changed?

Yes.
I will send the patch shortly to cover both cases.

> 
> $ grep -n -C6  'qp = kmalloc' drivers/infiniband/hw/mthca/mthca_provider.c
> 596-    case IB_QPT_GSI:
> 597-    {
> 598-        /* Don't allow userspace to create special QPs */
> 599-        if (pd->uobject)
> 600-            return ERR_PTR(-EINVAL);
> 601-
> 602:        qp = kmalloc(sizeof (struct mthca_sqp), GFP_KERNEL);
> 603-        if (!qp)
> 604-            return ERR_PTR(-ENOMEM);
> 605-
> 606-        qp->ibqp.qp_num = init_attr->qp_type == IB_QPT_SMI ? 0 : 1;
> 607-
> 608-        err = mthca_alloc_sqp(to_mdev(pd->device), to_mpd(pd),
> 
> Best regards,
> 
>      Alex Murashkin
> 
> On 12/30/18 8:21 PM, Parav Pandit wrote:
> > Hi Alexander,
> >
> >> -----Original Message-----
> >> From: linux-rdma-owner@xxxxxxxxxxxxxxx <linux-rdma-
> >> owner@xxxxxxxxxxxxxxx> On Behalf Of Alexander Murashkin
> >> Sent: Monday, December 31, 2018 3:26 AM
> >> To: linux-rdma@xxxxxxxxxxxxxxx
> >> Subject: ib_ipoib: general protection fault in ib_destroy_qp ->
> >> rdma_put_gid_attr+0x9/0x30 [ib_core]
> >>
> >> ipoib crashes in rdma_put_gid_attr. It happens every time, often
> >> during the boot process or soon after it, occasionally a few hours
> >> after a reboot.
> >>
> >> After the crash, IPoIB stops working for new connections. Interesting
> >> fact is that TCP sessions created before the crash continue to work.
> >>
> >> The problem occurs on four (4) servers. The servers are running
> >> Fedora 29 with kernel 4.19.10-300.fc29.x86_64. Note that
> >> 4.19.8-300.fc29.x86_64 has the same problem.
> >>
> >> The servers use the same Infiniband controller model, OS, kernel, and
> >> drivers
> >>
> >> Device:   02:00.0 InfiniBand: Mellanox Technologies MT25208 [InfiniHost III Ex] (rev a0)
> >> Firmware: 5.3.0
> >> Driver:   ib_mthca: Mellanox InfiniBand HCA driver v1.0 (April 4, 2008)
> >>
> >> More details at
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1661864
> >>
> >> -------------------------------------------------------------------------
> >> Additional info:
> >> reporter:       libreport-2.9.7
> >> general protection fault: 0000 [#1] SMP NOPTI
> >> CPU: 3 PID: 74 Comm: kworker/u16:1 Not tainted 4.19.10-300.fc29.x86_64 #1
> >> Hardware name: To be filled by O.E.M. To be filled by O.E.M./M5A99X EVO, BIOS 0402 05/16/2011
> >> Workqueue: ipoib_wq ipoib_cm_tx_reap [ib_ipoib]
> >> RIP: 0010:rdma_put_gid_attr+0x9/0x30 [ib_core]
> >> Code: 96 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 7b 30 e8 cc 0d c6 f1 48 89 df e8 c4 0d c6 f1 eb c3 c3 90 0f 1f 44 00 00 48 8d 57 d8 <f0> ff 4f d8 0f 88 78 65 01 00 74 01 c3 48 8b 35 2b d0 02 00 48 83
> >> RSP: 0018:ffffb7ad819dbde8 EFLAGS: 00010202
> >> RAX: 0000000000000000 RBX: ffff8d1bdf5a2e00 RCX: 0000000000002699
> >> RDX: 206c656e72656af8 RSI: ffff8d1bf7ae6160 RDI: 206c656e72656b20
> >> RBP: 0000000000000000 R08: 0000000000026160 R09: ffffffffc06b45bf
> >> R10: ffffe849887da000 R11: 0000000000000002 R12: ffff8d1be30cb400
> >> R13: ffff8d1bdf681800 R14: ffff8d1be2272400 R15: ffff8d1be30ca000
> >> FS:  0000000000000000(0000) GS:ffff8d1bf7ac0000(0000) knlGS:0000000000000000
> >> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> CR2: 00007f4f99d5dc80 CR3: 000000021878e000 CR4: 00000000000006e0
> >> Call Trace:
> >>  ib_destroy_qp+0xc9/0x240 [ib_core]
> >>  ipoib_cm_tx_reap+0x1f9/0x4e0 [ib_ipoib]
> >>  process_one_work+0x1a1/0x3a0
> >>  worker_thread+0x30/0x380
> >>  ? pwq_unbound_release_workfn+0xd0/0xd0
> >>  kthread+0x112/0x130
> >>  ? kthread_create_worker_on_cpu+0x70/0x70
> >>  ret_from_fork+0x22/0x40
> >> Modules linked in: nf_log_ipv4 nf_log_common xt_LOG xt_limit
> >> xt_multiport 8021q garp mrp stp llc ip6t_REJECT nf_reject_ipv6
> >> xt_state xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4
> >> libcrc32c it87 hwmon_vid ip6table_filter ip6_tables ib_isert
> >> iscsi_target_mod ib_srpt target_core_mod ib_srp scsi_transport_srp
> >> rpcrdma rdma_ucm ib_uverbs ib_iser ib_umad rdma_cm iw_cm ib_ipoib
> >> libiscsi scsi_transport_iscsi ib_cm eeepc_wmi amd64_edac_mod asus_wmi
> >> edac_mce_amd sparse_keymap rfkill kvm_amd video wmi_bmof mxm_wmi kvm
> >> irqbypass k10temp snd_hda_codec_realtek snd_hda_codec_hdmi
> >> snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core
> >> ib_mthca sp5100_tco snd_seq snd_hwdep snd_seq_device i2c_piix4
> >> snd_pcm ib_core snd_timer snd soundcore wmi pcc_cpufreq acpi_cpufreq
> >> nfsd binfmt_misc nfs_acl
> >>  lockd grace auth_rpcgss sunrpc dm_crypt raid1 ata_generic
> >> i2c_algo_bit uas drm_kms_helper pata_acpi ttm usb_storage
> >> pata_marvell drm firewire_ohci firewire_core crc_itu_t r8169 ecryptfs
> >>
> >> ---------------------------------------------------
> >>
> >> # lspci | grep Mellanox
> >> 02:00.0 InfiniBand: Mellanox Technologies MT25208 [InfiniHost III Ex] (rev a0)
> >>
> >> # ibv_devinfo
> >> hca_id:	mthca0
> >> 	transport:			InfiniBand (0)
> >> 	fw_ver:				5.3.0
> >> 	node_guid:			0002:c902:0022:1228
> >> 	sys_image_guid:			0005:ad00:0100:d050
> >> 	vendor_id:			0x02c9
> >> 	vendor_part_id:			25218
> >> 	hw_ver:				0xA0
> >> 	board_id:			MT_0150000001
> >> 	phys_port_cnt:			2
> >> 		port:	1
> >> 			state:			PORT_ACTIVE (4)
> >> 			max_mtu:		2048 (4)
> >> 			active_mtu:		2048 (4)
> >> 			sm_lid:			2
> >> 			port_lid:		5
> >> 			port_lmc:		0x00
> >> 			link_layer:		InfiniBand
> >>
> >> 		port:	2
> >> 			state:			PORT_ACTIVE (4)
> >> 			max_mtu:		2048 (4)
> >> 			active_mtu:		2048 (4)
> >> 			sm_lid:			2
> >> 			port_lid:		6
> >> 			port_lmc:		0x00
> >> 			link_layer:		InfiniBand
> >>
> > It seems that the qp allocated by the mthca driver is not zero-initialized
> > at creation time, so alt_sgid_attr may hold a garbage pointer.
> >
> > Could you apply the change below and see whether the crash goes away?
> > I will generate a proper fix if this is the likely root cause.
> >
> > diff --git a/drivers/infiniband/hw/mthca/mthca_provider.c b/drivers/infiniband/hw/mthca/mthca_provider.c
> > index bfd741c..9f6c748 100644
> > --- a/drivers/infiniband/hw/mthca/mthca_provider.c
> > +++ b/drivers/infiniband/hw/mthca/mthca_provider.c
> > @@ -533,7 +533,7 @@ static struct ib_qp *mthca_create_qp(struct ib_pd *pd,
> >         {
> >                 struct mthca_ucontext *context;
> >
> > -               qp = kmalloc(sizeof *qp, GFP_KERNEL);
> > +               qp = kzalloc(sizeof *qp, GFP_KERNEL);
> >                 if (!qp)
> >                         return ERR_PTR(-ENOMEM);
> >



