RE: rdma resource warning on 4.16-rc1 when unloading qedr after NFS mount

> From: Chuck Lever [mailto:chuck.lever@xxxxxxxxxx]
> Sent: Wednesday, February 14, 2018 6:58 PM
> To: Kalderon, Michal <Michal.Kalderon@xxxxxxxxxx>
> Cc: Leon Romanovsky <leon@xxxxxxxxxx>; Le, Thong
> <Thong.Le@xxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx
> Subject: Re: rdma resource warning on 4.16-rc1 when unloading qedr after
> NFS mount
> 
> 
> 
> > On Feb 14, 2018, at 11:49 AM, Kalderon, Michal
> <Michal.Kalderon@xxxxxxxxxx> wrote:
> >
> >> From: Leon Romanovsky [mailto:leon@xxxxxxxxxx]
> >> Sent: Wednesday, February 14, 2018 6:34 PM
> >> To: Chuck Lever <chuck.lever@xxxxxxxxxx>
> >> Cc: Kalderon, Michal <Michal.Kalderon@xxxxxxxxxx>; Le, Thong
> >> <Thong.Le@xxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx
> >> Subject: Re: rdma resource warning on 4.16-rc1 when unloading qedr
> >> after NFS mount
> >>
> >> On Wed, Feb 14, 2018 at 11:20:39AM -0500, Chuck Lever wrote:
> >>>
> >>>
> >>>> On Feb 14, 2018, at 11:00 AM, Kalderon, Michal
> >> <Michal.Kalderon@xxxxxxxxxx> wrote:
> >>>>
> >>>> Hi Leon, Chuck,
> >>>>
> >>>> We ran an NFS mount over qedr using 4.16-rc1. When unloading qedr we
> >>>> get a WARNING from the resource tracker (pasted below).
> >>>>
> >>>> Can you please advise on the best way to debug this? How can we get
> >>>> more info on the resource not being freed?
> >>>
> >>> I haven't seen this kind of report before, so I can't directly
> >>> answer your questions. But can you tell us more about reproducing it:
> >>
> >> It is the resource tracking that was added in the last merge window.
> >>
> >>>
> >>> - Is there a workload running on the NFS mount point when the module
> >>> is unloaded?
> > no
> >>>
> >>> - Is the issue 100% reproducible, or intermittent?
> > Seems to be
> >>>
> >>> - Have you tried bisecting?
> > No, bisecting is a tough one here since we ran this scenario to verify
> > the last two related nfs fixes:
> > e89e8d8 xprtrdma: Fix BUG after a device removal
> > 1179e2c xprtrdma: Fix calculation of ri_max_send_sges
> >
> >>
> >> It will be one of three patches:
> >> 9d5f8c209b3f RDMA/core: Add resource tracking for create and destroy PDs
> >> 08f294a1524b RDMA/core: Add resource tracking for create and destroy CQs
> >> 78a0cd648a80 RDMA/core: Add resource tracking for create and destroy QPs
> > Do you think these could lead to a resource not being freed? Or only issues
> > with tracking?
> >
> >>
> >>>
> >>> - iWARP, RoCE, or both?
> > Only tested over RoCE for now
> >>>
> >>> - Have you tried reproducing with a different model of device?
> > no
> >>
> >> I doubt that it is related to device, it looks like a resource leak
> >> while removing rpcrdma.
> >>
> >> We definitely need to add more information to this warning to
> >> understand which one of three available resources wasn't freed.
> >
> > Missed an output from our driver saying there's a PD not freed. As
> > mentioned, due to other issues we're not sure whether we've seen this
> > message from our driver in the past.
> 
> When I've tested device unload with rpcrdma.ko, the unload hangs if
> rpcrdma.ko doesn't release all resources.
> 
> rpcrdma_ia_remove() releases transport resources. It destroys the QP and
> CQs, but leaves the ID and PD to be destroyed by the device driver or core.
> The CM event handler returns 1 to signal this is the case.
> 
> I suspect it could be a driver bug.
Our driver doesn't take care of releasing PDs; it relies on the layers above to do so.
Why should the PD be treated differently from the CQs/QPs in this case?
We will look into this further to understand whether this is newly introduced.
Thanks
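
To illustrate what more information in the warning could look like, here is a minimal user-space sketch of a tracker that counts live objects per type and names the leaked type when the device goes away. It is only a model under assumptions: res_tracker, res_add, res_del and res_clean are hypothetical names, not ib_core functions, and the real restrack code may be organized quite differently.

```c
/* User-space model only; none of these names are kernel APIs. */
#include <assert.h>
#include <stdio.h>

enum res_type { RES_PD, RES_CQ, RES_QP, RES_MAX };

static const char *const res_name[RES_MAX] = { "PD", "CQ", "QP" };

struct res_tracker {
	int live[RES_MAX];	/* objects created but not yet destroyed */
};

/* Record creation of one object of the given type. */
void res_add(struct res_tracker *t, enum res_type type)
{
	assert(type < RES_MAX);
	t->live[type]++;
}

/* Record destruction of one object of the given type. */
void res_del(struct res_tracker *t, enum res_type type)
{
	assert(type < RES_MAX && t->live[type] > 0);
	t->live[type]--;
}

/*
 * Called when the (modelled) device is unregistered. Returns the number
 * of leaked objects and prints one line per leaked type, rather than a
 * single warning that does not say what was left behind.
 */
int res_clean(const struct res_tracker *t, const char *dev)
{
	int leaked = 0;

	for (int i = 0; i < RES_MAX; i++) {
		if (t->live[i] == 0)
			continue;
		fprintf(stderr, "%s: %d %s not freed\n",
			dev, t->live[i], res_name[i]);
		leaked += t->live[i];
	}
	return leaked;
}
```

With one PD, one CQ and one QP added, and only the QP and CQ deleted (the case described above, where the PD is left for someone else to destroy), res_clean() returns 1 and names the PD.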

> 
> 
> >>>> Thanks,
> >>>> Michal
> >>>>
> >>>> GAD17990 login: [  300.480137] ib_srpt srpt_remove_one(qedr0): nothing to do.
> >>>> [  300.515527] ib_srpt srpt_remove_one(qedr1): nothing to do.
> >>>> [  300.542182] rpcrdma: removing device qedr1 for 192.168.110.146:20049
> >>>> [  300.573789] WARNING: CPU: 12 PID: 3545 at drivers/infiniband/core/restrack.c:20 rdma_restrack_clean+0x25/0x30 [ib_core]
> >>>> [  300.625985] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm 8021q garp mrp qedr(-) ib_core xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables fuse ip6table_filter ip6_tables iptable_filter dm_mirror dm_region_hash dm_log dm_mod vfat fat dax intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd ipmi_si
> >>>> [  300.972993]  iTCO_wdt ipmi_devintf sg pcspkr iTCO_vendor_support hpwdt hpilo lpc_ich ipmi_msghandler pcc_cpufreq ioatdma i2c_i801 mfd_core wmi shpchp dca acpi_power_meter i2c_core nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod qede qed crc32c_intel tg3 hpsa scsi_transport_sas crc8
> >>>> [  301.109036] CPU: 12 PID: 3545 Comm: rmmod Not tainted 4.16.0-rc1 #1
> >>>> [  301.139518] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 02/17/2017
> >>>> [  301.180411] RIP: 0010:rdma_restrack_clean+0x25/0x30 [ib_core]
> >>>> [  301.208350] RSP: 0018:ffffb1820478fe88 EFLAGS: 00010286
> >>>> [  301.233241] RAX: 0000000000000000 RBX: ffffa099ed1b4070 RCX: ffffdf02a193c800
> >>>> [  301.268001] RDX: ffffa095ed12d7a0 RSI: 0000000000025900 RDI: ffffa099ed1b47d0
> >>>> [  301.302530] RBP: ffffa099ed1b4070 R08: ffffa095de9dd000 R09: 0000000180080007
> >>>> [  301.337245] R10: 0000000000000001 R11: ffffa095de9dd000 R12: ffffa099ed1b4000
> >>>> [  301.372151] R13: ffffa099ed1b405c R14: 0000000000e231c0 R15: 0000000000e23010
> >>>> [  301.407384] FS:  00007f2b0c854740(0000) GS:ffffa099ff700000(0000) knlGS:0000000000000000
> >>>> [  301.447026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>>> [  301.475409] CR2: 0000000000e2caf8 CR3: 0000000865c0d006 CR4: 00000000001606e0
> >>>> [  301.510892] Call Trace:
> >>>> [  301.522715]  ib_unregister_device+0xf5/0x190 [ib_core]
> >>>> [  301.547966]  qedr_remove+0x37/0x60 [qedr]
> >>>> [  301.568393]  qede_rdma_unregister_driver+0x4b/0x90 [qede]
> >>>> [  301.594980]  SyS_delete_module+0x168/0x240
> >>>> [  301.615057]  do_syscall_64+0x6f/0x1a0
> >>>> [  301.633588]  entry_SYSCALL_64_after_hwframe+0x21/0x86
> >>>> [  301.658657] RIP: 0033:0x7f2b0bd33707
> >>>> [  301.676005] RSP: 002b:00007ffdefa29d98 EFLAGS: 00000202 ORIG_RAX: 00000000000000b0
> >>>> [  301.713324] RAX: ffffffffffffffda RBX: 0000000000e231c0 RCX: 00007f2b0bd33707
> >>>> [  301.748186] RDX: 00007f2b0bda3a80 RSI: 0000000000000800 RDI: 0000000000e23228
> >>>> [  301.782960] RBP: 0000000000000000 R08: 00007f2b0bff8060 R09: 00007f2b0bda3a80
> >>>> [  301.818142] R10: 00007ffdefa29b20 R11: 0000000000000202 R12: 00007ffdefa2b70d
> >>>> [  301.853290] R13: 0000000000000000 R14: 0000000000e231c0 R15: 0000000000e23010
> >>>> [  301.888138] Code: 84 00 00 00 00 00 0f 1f 44 00 00 48 83 c7 28 31 c0 eb 0c 48 83 c0 08 48 3d 00 08 00 00 74 0f 48 8d 14 07 48 8b 12 48 85 d2 74 e8 <0f> ff c3 f3 c3 66 0f 1f 44 00 00 0f 1f 44 00 00 53 48 8b 47 28
> >>>> [  301.981140] ---[ end trace 28dec8f15205789a ]---
> >>>
> >>> --
> >>> Chuck Lever
> >>>
> >>>
> >>>
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma"
> > in the body of a message to majordomo@xxxxxxxxxxxxxxx More
> majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> 
> --
> Chuck Lever
> 
> 
