Re: rdma resource warning on 4.16-rc1 when unloading qedr after NFS mount

> On Feb 14, 2018, at 11:49 AM, Kalderon, Michal <Michal.Kalderon@xxxxxxxxxx> wrote:
> 
>> From: Leon Romanovsky [mailto:leon@xxxxxxxxxx]
>> Sent: Wednesday, February 14, 2018 6:34 PM
>> To: Chuck Lever <chuck.lever@xxxxxxxxxx>
>> Cc: Kalderon, Michal <Michal.Kalderon@xxxxxxxxxx>; Le, Thong
>> <Thong.Le@xxxxxxxxxx>; linux-rdma@xxxxxxxxxxxxxxx
>> Subject: Re: rdma resource warning on 4.16-rc1 when unloading qedr after
>> NFS mount
>> 
>> On Wed, Feb 14, 2018 at 11:20:39AM -0500, Chuck Lever wrote:
>>> 
>>> 
>>>> On Feb 14, 2018, at 11:00 AM, Kalderon, Michal
>> <Michal.Kalderon@xxxxxxxxxx> wrote:
>>>> 
>>>> Hi Leon, Chuck,
>>>> 
>>>> We ran an NFS mount over qedr using 4.16-rc1. When unloading qedr we get
>>>> a WARNING from the resource tracker (pasted below).
>>>> 
>>>> Can you please advise on the best way to debug this? How can we get
>> more info on the resource not being freed?
>>> 
>>> I haven't seen this kind of report before, so I can't directly answer
>>> your questions. But can you tell us more about reproducing it:
>> 
>> It is the resource tracking that was merged in the last merge window.
>> 
>>> 
>>> - Is there a workload running on the NFS mount point when the module
>>> is unloaded?
> no
>>> 
>>> - Is the issue 100% reproducible, or intermittent?
> Seems to be
>>> 
>>> - Have you tried bisecting?
> No, bisecting is a tough one here since we ran this scenario to verify the last
> two related NFS fixes:
> e89e8d8 xprtrdma: Fix BUG after a device removal
> 1179e2c xprtrdma: Fix calculation of ri_max_send_sges
> 
>> 
>> It will be one of three patches:
>> 9d5f8c209b3f RDMA/core: Add resource tracking for create and destroy PDs
>> 08f294a1524b RDMA/core: Add resource tracking for create and destroy CQs
>> 78a0cd648a80 RDMA/core: Add resource tracking for create and destroy QPs
> Do you think these could lead to a resource not being freed? Or only issues with tracking?
> 
>> 
>>> 
>>> - iWARP, RoCE, or both?
> Only tested over RoCE for now
>>> 
>>> - Have you tried reproducing with a different model of device?
> no
>> 
>> I doubt that it is related to device, it looks like a resource leak while removing
>> rpcrdma.
>> 
>> We definitely need to add more information to this warning to understand
>> which one of three available resources wasn't freed.
> 
> I missed an output from our driver saying there's a PD not freed. As mentioned, due to other
> issues we're not sure whether we've seen this message from our driver in the past.

When I've tested device unload with rpcrdma.ko, the unload hangs
if rpcrdma.ko doesn't release all resources.

rpcrdma_ia_remove() releases transport resources. It destroys the
QP and CQs, but leaves the ID and PD to be destroyed by the device
driver or core. The CM event handler returns 1 to signal this is
the case.
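
To make the hand-off above concrete, here is a rough userspace
model in C. It is only a sketch under stated assumptions: it is not
xprtrdma or ib_core code, and every name in it (struct tracker,
res_add, res_del, ulp_remove, tracker_clean) is hypothetical. It
counts live objects per type, replays a removal path that destroys
the QP and both CQs but leaves the PD, and then names whichever
type is still allocated at unregister time.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum res_type { RES_PD, RES_CQ, RES_QP, RES_NTYPES };

static const char *const res_name[RES_NTYPES] = { "PD", "CQ", "QP" };

struct tracker {
	unsigned int live[RES_NTYPES];	/* created but not yet destroyed */
};

/* Record creation of one object of the given type. */
static void res_add(struct tracker *t, enum res_type type)
{
	assert(type < RES_NTYPES);
	t->live[type]++;
}

/* Record destruction of one object of the given type. */
static void res_del(struct tracker *t, enum res_type type)
{
	assert(type < RES_NTYPES && t->live[type] > 0);
	t->live[type]--;
}

/*
 * Models the removal path described above: the QP and both CQs are
 * destroyed here, while the PD is left for another owner to free.
 */
static void ulp_remove(struct tracker *t)
{
	res_del(t, RES_QP);
	res_del(t, RES_CQ);
	res_del(t, RES_CQ);
}

/*
 * Unregister-time check. Returns the number of objects still live
 * and prints one line per leaked type, so the report says which
 * kind of resource was not freed.
 */
static unsigned int tracker_clean(const struct tracker *t)
{
	unsigned int leaked = 0;

	for (int i = 0; i < RES_NTYPES; i++) {
		if (t->live[i]) {
			fprintf(stderr, "leak: %u %s object(s) still allocated\n",
				t->live[i], res_name[i]);
			leaked += t->live[i];
		}
	}
	return leaked;
}
```

In this model, registering one PD, two CQs and one QP and then
calling ulp_remove() leaves tracker_clean() reporting a single PD.
The report goes away only once some other owner calls
res_del(&t, RES_PD), which is the step that would have to happen
before the device is unregistered.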

I suspect it could be a driver bug.


>>>> Thanks,
>>>> Michal
>>>> 
>>>> GAD17990 login: [  300.480137] ib_srpt srpt_remove_one(qedr0): nothing to do.
>>>> [  300.515527] ib_srpt srpt_remove_one(qedr1): nothing to do.
>>>> [  300.542182] rpcrdma: removing device qedr1 for 192.168.110.146:20049
>>>> [  300.573789] WARNING: CPU: 12 PID: 3545 at drivers/infiniband/core/restrack.c:20 rdma_restrack_clean+0x25/0x30 [ib_core]
>>>> [  300.625985] Modules linked in: rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm 8021q garp mrp qedr(-) ib_core xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter ebtables fuse ip6table_filter ip6_tables iptable_filter dm_mirror dm_region_hash dm_log dm_mod vfat fat dax intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd ipmi_si
>>>> [  300.972993]  iTCO_wdt ipmi_devintf sg pcspkr iTCO_vendor_support hpwdt hpilo lpc_ich ipmi_msghandler pcc_cpufreq ioatdma i2c_i801 mfd_core wmi shpchp dca acpi_power_meter i2c_core nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod qede qed crc32c_intel tg3 hpsa scsi_transport_sas crc8
>>>> [  301.109036] CPU: 12 PID: 3545 Comm: rmmod Not tainted 4.16.0-rc1 #1
>>>> [  301.139518] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 02/17/2017
>>>> [  301.180411] RIP: 0010:rdma_restrack_clean+0x25/0x30 [ib_core]
>>>> [  301.208350] RSP: 0018:ffffb1820478fe88 EFLAGS: 00010286
>>>> [  301.233241] RAX: 0000000000000000 RBX: ffffa099ed1b4070 RCX: ffffdf02a193c800
>>>> [  301.268001] RDX: ffffa095ed12d7a0 RSI: 0000000000025900 RDI: ffffa099ed1b47d0
>>>> [  301.302530] RBP: ffffa099ed1b4070 R08: ffffa095de9dd000 R09: 0000000180080007
>>>> [  301.337245] R10: 0000000000000001 R11: ffffa095de9dd000 R12: ffffa099ed1b4000
>>>> [  301.372151] R13: ffffa099ed1b405c R14: 0000000000e231c0 R15: 0000000000e23010
>>>> [  301.407384] FS:  00007f2b0c854740(0000) GS:ffffa099ff700000(0000) knlGS:0000000000000000
>>>> [  301.447026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [  301.475409] CR2: 0000000000e2caf8 CR3: 0000000865c0d006 CR4: 00000000001606e0
>>>> [  301.510892] Call Trace:
>>>> [  301.522715]  ib_unregister_device+0xf5/0x190 [ib_core]
>>>> [  301.547966]  qedr_remove+0x37/0x60 [qedr]
>>>> [  301.568393]  qede_rdma_unregister_driver+0x4b/0x90 [qede]
>>>> [  301.594980]  SyS_delete_module+0x168/0x240
>>>> [  301.615057]  do_syscall_64+0x6f/0x1a0
>>>> [  301.633588]  entry_SYSCALL_64_after_hwframe+0x21/0x86
>>>> [  301.658657] RIP: 0033:0x7f2b0bd33707
>>>> [  301.676005] RSP: 002b:00007ffdefa29d98 EFLAGS: 00000202 ORIG_RAX: 00000000000000b0
>>>> [  301.713324] RAX: ffffffffffffffda RBX: 0000000000e231c0 RCX: 00007f2b0bd33707
>>>> [  301.748186] RDX: 00007f2b0bda3a80 RSI: 0000000000000800 RDI: 0000000000e23228
>>>> [  301.782960] RBP: 0000000000000000 R08: 00007f2b0bff8060 R09: 00007f2b0bda3a80
>>>> [  301.818142] R10: 00007ffdefa29b20 R11: 0000000000000202 R12: 00007ffdefa2b70d
>>>> [  301.853290] R13: 0000000000000000 R14: 0000000000e231c0 R15: 0000000000e23010
>>>> [  301.888138] Code: 84 00 00 00 00 00 0f 1f 44 00 00 48 83 c7 28 31 c0 eb 0c 48 83 c0 08 48 3d 00 08 00 00 74 0f 48 8d 14 07 48 8b 12 48 85 d2 74 e8 <0f> ff c3 f3 c3 66 0f 1f 44 00 00 0f 1f 44 00 00 53 48 8b 47 28
>>>> [  301.981140] ---[ end trace 28dec8f15205789a ]---
>>> 
>>> --
>>> Chuck Lever
>>> 
>>> 
>>> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever


