Re: [PATCH] svcauth_gss: Revert 64c59a3726f2 ("Remove unnecessary allocation")

Chuck Lever <chuck.lever@xxxxxxxxxx> · Tue, 6 Sep 2016 16:49:33 -0400

> On Sep 6, 2016, at 4:42 PM, J. Bruce Fields <bfields@xxxxxxxxxxxx> wrote:
> 
> On Thu, Sep 01, 2016 at 10:50:38AM -0400, Chuck Lever wrote:
>> On occasion I see this on my server:
>> 
>> kernel: kernel BUG at /home/cel/src/linux/linux-2.6/mm/slub.c:3851!
>> kernel: invalid opcode: 0000 [#1] SMP
>> kernel: Modules linked in: cts rpcsec_gss_krb5 sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd btrfs xor iTCO_wdt iTCO_vendor_support raid6_pq pcspkr i2c_i801 i2c_smbus lpc_ich mfd_core mei_me sg mei shpchp wmi ioatdma ipmi_si ipmi_msghandler acpi_pad acpi_power_meter rpcrdma ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm nfsd nfs_acl lockd grace auth_rpcgss sunrpc ip_tables xfs libcrc32c mlx4_ib mlx4_en ib_core sr_mod cdrom sd_mod ast drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel igb mlx4_core ahci libahci libata ptp pps_core dca i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod
>> kernel: CPU: 7 PID: 145 Comm: kworker/7:2 Not tainted 4.8.0-rc4-00006-g9d06b0b #15
>> kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
>> kernel: Workqueue: events do_cache_clean [sunrpc]
>> kernel: task: ffff8808541d8000 task.stack: ffff880854344000
>> kernel: RIP: 0010:[<ffffffff811e7075>]  [<ffffffff811e7075>] kfree+0x155/0x180
>> kernel: RSP: 0018:ffff880854347d70  EFLAGS: 00010246
>> kernel: RAX: ffffea0020fe7660 RBX: ffff88083f9db064 RCX: 146ff0f9d5ec5600
>> kernel: RDX: 000077ff80000000 RSI: ffff880853f01500 RDI: ffff88083f9db064
>> kernel: RBP: ffff880854347d88 R08: ffff8808594ee000 R09: ffff88087fdd8780
>> kernel: R10: 0000000000000000 R11: ffffea0020fe76c0 R12: ffff880853f01500
>> kernel: R13: ffffffffa013cf76 R14: ffffffffa013cff0 R15: ffffffffa04253a0
>> kernel: FS:  0000000000000000(0000) GS:ffff88087fdc0000(0000) knlGS:0000000000000000
>> kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> kernel: CR2: 00007fed60b020c3 CR3: 0000000001c06000 CR4: 00000000001406e0
>> kernel: Stack:
>> kernel: ffff8808589f2f00 ffff880853f01500 0000000000000001 ffff880854347da0
>> kernel: ffffffffa013cf76 ffff8808589f2f00 ffff880854347db8 ffffffffa013d006
>> kernel: ffff8808589f2f20 ffff880854347e00 ffffffffa0406f60 0000000057c7044f
>> kernel: Call Trace:
>> kernel: [<ffffffffa013cf76>] rsc_free+0x16/0x90 [auth_rpcgss]
>> kernel: [<ffffffffa013d006>] rsc_put+0x16/0x30 [auth_rpcgss]
>> kernel: [<ffffffffa0406f60>] cache_clean+0x2e0/0x300 [sunrpc]
>> kernel: [<ffffffffa04073ee>] do_cache_clean+0xe/0x70 [sunrpc]
>> kernel: [<ffffffff8109a70f>] process_one_work+0x1ff/0x3b0
>> kernel: [<ffffffff8109b15c>] worker_thread+0x2bc/0x4a0
>> kernel: [<ffffffff8109aea0>] ? rescuer_thread+0x3a0/0x3a0
>> kernel: [<ffffffff810a0ba4>] kthread+0xe4/0xf0
>> kernel: [<ffffffff8169c47f>] ret_from_fork+0x1f/0x40
>> kernel: [<ffffffff810a0ac0>] ? kthread_stop+0x110/0x110
>> kernel: Code: f7 ff ff eb 3b 65 8b 05 da 30 e2 7e 89 c0 48 0f a3 05 a0 38 b8 00 0f 92 c0 84 c0 0f 85 d1 fe ff ff 0f 1f 44 00 00 e9 f5 fe ff ff <0f> 0b 49 8b 03 31 f6 f6 c4 40 0f 85 62 ff ff ff e9 61 ff ff ff
>> kernel: RIP  [<ffffffff811e7075>] kfree+0x155/0x180
>> kernel: RSP <ffff880854347d70>
>> kernel: ---[ end trace 3fdec044969def26 ]---
>> 
>> It seems to be most common after a server reboot where a client has been
>> using a Kerberos mount, and reconnects to continue its workload.
>> 
>> Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx>
>> ---
>> Hi Bruce-
>> 
>> I've hit this off-and-on oops since 64c59a3726f2 was merged.
>> Reverting it seems to make the problem go away. I haven't dug
>> into why.
>> 
>> Thoughts?
> 
> Apologies, I wasn't thinking when I wrote that patch.  The problem is
> probably that rsc_lookup steals the passed-in memory to avoid doing an
> allocation of its own, so we can't just pass in a pointer to memory that
> someone else is using....
> 
> If we really want to avoid allocation there then maybe we should
> preallocate somwhere, or reference count these handles.
> 
> For now reverting sounds like the right thing to do.

NP, thanks for confirming!

> Ben, did you ever confirm whether this helped with the problem you were
> seeing?  (If I remember correctly, unnpredictable delays here could
> cause the request to be dropped if later requests push the rpcsec_gss
> sequence window too far.)  If so then we could look into reference
> counting.

Well that's interesting.

When a request is dropped, would the server disconnect? Because if it
doesn't, the client will wait forever.

> --b.
> 
>> 
>> 
>> net/sunrpc/auth_gss/svcauth_gss.c |    5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>> 
>> diff --git a/net/sunrpc/auth_gss/svcauth_gss.c b/net/sunrpc/auth_gss/svcauth_gss.c
>> index 1d28181..d858202 100644
>> --- a/net/sunrpc/auth_gss/svcauth_gss.c
>> +++ b/net/sunrpc/auth_gss/svcauth_gss.c
>> @@ -569,9 +569,10 @@ gss_svc_searchbyctx(struct cache_detail *cd, struct xdr_netobj *handle)
>> 	struct rsc *found;
>> 
>> 	memset(&rsci, 0, sizeof(rsci));
>> -	rsci.handle.data = handle->data;
>> -	rsci.handle.len = handle->len;
>> +	if (dup_to_netobj(&rsci.handle, handle->data, handle->len))
>> +		return NULL;
>> 	found = rsc_lookup(cd, &rsci);
>> +	rsc_free(&rsci);
>> 	if (!found)
>> 		return NULL;
>> 	if (cache_check(cd, &found->h, NULL))

--
Chuck Lever

--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html