> On Jan 18, 2023, at 12:18 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Wed, 2023-01-18 at 16:48 +0000, Shachar Kagan wrote:
>> On Wed, 2023-01-18 at 18:45 +0000, Chuck Lever III wrote:
>>
>>> On Tue, 2023-01-17 at 15:22 +0000, Chuck Lever III wrote:
>>>>
>>>>> On Jan 17, 2023, at 10:16 AM, Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>>>>>
>>>>> On Tue, Nov 01, 2022 at 10:46:45AM -0400, Jeff Layton wrote:
>>>>>> The filecache refcounting is a bit non-standard for something
>>>>>> searchable by RCU, in that we maintain a sentinel reference while
>>>>>> it's hashed. This in turn requires that we have to do things differently in the "put"
>>>>>> depending on whether it's hashed, which we believe to have led to races.
>>>>>>
>>>>>> There are other problems in here too. nfsd_file_close_inode_sync
>>>>>> can end up freeing an nfsd_file while there are still outstanding
>>>>>> references to it, and there are a number of subtle ToC/ToU races.
>>>>>>
>>>>>> Rework the code so that the refcount is what drives the lifecycle.
>>>>>> When the refcount goes to zero, then unhash and rcu free the object.
>>>>>>
>>>>>> With this change, the LRU carries a reference. Take special care
>>>>>> to deal with it when removing an entry from the list.
>>>>>>
>>>>>> Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx>
>>>>>
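
For anyone trying to picture the model described in the patch above: the
put path under this scheme boils down to roughly the sketch below. This is
only an illustration of the "refcount drives the lifecycle" idea, not the
actual fs/nfsd/filecache.c code; the struct, lock, and helper names here
are invented.

#include <linux/list.h>
#include <linux/rculist.h>
#include <linux/rcupdate.h>
#include <linux/refcount.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* Protects the hash chain and the LRU list in this sketch. */
static DEFINE_SPINLOCK(cache_lock);

struct cache_item {
        struct hlist_node hash;         /* RCU-searchable hash chain */
        struct list_head lru;
        refcount_t ref;
        struct rcu_head rcu;
};

/* RCU lookups must take a reference before using an item. */
static bool cache_item_get(struct cache_item *item)
{
        /* Fails once the final reference has already been put. */
        return refcount_inc_not_zero(&item->ref);
}

static void cache_item_put(struct cache_item *item)
{
        if (!refcount_dec_and_test(&item->ref))
                return;

        /*
         * Final reference: only now is the item unhashed, so a
         * concurrent RCU lookup either still finds it and fails
         * cache_item_get(), or misses it entirely. The memory is
         * freed only after an RCU grace period has elapsed.
         */
        spin_lock(&cache_lock);
        hlist_del_init_rcu(&item->hash);
        spin_unlock(&cache_lock);
        kfree_rcu(item, rcu);
}

/* The LRU holds its own reference; removing an entry drops it. */
static void cache_item_lru_remove(struct cache_item *item)
{
        spin_lock(&cache_lock);
        list_del_init(&item->lru);
        spin_unlock(&cache_lock);
        cache_item_put(item);
}

The point of the scheme, as described, is that dropping the final
reference is the only place an entry is unhashed and freed, so there is
no separate put path for hashed vs. unhashed entries left to race with.
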
>>>>> Our test team is getting crashes that bisection pointed at this
>>>>> patch. It seems like there are multiple parallel crash reports so
>>>>> the whole thing is a mess to read:
>>>>>
>>>>> [ 875.548965] BUG: kernel NULL pointer dereference, address:
>>>>> 00000000000000d0 [ 875.548968] ------------[ cut here ]------------
>>>>> [ 875.548972] refcount_t: underflow; use-after-free.
>>>>> [ 875.548992] WARNING: CPU: 4 PID: 12145 at lib/refcount.c:28
>>>>> refcount_warn_saturate+0xd8/0xe0 [ 875.549851] #PF: supervisor read
>>>>> access in kernel mode [ 875.550158] Modules linked in:
>>>>> [ 875.550752] #PF: error_code(0x0000) - not-present page [
>>>>> 875.551269] nfsd [ 875.551878] PGD 0 [ 875.552069] iptable_raw [
>>>>> 875.552677] P4D 0 [ 875.552824] bonding mlx5_vfio_pci [
>>>>> 875.553095] [ 875.553255] rdma_ucm ipip [ 875.553525] Oops: 0000
>>>>> [#1] SMP [ 875.553733] tunnel4 [ 875.553941] CPU: 0 PID: 12147
>>>>> Comm: nfsd Not tainted 6.1.0-rc7_ac3a2585f018 #1 [ 875.554109]
>>>>> ip_gre ib_umad [ 875.554517] Hardware name: QEMU Standard PC (Q35 +
>>>>> ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org
>>>>> 04/01/2014 [ 875.554656] nf_tables vfio_pci [ 875.555508] RIP:
>>>>> 0010:vfs_setlease+0x27/0x70 [ 875.555695] vfio_pci_core
>>>>> vfio_virqfd [ 875.557015] Code: ff ff 90 0f 1f 44 00 00 41 54 49 89
>>>>> d4 55 48 89 fd 48 83 ec 10 48 85 d2 74 06 48 83 fe 02 75 1f 48 8b 45
>>>>> 28 4c 89 e2 48 89 ef <48> 8b 80 d0 00 00 00 48 85 c0 74 2c 48 83 c4
>>>>> 10 5d 41 5c ff e0 48 [ 875.557209] vfio_iommu_type1 [ 875.557406]
>>>>> RSP: 0018:ffff88810378bdb0 EFLAGS: 00010246 [ 875.557634] mlx5_ib
>>>>> [ 875.558446] [ 875.558628] vfio [ 875.558862] RAX:
>>>>> 0000000000000000 RBX: ffff88824866c000 RCX: ffff88810378bdd8 [
>>>>> 875.559006] ib_uverbs [ 875.559092] RDX: 0000000000000000 RSI:
>>>>> 0000000000000002 RDI: ffff88812560a200 [ 875.559218] ib_ipoib [
>>>>> 875.559557] RBP: ffff88812560a200 R08: ffff8881da5ecf00 R09:
>>>>> ffffffff824064e0 [ 875.559704] mlx5_core [ 875.560021] R10:
>>>>> 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [
>>>>> 875.560165] ip6_gre [ 875.560488] R13: ffff8881da5ecf00 R14:
>>>>> ffff888110e62028 R15: ffff888110e621a0 [ 875.560634] gre [
>>>>> 875.560959] FS: 0000000000000000(0000) GS:ffff88852c800000(0000)
>>>>> knlGS:0000000000000000 [ 875.561108] ip6_tunnel [ 875.561432] CS:
>>>>> 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 875.561554] tunnel6
>>>>> [ 875.561928] CR2: 00000000000000d0 CR3: 00000001ca27d001 CR4:
>>>>> 0000000000372eb0 [ 875.562084] geneve [ 875.562349] DR0:
>>>>> 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [
>>>>> 875.562493] nfnetlink_cttimeout [ 875.562822] DR3:
>>>>> 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [
>>>>> 875.562962] openvswitch [ 875.563292] Call Trace:
>>>>> [ 875.563298] <TASK>
>>>>> [ 875.563503] nsh
>>>>> [ 875.563839] destroy_unhashed_deleg+0x58/0xc0 [nfsd]
>>>>
>>>> We are aware of this failure mode. Actually, this started well before
>>>> that particular commit.
>>>>
>>>> Our problem has been that no one has been able to provide a reliable
>>>> reproducer, so we can't figure out why it's happening. If you have a
>>>> way to reproduce this failure reliably, can you capture a vmcore or
>>>> enable KASAN and get a little more information?
>>>>
>>>
>>> It's possible that this crash may be related to the problem that was fixed here:
>>>
>>> commit 0b3a551fa58b4da941efeb209b3770868e2eddd7
>>> Author: Jeff Layton <jlayton@xxxxxxxxxx>
>>> Date: Thu Jan 5 14:55:56 2023 -0500
>>>
>>> nfsd: fix handling of cached open files in nfsd4_open codepath
>>>
>>> Unfortunately, that hasn't trickled into v6.1 kernels so far.
>>
>> This commit is in my working tree, but it doesn't fix the issue; I still
>> hit the crash. We are working with v6.2-rc3.
>
> Thanks for testing it. That patch fixes a real bug, just not the one
> you're hitting, apparently.
>
> If you're comfortable working with bleeding edge kernels, you may just
> want to pull in Chuck's for-rc and for-next branches.

Stephen and I renamed these yesterday to nfsd-fixes and nfsd-next,
respectively. nfsd-fixes was pulled into v6.2-rc yesterday, FYI.

> Those have a few
> other patches that I wouldn't expect to change this, but might still be
> worth testing to see.
>
> If it's happening regularly, you could also try disabling leases on the
> machine, at the expense of some performance. We suspect this is related
> to delegation handling, but we just haven't been able to nail it down
> yet. If you do that, and it seems to fix it for you, let us know as that
> would be an interesting datapoint.
>
> Thanks!
> --
> Jeff Layton <jlayton@xxxxxxxxxx>
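
To make the lease suggestion above concrete: on the NFS server that is
"sysctl -w fs.leases-enable=0" (equivalently, write 0 to
/proc/sys/fs/leases-enable); setting it back to 1 re-enables them. NFSv4
delegations are built on VFS leases, so with leases disabled nfsd stops
handing out new delegations. As far as I know the setting affects only
leases requested after the change, so it is best applied before the test
run starts.

--
Chuck Lever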