Re: Leaked nfsd_file due to race condition and early unhash (fs/nfsd/filecache.c)

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Wed, 3 Jul 2024 20:57:17 +0000

> On Jul 3, 2024, at 4:46 PM, Youzhong Yang <youzhong@xxxxxxxxx> wrote:
> 
> Thank you Chuck. Here are my quick answers to your comments:
> 
> - I don't have a quick reproducer. I reproduced it by using hundreds
> of nfs clients generating +600K ops under our workload in the testing
> environment. Theoretically it should be possible to simplify the
> reproduction but I am still working on it.
> 
> -  I understand zfs is an out-of-tree file system. That's fine. But
> this leaking can happen to any file system, and leaking is not a good
> thing no matter what file system it is.

No doubt, but we need to establish that this is not an issue
in the Linux port of zfs.

> -  I will try to come up with a reproducer using xfs or btrfs if possible.

Thanks. A reliable reproducer will demonstrate very clearly that
the bug is addressed when the (final) fix has been applied.

> Now back to the problem itself, here are my findings:
> 
> - nfsd_file_put() in one thread can race with another thread doing
> garbage collection (running nfsd_file_gc() -> list_lru_walk() ->
> nfsd_file_lru_cb()):
> 
>  * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add().
>  * nfsd_file_lru_add() returns true (thus NFSD_FILE_REFERENCED bit
> set for nf->nf_flags)
>  * garbage collector kicks in, nfsd_file_lru_cb() clears REFERENCED
> bit and returns LRU_ROTATE.
>  * garbage collector kicks in again, nfsd_file_lru_cb() now
> decrements nf->nf_ref to 0, runs nfsd_file_unhash(), removes it from
> the LRU and adds to the dispose list [list_lru_isolate_move(lru,
> &nf->nf_lru, head);]
>  * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it
> tries to remove the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]
>  * The 'nf' has been added to the 'dispose' list by
> nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply treats it as
> part of the LRU and removes it, which leads it to be removed from the
> 'dispose' list.
>  * At this moment, nf->nf_ref is 0, it's unhashed, and not on the
> LRU. nfsd_file_put() continues its execution [if
> (refcount_dec_and_test(&nf->nf_ref))], as nf->nf_ref is already 0, now
> bad thing happens: nf->nf_ref is set to REFCOUNT_SATURATED, and the
> 'nf' is leaked.
> 
> To make this happen, the right timing is crucial. It can be reproduced
> by adding artifical delays in filecache.c, or hammering the nfsd with
> tons of ops.
> 
> - Let's see how nfsd_file_put() can race with nfsd_file_cond_queue():
>  * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add().
>  * nfsd_file_lru_add() sets REFERENCED bit and returns true.
>  * 'exportfs -f' or something like that triggers
> __nfsd_file_cache_purge() -> nfsd_file_cond_queue().
>  * In nfsd_file_cond_queue(), it runs [if (!nfsd_file_unhash(nf))],
> unhash is done successfully.
>  * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now
> nf->nf_ref goes to 2.
>  * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], it succeeds.
>  * nfsd_file_cond_queue() runs [if (refcount_sub_and_test(decrement,
> &nf->nf_ref))] (with "decrement" being 2), so the nf->nf_ref goes to
> 0, the 'nf' is added to the dispost list [list_add(&nf->nf_lru,
> dispose)]
>  * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it
> tries to remove the 'nf' from the LRU [if
> (!nfsd_file_lru_remove(nf))], although the 'nf' is not in the LRU, but
> it is linked in the 'dispose' list, nfsd_file_lru_remove() simply
> treats it as part of the LRU and removes it. This leads to its removal
> from the 'dispose' list!
>  * Now nf->ref is 0, unhashed. nfsd_file_put() continues its
> execution and sets nf->nf_ref to REFCOUNT_SATURATED.
> 
> The purpose of nf->nf_lru is problematic. As you can see, it is used
> for the LRU list, and also the 'dispose' list. Adding another 'struct
> list_head' specifically for the 'dispose' list seems to be a better
> way of fixing this race condition. Either way works for me.

Generally this double-use of nf_lru has been determined to be
safe. If there is a bug in the list manipulation logic, it will
still be there if there are two list_head fields; it will simply
be masked.

> Would you agree my above analysis makes sense? Thanks.

Jeff is the expert in this area. I will also study it.

> Here is my patch with signed-off-by:
> 
> From: Youzhong Yang <youzhong@xxxxxxxxx>
> Date: Mon, 1 Jul 2024 06:45:22 -0400
> Subject: [PATCH] nfsd: fix nfsd_file leaking due to race condition and early
> unhash

Much better, but you also need a patch description. Two
paragraphs that explain the bug and the proposed solution
should be adequate.

> Signed-off-by: Youzhong Yang <youzhong@xxxxxxxxx>
> ---
> fs/nfsd/filecache.c | 14 +++++++++++++-
> 1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 1a6d5d000b85..2323829f7208 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf)
>                        if (!nfsd_file_lru_remove(nf))
>                                return;
>                }
> +               /*
> +                * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(),
> +                * it's unhashed but then removed from the dispose list,
> +                * so we need to free it.
> +                */
> +               if (refcount_read(&nf->nf_ref) == 0 &&
> +                   !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) &&
> +                   list_empty(&nf->nf_lru)) {
> +                       nfsd_file_free(nf);
> +                       return;
> +               }
>        }
>        if (refcount_dec_and_test(&nf->nf_ref))
>                nfsd_file_free(nf);
> @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct
> list_head *dispose)
>        int decrement = 1;
> 
>        /* If we raced with someone else unhashing, ignore it */
> -       if (!nfsd_file_unhash(nf))
> +       if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags))
>                return;
> 
>        /* If we can't get a reference, ignore it */
> @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct
> list_head *dispose)
>        /* If refcount goes to 0, then put on the dispose list */
>        if (refcount_sub_and_test(decrement, &nf->nf_ref)) {
>                list_add(&nf->nf_lru, dispose);
> +               nfsd_file_unhash(nf);
>                trace_nfsd_file_closing(nf);
>        }
> }
> --
> 2.43.0
> 
> On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>> 
>> On Wed, Jul 03, 2024 at 10:12:33AM -0400, Youzhong Yang wrote:
>>> Hello,
>>> 
>>> I'd like to report a nfsd_file leaking issue and propose a fix for it.
>>> 
>>> When I tested Linux kernel 6.8 and 6.6, I noticed nfsd_file leaks
>>> which led to undestroyable file systems (zfs),
>> 
>> Thanks for the report. Some initial comments:
>> 
>> - Do you have a specific reproducer? In other words, what is the
>>  simplest program that can run on an NFS client that will trigger
>>  this leak, and can you post it?
>> 
>> - "zfs" is an out-of-tree file system, so it's not directly
>>  supported for NFSD.
>> 
>> - The guidelines for patch submission require us to fix issues in
>>  upstream Linux first (currently that's v6.10-rc6). Then that fix
>>  can be backported to older stable kernels like 6.6.
>> 
>> Can you reproduce the leak with one of the in-kernel filesystems
>> (either xfs or btrfs would be great) and with NFSD in 6.10-rc6?
>> 
>> One more comment below.
>> 
>> 
>>> here are some examples:
>>> 
>>> crash> struct nfsd_file -x ffff88e160db0460
>>> struct nfsd_file {
>>>  nf_rlist = {
>>>    rhead = {
>>>      next = 0xffff8921fa2392f1
>>>    },
>>>    next = 0x0
>>>  },
>>>  nf_inode = 0xffff8882bc312ef8,
>>>  nf_file = 0xffff88e2015b1500,
>>>  nf_cred = 0xffff88e3ab0e7800,
>>>  nf_net = 0xffffffff83d41600 <init_net>,
>>>  nf_flags = 0x8,
>>>  nf_ref = {
>>>    refs = {
>>>      counter = 0xc0000000
>>>    }
>>>  },
>>>  nf_may = 0x4,
>>>  nf_mark = 0xffff88e1bddfb320,
>>>  nf_lru = {
>>>    next = 0xffff88e160db04a8,
>>>    prev = 0xffff88e160db04a8
>>>  },
>>>  nf_rcu = {
>>>    next = 0x10000000000,
>>>    func = 0x0
>>>  },
>>>  nf_birthtime = 0x73d22fc1728
>>> }
>>> 
>>> crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x
>>> ffff88839a53d850
>>>  nf_flags = 0x8,
>>>  nf_ref.refs.counter = 0x0
>>>  nf_lru = {
>>>    next = 0xffff88839a53d898,
>>>    prev = 0xffff88839a53d898
>>>  },
>>>  nf_file = 0xffff88810ede8700,
>>> 
>>> crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x
>>> ffff88c32b11e850
>>>  nf_flags = 0x8,
>>>  nf_ref.refs.counter = 0x0
>>>  nf_lru = {
>>>    next = 0xffff88c32b11e898,
>>>    prev = 0xffff88c32b11e898
>>>  },
>>>  nf_file = 0xffff88c20a701c00,
>>> 
>>> crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x
>>> ffff88e372709700
>>>  nf_flags = 0xc,
>>>  nf_ref.refs.counter = 0x0
>>>  nf_lru = {
>>>    next = 0xffff88e372709748,
>>>    prev = 0xffff88e372709748
>>>  },
>>>  nf_file = 0xffff88e0725e6400,
>>> 
>>> crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x
>>> ffff8982864944d0
>>>  nf_flags = 0xc,
>>>  nf_ref.refs.counter = 0x0
>>>  nf_lru = {
>>>    next = 0xffff898286494518,
>>>    prev = 0xffff898286494518
>>>  },
>>>  nf_file = 0xffff89803c0ff700,
>>> 
>>> The leak occurs when nfsd_file_put() races with nfsd_file_cond_queue()
>>> or nfsd_file_lru_cb(). With the following patch, I haven't observed
>>> any leak after a few days heavy nfs load:
>> 
>> Our patch submission guidelines require a Signed-off-by:
>> line at the end of the patch description. See the "Sign your work -
>> the Developer's Certificate of Origin" section of
>> 
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6
>> 
>> (Needed here in case your fix is acceptable).
>> 
>> 
>>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>>> index 1a6d5d000b85..2323829f7208 100644
>>> --- a/fs/nfsd/filecache.c
>>> +++ b/fs/nfsd/filecache.c
>>> @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf)
>>>  if (!nfsd_file_lru_remove(nf))
>>>  return;
>>>  }
>>> + /*
>>> + * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(),
>>> + * it's unhashed but then removed from the dispose list,
>>> + * so we need to free it.
>>> + */
>>> + if (refcount_read(&nf->nf_ref) == 0 &&
>>> +     !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) &&
>>> +     list_empty(&nf->nf_lru)) {
>>> + nfsd_file_free(nf);
>>> + return;
>>> + }
>>>  }
>>>  if (refcount_dec_and_test(&nf->nf_ref))
>>>  nfsd_file_free(nf);
>>> @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct
>>> list_head *dispose)
>>>  int decrement = 1;
>>> 
>>>  /* If we raced with someone else unhashing, ignore it */
>>> - if (!nfsd_file_unhash(nf))
>>> + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags))
>>>  return;
>>> 
>>>  /* If we can't get a reference, ignore it */
>>> @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct
>>> list_head *dispose)
>>>  /* If refcount goes to 0, then put on the dispose list */
>>>  if (refcount_sub_and_test(decrement, &nf->nf_ref)) {
>>>  list_add(&nf->nf_lru, dispose);
>>> + nfsd_file_unhash(nf);
>>>  trace_nfsd_file_closing(nf);
>>>  }
>>> }
>>> 
>>> Please kindly review the patch and let me know if it makes sense.
>>> 
>>> Thanks,
>>> 
>>> -Youzhong
>>> 
>> 
>> --
>> Chuck Lever

--
Chuck Lever