Re: Leaked nfsd_file due to race condition and early unhash (fs/nfsd/filecache.c)

Youzhong Yang <youzhong@xxxxxxxxx> · Wed, 3 Jul 2024 16:46:45 -0400

Thank you Chuck. Here are my quick answers to your comments:

- I don't have a quick reproducer. I reproduced it by using hundreds
of nfs clients generating +600K ops under our workload in the testing
environment. Theoretically it should be possible to simplify the
reproduction but I am still working on it.

-  I understand zfs is an out-of-tree file system. That's fine. But
this leaking can happen to any file system, and leaking is not a good
thing no matter what file system it is.

-  I will try to come up with a reproducer using xfs or btrfs if possible.

Now back to the problem itself, here are my findings:

- nfsd_file_put() in one thread can race with another thread doing
garbage collection (running nfsd_file_gc() -> list_lru_walk() ->
nfsd_file_lru_cb()):

  * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add().
  * nfsd_file_lru_add() returns true (thus NFSD_FILE_REFERENCED bit
set for nf->nf_flags)
  * garbage collector kicks in, nfsd_file_lru_cb() clears REFERENCED
bit and returns LRU_ROTATE.
  * garbage collector kicks in again, nfsd_file_lru_cb() now
decrements nf->nf_ref to 0, runs nfsd_file_unhash(), removes it from
the LRU and adds to the dispose list [list_lru_isolate_move(lru,
&nf->nf_lru, head);]
  * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it
tries to remove the 'nf' from the LRU [if (!nfsd_file_lru_remove(nf))]
  * The 'nf' has been added to the 'dispose' list by
nfsd_file_lru_cb(), so nfsd_file_lru_remove(nf) simply treats it as
part of the LRU and removes it, which leads it to be removed from the
'dispose' list.
  * At this moment, nf->nf_ref is 0, it's unhashed, and not on the
LRU. nfsd_file_put() continues its execution [if
(refcount_dec_and_test(&nf->nf_ref))], as nf->nf_ref is already 0, now
bad thing happens: nf->nf_ref is set to REFCOUNT_SATURATED, and the
'nf' is leaked.

To make this happen, the right timing is crucial. It can be reproduced
by adding artifical delays in filecache.c, or hammering the nfsd with
tons of ops.

- Let's see how nfsd_file_put() can race with nfsd_file_cond_queue():
  * In nfsd_file_put(), nf->nf_ref is 1, so it tries to do nfsd_file_lru_add().
  * nfsd_file_lru_add() sets REFERENCED bit and returns true.
  * 'exportfs -f' or something like that triggers
__nfsd_file_cache_purge() -> nfsd_file_cond_queue().
  * In nfsd_file_cond_queue(), it runs [if (!nfsd_file_unhash(nf))],
unhash is done successfully.
  * nfsd_file_cond_queue() runs [if (!nfsd_file_get(nf))], now
nf->nf_ref goes to 2.
  * nfsd_file_cond_queue() runs [if (nfsd_file_lru_remove(nf))], it succeeds.
  * nfsd_file_cond_queue() runs [if (refcount_sub_and_test(decrement,
&nf->nf_ref))] (with "decrement" being 2), so the nf->nf_ref goes to
0, the 'nf' is added to the dispost list [list_add(&nf->nf_lru,
dispose)]
  * nfsd_file_put() detects NFSD_FILE_HASHED bit is cleared, so it
tries to remove the 'nf' from the LRU [if
(!nfsd_file_lru_remove(nf))], although the 'nf' is not in the LRU, but
it is linked in the 'dispose' list, nfsd_file_lru_remove() simply
treats it as part of the LRU and removes it. This leads to its removal
from the 'dispose' list!
  * Now nf->ref is 0, unhashed. nfsd_file_put() continues its
execution and sets nf->nf_ref to REFCOUNT_SATURATED.

The purpose of nf->nf_lru is problematic. As you can see, it is used
for the LRU list, and also the 'dispose' list. Adding another 'struct
list_head' specifically for the 'dispose' list seems to be a better
way of fixing this race condition. Either way works for me.

Would you agree my above analysis makes sense? Thanks.

Here is my patch with signed-off-by:

From: Youzhong Yang <youzhong@xxxxxxxxx>
Date: Mon, 1 Jul 2024 06:45:22 -0400
Subject: [PATCH] nfsd: fix nfsd_file leaking due to race condition and early
 unhash

Signed-off-by: Youzhong Yang <youzhong@xxxxxxxxx>
---
 fs/nfsd/filecache.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 1a6d5d000b85..2323829f7208 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf)
                        if (!nfsd_file_lru_remove(nf))
                                return;
                }
+               /*
+                * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(),
+                * it's unhashed but then removed from the dispose list,
+                * so we need to free it.
+                */
+               if (refcount_read(&nf->nf_ref) == 0 &&
+                   !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) &&
+                   list_empty(&nf->nf_lru)) {
+                       nfsd_file_free(nf);
+                       return;
+               }
        }
        if (refcount_dec_and_test(&nf->nf_ref))
                nfsd_file_free(nf);
@@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct
list_head *dispose)
        int decrement = 1;

        /* If we raced with someone else unhashing, ignore it */
-       if (!nfsd_file_unhash(nf))
+       if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags))
                return;

        /* If we can't get a reference, ignore it */
@@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct
list_head *dispose)
        /* If refcount goes to 0, then put on the dispose list */
        if (refcount_sub_and_test(decrement, &nf->nf_ref)) {
                list_add(&nf->nf_lru, dispose);
+               nfsd_file_unhash(nf);
                trace_nfsd_file_closing(nf);
        }
 }
--
2.43.0

On Wed, Jul 3, 2024 at 2:21 PM Chuck Lever <chuck.lever@xxxxxxxxxx> wrote:
>
> On Wed, Jul 03, 2024 at 10:12:33AM -0400, Youzhong Yang wrote:
> > Hello,
> >
> > I'd like to report a nfsd_file leaking issue and propose a fix for it.
> >
> > When I tested Linux kernel 6.8 and 6.6, I noticed nfsd_file leaks
> > which led to undestroyable file systems (zfs),
>
> Thanks for the report. Some initial comments:
>
> - Do you have a specific reproducer? In other words, what is the
>   simplest program that can run on an NFS client that will trigger
>   this leak, and can you post it?
>
> - "zfs" is an out-of-tree file system, so it's not directly
>   supported for NFSD.
>
> - The guidelines for patch submission require us to fix issues in
>   upstream Linux first (currently that's v6.10-rc6). Then that fix
>   can be backported to older stable kernels like 6.6.
>
> Can you reproduce the leak with one of the in-kernel filesystems
> (either xfs or btrfs would be great) and with NFSD in 6.10-rc6?
>
> One more comment below.
>
>
> > here are some examples:
> >
> > crash> struct nfsd_file -x ffff88e160db0460
> > struct nfsd_file {
> >   nf_rlist = {
> >     rhead = {
> >       next = 0xffff8921fa2392f1
> >     },
> >     next = 0x0
> >   },
> >   nf_inode = 0xffff8882bc312ef8,
> >   nf_file = 0xffff88e2015b1500,
> >   nf_cred = 0xffff88e3ab0e7800,
> >   nf_net = 0xffffffff83d41600 <init_net>,
> >   nf_flags = 0x8,
> >   nf_ref = {
> >     refs = {
> >       counter = 0xc0000000
> >     }
> >   },
> >   nf_may = 0x4,
> >   nf_mark = 0xffff88e1bddfb320,
> >   nf_lru = {
> >     next = 0xffff88e160db04a8,
> >     prev = 0xffff88e160db04a8
> >   },
> >   nf_rcu = {
> >     next = 0x10000000000,
> >     func = 0x0
> >   },
> >   nf_birthtime = 0x73d22fc1728
> > }
> >
> > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x
> > ffff88839a53d850
> >   nf_flags = 0x8,
> >   nf_ref.refs.counter = 0x0
> >   nf_lru = {
> >     next = 0xffff88839a53d898,
> >     prev = 0xffff88839a53d898
> >   },
> >   nf_file = 0xffff88810ede8700,
> >
> > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x
> > ffff88c32b11e850
> >   nf_flags = 0x8,
> >   nf_ref.refs.counter = 0x0
> >   nf_lru = {
> >     next = 0xffff88c32b11e898,
> >     prev = 0xffff88c32b11e898
> >   },
> >   nf_file = 0xffff88c20a701c00,
> >
> > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x
> > ffff88e372709700
> >   nf_flags = 0xc,
> >   nf_ref.refs.counter = 0x0
> >   nf_lru = {
> >     next = 0xffff88e372709748,
> >     prev = 0xffff88e372709748
> >   },
> >   nf_file = 0xffff88e0725e6400,
> >
> > crash> struct nfsd_file.nf_flags,nf_ref.refs.counter,nf_lru,nf_file -x
> > ffff8982864944d0
> >   nf_flags = 0xc,
> >   nf_ref.refs.counter = 0x0
> >   nf_lru = {
> >     next = 0xffff898286494518,
> >     prev = 0xffff898286494518
> >   },
> >   nf_file = 0xffff89803c0ff700,
> >
> > The leak occurs when nfsd_file_put() races with nfsd_file_cond_queue()
> > or nfsd_file_lru_cb(). With the following patch, I haven't observed
> > any leak after a few days heavy nfs load:
>
> Our patch submission guidelines require a Signed-off-by:
> line at the end of the patch description. See the "Sign your work -
> the Developer's Certificate of Origin" section of
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.10-rc6
>
> (Needed here in case your fix is acceptable).
>
>
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index 1a6d5d000b85..2323829f7208 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -389,6 +389,17 @@ nfsd_file_put(struct nfsd_file *nf)
> >   if (!nfsd_file_lru_remove(nf))
> >   return;
> >   }
> > + /*
> > + * Racing with nfsd_file_cond_queue() or nfsd_file_lru_cb(),
> > + * it's unhashed but then removed from the dispose list,
> > + * so we need to free it.
> > + */
> > + if (refcount_read(&nf->nf_ref) == 0 &&
> > +     !test_bit(NFSD_FILE_HASHED, &nf->nf_flags) &&
> > +     list_empty(&nf->nf_lru)) {
> > + nfsd_file_free(nf);
> > + return;
> > + }
> >   }
> >   if (refcount_dec_and_test(&nf->nf_ref))
> >   nfsd_file_free(nf);
> > @@ -576,7 +587,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct
> > list_head *dispose)
> >   int decrement = 1;
> >
> >   /* If we raced with someone else unhashing, ignore it */
> > - if (!nfsd_file_unhash(nf))
> > + if (!test_bit(NFSD_FILE_HASHED, &nf->nf_flags))
> >   return;
> >
> >   /* If we can't get a reference, ignore it */
> > @@ -590,6 +601,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct
> > list_head *dispose)
> >   /* If refcount goes to 0, then put on the dispose list */
> >   if (refcount_sub_and_test(decrement, &nf->nf_ref)) {
> >   list_add(&nf->nf_lru, dispose);
> > + nfsd_file_unhash(nf);
> >   trace_nfsd_file_closing(nf);
> >   }
> >  }
> >
> > Please kindly review the patch and let me know if it makes sense.
> >
> > Thanks,
> >
> > -Youzhong
> >
>
> --
> Chuck Lever