Re: [PATCH 2/2] nfsd: Don't leave work of closing files to a work queue.

Chuck Lever <chuck.lever@xxxxxxxxxx> · Mon, 4 Dec 2023 18:48:12 -0500

On Tue, Dec 05, 2023 at 09:21:08AM +1100, NeilBrown wrote:
> On Tue, 05 Dec 2023, Chuck Lever wrote:
> > On Mon, Dec 04, 2023 at 12:36:42PM +1100, NeilBrown wrote:
> > > The work of closing a file can have non-trivial cost.  Doing it in a
> > > separate work queue thread means that cost isn't imposed on the nfsd
> > > threads and an imbalance can be created.
> > > 
> > > > I have evidence from a customer site where nfsd is being asked to modify
> > > > many millions of files, which causes sufficient memory pressure that some
> > > cache (in XFS I think) gets cleaned earlier than would be ideal.  When
> > > __dput (from the workqueue) calls __dentry_kill, xfs_fs_destroy_inode()
> > > needs to synchronously read back previously cached info from storage.
> > > This slows down the single thread that is making all the final __dput()
> > > calls for all the nfsd threads with the net result that files are added
> > > to the delayed_fput_list faster than they are removed, and the system
> > > eventually runs out of memory.
> > > 
> > > To avoid this work imbalance that exhausts memory, this patch moves all
> > > work for closing files into the nfsd threads.  This means that when the
> > > work imposes a cost, that cost appears where it would be expected - in
> > > the work of the nfsd thread.
> > 
> > Thanks for pursuing this next step in the evolution of the NFSD
> > file cache.
> > 
> > Your problem statement should mention whether you have observed the
> > issue with an NFSv3 or an NFSv4 workload or if you see this issue
> > with both, since those two use cases are handled very differently
> > within the file cache implementation.
> 
> I have added:
> 
> =============
> The customer was using NFSv4.  I can demonstrate the same problem using
> NFSv3 or NFSv4 (which close files in different ways) by adding
> msleep(25) for FMODE_WRITE files in __fput().  This simulates
> slowness in the final close and when writing through nfsd it causes
> /proc/sys/fs/file-nr to grow without bound.
> ==============
> 
> > 
> > 
> > > There are two changes to achieve this.
> > > 
> > > 1/ PF_RUNS_TASK_WORK is set and task_work_run() is called, so that the
> > >    final __dput() for any file closed by a given nfsd thread is handled
> > >    by that thread.  This ensures that the number of files that are
> > >    queued for a final close is limited by the number of threads and
> > >    cannot grow without bound.
> > > 
> > > 2/ Files opened for NFSv3 are never explicitly closed by the client and are
> > >   kept open by the server in the "filecache", which responds to memory
> > >   pressure, is garbage collected even when there is no pressure, and
> > >   sometimes closes files when there is particular need such as for
> > >   rename.
> > 
> > There is a good reason for close-on-rename: IIRC we want to avoid
> > triggering a silly-rename on NFS re-exports.
> > 
> > Also, I think we do want to close cached garbage-collected files
> > quickly, even without memory pressure. Files left open in this way
> > can conflict with subsequent NFSv4 OPENs that might hand out a
> > delegation as long as no other clients are using them. Files held
> > open by the file cache will interfere with that.
> 
> Yes - I agree all this behaviour is appropriate.  I was just setting out
> the current behaviour of the filecache so that the effect of the proposed
> changes would be easier to understand.

Ok, I misread "when there is particular need" as "when there is no
particular need." My bad.

> > >   These files currently have filp_close() called in a dedicated
> > >   work queue, so their __dput() can have no effect on nfsd threads.
> > > 
> > >   This patch discards the work queue and instead has each nfsd thread
> > > >   call filp_close() on as many as 8 files from the filecache each time
> > >   it acts on a client request (or finds there are no pending client
> > >   requests).  If there are more to be closed, more threads are woken.
> > >   This spreads the work of __dput() over multiple threads and imposes
> > >   any cost on those threads.
> > > 
> > >   The number 8 is somewhat arbitrary.  It needs to be greater than 1 to
> > >   ensure that files are closed more quickly than they can be added to
> > >   the cache.  It needs to be small enough to limit the per-request
> > >   delays that will be imposed on clients when all threads are busy
> > >   closing files.
> > 
> > IMO we want to explicitly separate the mechanisms of handling
> > garbage-collected files and non-garbage-collected files.
> 
> I think we already have explicit separation.
> garbage-collected files are handed to nfsd_file_dispose_list_delayed(),
> either when they fall off the lru or through nfsd_file_close_inode() -
> which is used by lease and fsnotify callbacks.
> 
> non-garbage collected files are closed directly by nfsd_file_put().

The separation is more clear to me now. Building this all into a
single patch kind of blurred the edges between the two.

> > In the non-garbage-collected (NFSv4) case, the kthread can wait
> > for everything it has opened to be closed. task_work seems
> > appropriate for that IIUC.
> 
> Agreed.  The task_work change is all that we need for NFSv4.
> 
> > The problem with handling a limited number of garbage-collected
> > items is that once the RPC workload stops, any remaining open
> > files will remain open because garbage collection has effectively
> > stopped. We really need those files closed out within a couple of
> > seconds.
> 
> Why would garbage collection stop?

Because with your patch GC now appears to be driven through
nfsd_file_dispose_some(). I see now that there is a hidden
recursion that wakes more nfsd threads if there's more GC to
be done. So file disposal is indeed not dependent on more
ingress RPC traffic.

The "If there are more to be closed" remark above in the patch
description was ambiguous to me, but I think I get it now.

> nfsd_filecache_laundrette is still running on the system_wq.  It will
> continue to garbage collect and queue files using
> nfsd_file_dispose_list_delayed().
> That will wake up an nfsd thread if none is running.  The thread will
> close a few, but will first wake another thread if there was more than
> it was willing to manage.  So the closing of files should proceed
> promptly, and if any close operation takes a non-trivial amount of time,
> more threads will be woken and work will proceed in parallel.

OK, that is what the svc_wake_up()s are doing.
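
To check my reading of that wake-up chain, here is a small userspace
model of it. This is only a sketch, not code from the patch: the
struct, the function names, and the wake-up counter are all
hypothetical stand-ins, and the only value taken from the patch
description is the batch limit of 8.

```c
#include <assert.h>

#define BATCH_MAX 8	/* per-pass limit from the patch description */

/* Hypothetical stand-in for the per-net disposal queue. */
struct dispose_queue {
	int pending;	/* files waiting for their final close */
	int wakeups;	/* worker wake-ups requested so far */
	int closed;	/* files closed so far */
};

/* Model of the laundrette queueing files and waking one worker. */
static void queue_files(struct dispose_queue *q, int n)
{
	q->pending += n;
	q->wakeups++;
}

/*
 * One worker pass: claim at most BATCH_MAX files, wake another
 * worker first if anything is left over, then close the claimed
 * batch. Returns the number of files closed in this pass.
 */
static int dispose_some(struct dispose_queue *q)
{
	int batch = q->pending < BATCH_MAX ? q->pending : BATCH_MAX;

	q->pending -= batch;
	assert(q->pending >= 0);
	if (q->pending > 0)
		q->wakeups++;	/* models the extra wake-up */
	q->closed += batch;
	return batch;
}

/*
 * Service every outstanding wake-up with one worker pass and
 * return how many passes were needed to empty the queue.
 */
static int drain(struct dispose_queue *q)
{
	int handled = 0;

	while (handled < q->wakeups) {
		handled++;
		dispose_some(q);
	}
	return handled;
}
```

In this model, queueing 20 files and calling drain() takes three
passes (8, 8, then 4), and no further queueing is needed to finish
the job, which matches the claim that disposal does not depend on
more ingress RPC traffic.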

> > And, as we discussed in a previous thread, replacing the per-
> > namespace worker with a parallel mechanism would help GC proceed
> > more quickly to reduce the flush/close backlog for NFSv3.
> 
> This patch discards the per-namespace worker.
> 
> The GC step (searching the LRU list for "garbage") is still
> single-threaded. The filecache is shared by all net-namespaces and
> there is a single GC thread for the filecache.

Agreed.

> Files that are found *were* filp_close()ed by per-net-fs work-items.
> With this patch the filp_close() is called by the nfsd threads.
> 
> The final __fput of those files *was* handled by a single system-wide
> work-item.  With this patch they are called by the nfsd thread which
> called the filp_close().

Fwiw, none of that is obvious to me when looking at the diff.

> > > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > > ---
> > >  fs/nfsd/filecache.c | 62 ++++++++++++++++++---------------------------
> > >  fs/nfsd/filecache.h |  1 +
> > >  fs/nfsd/nfssvc.c    |  6 +++++
> > >  3 files changed, 32 insertions(+), 37 deletions(-)
> > > 
> > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > index ee9c923192e0..55268b7362d4 100644
> > > --- a/fs/nfsd/filecache.c
> > > +++ b/fs/nfsd/filecache.c
> > > @@ -39,6 +39,7 @@
> > >  #include <linux/fsnotify.h>
> > >  #include <linux/seq_file.h>
> > >  #include <linux/rhashtable.h>
> > > +#include <linux/task_work.h>
> > >  
> > >  #include "vfs.h"
> > >  #include "nfsd.h"
> > > @@ -61,13 +62,10 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_total_age);
> > >  static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions);
> > >  
> > >  struct nfsd_fcache_disposal {
> > > -	struct work_struct work;
> > >  	spinlock_t lock;
> > >  	struct list_head freeme;
> > >  };
> > >  
> > > -static struct workqueue_struct *nfsd_filecache_wq __read_mostly;
> > > -
> > >  static struct kmem_cache		*nfsd_file_slab;
> > >  static struct kmem_cache		*nfsd_file_mark_slab;
> > >  static struct list_lru			nfsd_file_lru;
> > > @@ -421,10 +419,31 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose)
> > >  		spin_lock(&l->lock);
> > >  		list_move_tail(&nf->nf_lru, &l->freeme);
> > >  		spin_unlock(&l->lock);
> > > -		queue_work(nfsd_filecache_wq, &l->work);
> > > +		svc_wake_up(nn->nfsd_serv);
> > >  	}
> > >  }
> > >  
> > > +/**
> > > + * nfsd_file_dispose_some
> > 
> > This needs a short description and:
> > 
> >  * @nn: namespace to check
> > 
> > Or something more enlightening than that.
> > 
> > Also, the function name exposes mechanism; I think I'd prefer a name
> > that is more abstract, such as nfsd_file_net_release() ?
> 
> Sometimes exposing mechanism is a good thing.  It means the casual reader
> can get a sense of what the function does without having to look at the
> function.
> So I still prefer my name, but I changed to nfsd_file_net_dispose() so
> as to suit your preference, but follow the established pattern of using the
> word "dispose".  "release" usually just drops a reference.  "dispose"
> makes it clear that the thing is going away now.
> 
> /**
>  * nfsd_file_net_dispose - deal with nfsd_files waiting to be disposed.
>  * @nn: nfsd_net in which to find files to be disposed.
>  *
>  * When files held open for nfsv3 are removed from the filecache, whether

This comment is helpful. But note that we quite purposely do not
refer to NFS versions in filecache.c -- it's either garbage-
collected or not garbage-collected files. IIRC on occasion NFSv3
wants to use a non-garbage-collected file, and NFSv4 might sometimes
use a GC-d file. I've forgotten the details.

>  * due to memory pressure or garbage collection, they are queued to
>  * a per-net-ns queue.  This function completes the disposal, either
>  * directly or by waking another nfsd thread to help with the work.
>  */

I understand why you want to keep this name: this function handles
only garbage-collected files.

I would still like nfsd() to call a wrapper function to handle
the details of closing both types of files rather than open-coding
calling nfsd_file_net_dispose() and task_work_run(), especially
because there is no code comment explaining why the task_work_run()
call is needed. That level of filecache implementation detail
doesn't belong in nfsd().
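
Roughly the shape I have in mind, as a userspace model rather than
kernel code. Every name below is a hypothetical stand-in (the real
calls would be nfsd_file_net_dispose() and task_work_run()), and the
model ignores the per-pass limit of 8:

```c
#include <assert.h>

/* Hypothetical userspace model; none of these are kernel APIs. */
struct model_net {
	int gc_queued;		/* garbage-collected files awaiting close */
};

struct model_thread {
	int deferred_fputs;	/* final-close work deferred to this thread */
};

/*
 * Stand-in for nfsd_file_net_dispose(): close the queued
 * garbage-collected files. Each close leaves its final fput
 * deferred to the calling thread.
 */
static void model_net_dispose(struct model_net *nn, struct model_thread *t)
{
	t->deferred_fputs += nn->gc_queued;
	nn->gc_queued = 0;
}

/* Stand-in for task_work_run(): run this thread's deferred fputs. */
static void model_task_work_run(struct model_thread *t)
{
	t->deferred_fputs = 0;
}

/*
 * The wrapper: the one call the main thread loop would make. It
 * hides both steps and is the natural home for the comment that
 * explains the second one.
 */
static void model_file_close_pending(struct model_net *nn,
				     struct model_thread *t)
{
	model_net_dispose(nn, t);

	/*
	 * Final closes for files this thread released were deferred
	 * to it; run them now so the cost lands on this thread and
	 * the backlog stays bounded by the number of threads.
	 */
	model_task_work_run(t);

	assert(nn->gc_queued == 0 && t->deferred_fputs == 0);
}
```

With something like that, nfsd() makes a single call and the
explanation of the deferred-close step lives next to the filecache
code it describes.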

-- 
Chuck Lever