On Tue, 05 Dec 2023, Chuck Lever wrote:
> On Mon, Dec 04, 2023 at 12:36:42PM +1100, NeilBrown wrote:
> > The work of closing a file can have non-trivial cost. Doing it in a
> > separate work queue thread means that cost isn't imposed on the nfsd
> > threads and an imbalance can be created.
> >
> > I have evidence from a customer site where nfsd is being asked to modify
> > many millions of files, which causes sufficient memory pressure that some
> > cache (in XFS I think) gets cleaned earlier than would be ideal. When
> > __dput (from the workqueue) calls __dentry_kill, xfs_fs_destroy_inode()
> > needs to synchronously read back previously cached info from storage.
> > This slows down the single thread that is making all the final __dput()
> > calls for all the nfsd threads, with the net result that files are added
> > to the delayed_fput_list faster than they are removed, and the system
> > eventually runs out of memory.
> >
> > To avoid this work imbalance that exhausts memory, this patch moves all
> > work for closing files into the nfsd threads. This means that when the
> > work imposes a cost, that cost appears where it would be expected - in
> > the work of the nfsd thread.
>
> Thanks for pursuing this next step in the evolution of the NFSD
> file cache.
>
> Your problem statement should mention whether you have observed the
> issue with an NFSv3 or an NFSv4 workload or if you see this issue
> with both, since those two use cases are handled very differently
> within the file cache implementation.

I have added:

=============
The customer was using NFSv4. I can demonstrate the same problem using
NFSv3 or NFSv4 (which close files in different ways) by adding
msleep(25) for FMODE_WRITE files in __fput(). This simulates slowness
in the final close, and when writing through nfsd it causes
/proc/sys/fs/file-nr to grow without bound.
==============

> >
> > There are two changes to achieve this.
> >
> > 1/ PF_RUNS_TASK_WORK is set and task_work_run() is called, so that the
> > final __dput() for any file closed by a given nfsd thread is handled
> > by that thread. This ensures that the number of files that are
> > queued for a final close is limited by the number of threads and
> > cannot grow without bound.
> >
> > 2/ Files opened for NFSv3 are never explicitly closed by the client and are
> > kept open by the server in the "filecache", which responds to memory
> > pressure, is garbage collected even when there is no pressure, and
> > sometimes closes files when there is particular need such as for
> > rename.
>
> There is a good reason for close-on-rename: IIRC we want to avoid
> triggering a silly-rename on NFS re-exports.
>
> Also, I think we do want to close cached garbage-collected files
> quickly, even without memory pressure. Files left open in this way
> can conflict with subsequent NFSv4 OPENs that might hand out a
> delegation as long as no other clients are using them. Files held
> open by the file cache will interfere with that.

Yes - I agree all this behaviour is appropriate. I was just setting out
the current behaviour of the filecache so that the effect of the
proposed changes would be easier to understand.

> >
> > These files currently have filp_close() called in a dedicated
> > work queue, so their __dput() can have no effect on nfsd threads.
> >
> > This patch discards the work queue and instead has each nfsd thread
> > call filp_close() on as many as 8 files from the filecache each time
> > it acts on a client request (or finds there are no pending client
> > requests). If there are more to be closed, more threads are woken.
> > This spreads the work of __dput() over multiple threads and imposes
> > any cost on those threads.
> >
> > The number 8 is somewhat arbitrary. It needs to be greater than 1 to
> > ensure that files are closed more quickly than they can be added to
> > the cache.
> > It needs to be small enough to limit the per-request
> > delays that will be imposed on clients when all threads are busy
> > closing files.
>
> IMO we want to explicitly separate the mechanisms of handling
> garbage-collected files and non-garbage-collected files.

I think we already have explicit separation.

Garbage-collected files are handed to nfsd_file_dispose_list_delayed(),
either when they fall off the lru or through nfsd_file_close_inode() -
which is used by lease and fsnotify callbacks.

Non-garbage-collected files are closed directly by nfsd_file_put().

>
> In the non-garbage-collected (NFSv4) case, the kthread can wait
> for everything it has opened to be closed. task_work seems
> appropriate for that IIUC.

Agreed. The task_work change is all that we need for NFSv4.

>
> The problem with handling a limited number of garbage-collected
> items is that once the RPC workload stops, any remaining open
> files will remain open because garbage collection has effectively
> stopped. We really need those files closed out within a couple of
> seconds.

Why would garbage collection stop? nfsd_filecache_laundrette is still
running on the system_wq. It will continue to garbage collect and queue
files using nfsd_file_dispose_list_delayed(). That will wake up an nfsd
thread if none is running. The thread will close a few, but will first
wake another thread if there is more than it is willing to manage.

So the closing of files should proceed promptly, and if any close
operation takes a non-trivial amount of time, more threads will be
woken and work will proceed in parallel.

>
> We used to have watermarks in the nfsd_file_put() path to kick
> garbage-collection if there were too many open files. Instead,
> waiting for the GC thread to make progress before recycling the
> kthread might be beneficial.

"Too many" is only meaningful in the context of memory usage. Having
the shrinker callback is exactly the right way to address this -
nothing else is needed.
The GC thread is expected to be CPU intensive. The main cause of delay
is skipping over lots of files that cannot be closed yet - looking for
files that can. This could delay the closing of files, but not nearly
as much as the delays I saw caused by synchronous IO.

We might be able to improve the situation a bit by queuing files as
soon as list_lru_walk finds them, rather than gathering them all into a
list and then queuing them one by one from that list. It isn't clear to
me that there is an issue here that needs fixing.

>
> And, as we discussed in a previous thread, replacing the per-
> namespace worker with a parallel mechanism would help GC proceed
> more quickly to reduce the flush/close backlog for NFSv3.

This patch discards the per-namespace worker.

The GC step (searching the LRU list for "garbage") is still
single-threaded. The filecache is shared by all net-namespaces and
there is a single GC thread for the filecache.

Files that are found *were* filp_close()ed by per-net-ns work-items.
With this patch the filp_close() is called by the nfsd threads.

The final __fput() of those files *was* handled by a single system-wide
work-item. With this patch they are called by the nfsd thread which
called the filp_close().
> >
> > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > ---
> >  fs/nfsd/filecache.c | 62 ++++++++++++++++++---------------------------
> >  fs/nfsd/filecache.h |  1 +
> >  fs/nfsd/nfssvc.c    |  6 +++++
> >  3 files changed, 32 insertions(+), 37 deletions(-)
> >
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index ee9c923192e0..55268b7362d4 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -39,6 +39,7 @@
> >  #include <linux/fsnotify.h>
> >  #include <linux/seq_file.h>
> >  #include <linux/rhashtable.h>
> > +#include <linux/task_work.h>
> >
> >  #include "vfs.h"
> >  #include "nfsd.h"
> > @@ -61,13 +62,10 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_total_age);
> >  static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions);
> >
> >  struct nfsd_fcache_disposal {
> > -	struct work_struct work;
> >  	spinlock_t lock;
> >  	struct list_head freeme;
> >  };
> >
> > -static struct workqueue_struct *nfsd_filecache_wq __read_mostly;
> > -
> >  static struct kmem_cache *nfsd_file_slab;
> >  static struct kmem_cache *nfsd_file_mark_slab;
> >  static struct list_lru nfsd_file_lru;
> > @@ -421,10 +419,31 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose)
> >  		spin_lock(&l->lock);
> >  		list_move_tail(&nf->nf_lru, &l->freeme);
> >  		spin_unlock(&l->lock);
> > -		queue_work(nfsd_filecache_wq, &l->work);
> > +		svc_wake_up(nn->nfsd_serv);
> >  	}
> >  }
> >
> > +/**
> > + * nfsd_file_dispose_some
>
> This needs a short description and:
>
>  * @nn: namespace to check
>
> Or something more enlightening than that.
>
> Also, the function name exposes mechanism; I think I'd prefer a name
> that is more abstract, such as nfsd_file_net_release() ?

Sometimes exposing mechanism is a good thing. It means the casual
reader can get a sense of what the function does without having to
look at the function.

So I still prefer my name, but I changed it to nfsd_file_net_dispose()
so as to suit your preference, but follow the established pattern of
using the word "dispose".
"release" usually just drops a reference. "dispose" makes it clear
that the thing is going away now.

/**
 * nfsd_file_net_dispose - deal with nfsd_files waiting to be disposed.
 * @nn: nfsd_net in which to find files to be disposed.
 *
 * When files held open for nfsv3 are removed from the filecache, whether
 * due to memory pressure or garbage collection, they are queued to
 * a per-net-ns queue. This function completes the disposal, either
 * directly or by waking another nfsd thread to help with the work.
 */

> > + *
> > + */
> > +void nfsd_file_dispose_some(struct nfsd_net *nn)
> > +{
> > +	struct nfsd_fcache_disposal *l = nn->fcache_disposal;
> > +	LIST_HEAD(dispose);
> > +	int i;
> > +
> > +	if (list_empty(&l->freeme))
> > +		return;
> > +	spin_lock(&l->lock);
> > +	for (i = 0; i < 8 && !list_empty(&l->freeme); i++)
> > +		list_move(l->freeme.next, &dispose);
> > +	spin_unlock(&l->lock);
> > +	if (!list_empty(&l->freeme))
> > +		svc_wake_up(nn->nfsd_serv);
> > +	nfsd_file_dispose_list(&dispose);

..

> > @@ -949,6 +950,7 @@ nfsd(void *vrqstp)
> >  	}
> >
> >  	current->fs->umask = 0;
> > +	current->flags |= PF_RUNS_TASK_WORK;
> >
> >  	atomic_inc(&nfsdstats.th_cnt);
> >
> > @@ -963,6 +965,10 @@ nfsd(void *vrqstp)
> >
> >  		svc_recv(rqstp);
> >  		validate_process_creds();
> > +
> > +		nfsd_file_dispose_some(nn);
> > +		if (task_work_pending(current))
> > +			task_work_run();
>
> I'd prefer that these task_work details reside inside
> nfsd_file_dispose_some(), or whatever we want to call it ...

I don't agree. They are performing quite separate tasks.

nfsd_file_net_dispose() is disposing files queued for this net.

task_work_run() is finalising the close of any file closed by this
thread, including those used for NFSv4 that are not touched by
nfsd_file_net_dispose().

I don't think they belong in the same function.

Thanks,
NeilBrown