On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> On Tue, 28 Nov 2023, Chuck Lever wrote:
> > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> > >
> > > I have evidence from a customer site of 256 nfsd threads adding files to
> > > delayed_fput_lists nearly twice as fast they are retired by a single
> > > work-queue thread running delayed_fput(). As you might imagine this
> > > does not end well (20 million files in the queue at the time a snapshot
> > > was taken for analysis).
> > >
> > > While this might point to a problem with the filesystem not handling the
> > > final close efficiently, such problems should only hurt throughput, not
> > > lead to memory exhaustion.
> >
> > I have this patch queued for v6.8:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0
> >
>
> Thanks....
> I think that change is good, but I don't think it addresses the problem
> mentioned in the description, and it is not directly relevant to the
> problem I saw ... though it is complicated.
>
> The problem "workqueue ... hogged cpu..." probably means that
> nfsd_file_dispose_list() needs a cond_resched() call in the loop.
> That will stop it from hogging the CPU whether it is tied to one CPU or
> free to roam.
>
> Also that work is calling filp_close() which primarily calls
> filp_flush().
> It also calls fput() but that does minimal work. If there is much work
> to do then that is offloaded to another work-item. *That* is the
> workitem that I had problems with.
>
> The problem I saw was with an older kernel which didn't have the nfsd
> file cache and so probably is calling filp_close more often.

Without the file cache, the filp_close() should be handled directly
by the nfsd thread handling the RPC, IIRC.

> So maybe
> my patch isn't so important now. Particularly as nfsd now isn't closing
> most files in-task but instead offloads that to another task.
> So the final fput will not be handled by the nfsd task either.
>
> But I think there is room for improvement. Gathering lots of files
> together into a list and closing them sequentially is not going to be as
> efficient as closing them in parallel.

I believe the file cache passes the filps to the work queue one at
a time, but I don't think there's anything that forces the work
queue to handle each flush/close completely before proceeding to the
next. IOW there is some parallelism there already, especially now
that nfsd_filecache_wq is UNBOUND.

> > > For normal threads, the thread that closes the file also calls the
> > > final fput so there is natural rate limiting preventing excessive growth
> > > in the list of delayed fputs. For kernel threads, and particularly for
> > > nfsd, delayed in the final fput do not impose any throttling to prevent
> > > the thread from closing more files.
> >
> > I don't think we want to block nfsd threads waiting for files to
> > close. Won't that be a potential denial of service?
>
> Not as much as the denial of service caused by memory exhaustion due to
> an indefinitely growing list of files waiting to be closed by a single
> thread of workqueue.

The cache garbage collector is single-threaded, but nfsd_filecache_wq
has a max_active setting of zero.

> I think it is perfectly reasonable that when handling an NFSv4 CLOSE,
> the nfsd thread should completely handle that request including all the
> flush and ->release etc. If that causes any denial of service, then
> simple increase the number of nfsd threads.
>
> For NFSv3 it is more complex. On the kernel where I saw a problem the
> filp_close happen after each READ or WRITE (though I think the customer
> was using NFSv4...). With the file cache there is no thread that is
> obviously responsible for the close.
>
> To get the sort of throttling that I think is need, we could possibly
> have each "nfsd_open" check if there are pending closes, and to wait for
> some small amount of progress.

Well nfsd_open() in particular appears to be used only for readdir.

But maybe nfsd_file_acquire() could wait briefly, in the
garbage-collected case, if the nfsd_net's disposal queue is long.

> But don't think it is reasonable for the nfsd threads to take none of
> the burden of closing files as that can result in imbalance.
>
> I'll need to give this more thought.

-- 
Chuck Lever