On Thu, Nov 30, 2023 at 12:47:58PM -0500, Jeff Layton wrote:
> On Wed, 2023-11-29 at 09:04 -0500, Chuck Lever wrote:
> > On Wed, Nov 29, 2023 at 10:20:23AM +1100, NeilBrown wrote:
> > > On Wed, 29 Nov 2023, Christian Brauner wrote:
> > > > [Reusing the trimmed Cc]
> > > > 
> > > > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> > > > > On Tue, 28 Nov 2023, Chuck Lever wrote:
> > > > > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> > > > > > > 
> > > > > > > I have evidence from a customer site of 256 nfsd threads adding
> > > > > > > files to delayed_fput_lists nearly twice as fast as they are
> > > > > > > retired by a single work-queue thread running delayed_fput().
> > > > > > > As you might imagine this does not end well (20 million files in
> > > > > > > the queue at the time a snapshot was taken for analysis).
> > > > > > > 
> > > > > > > While this might point to a problem with the filesystem not
> > > > > > > handling the final close efficiently, such problems should only
> > > > > > > hurt throughput, not lead to memory exhaustion.
> > > > > > 
> > > > > > I have this patch queued for v6.8:
> > > > > > 
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0
> > > > > > 
> > > > > 
> > > > > Thanks....
> > > > > I think that change is good, but I don't think it addresses the
> > > > > problem mentioned in the description, and it is not directly
> > > > > relevant to the problem I saw ... though it is complicated.
> > > > > 
> > > > > The problem "workqueue ... hogged cpu..." probably means that
> > > > > nfsd_file_dispose_list() needs a cond_resched() call in the loop.
> > > > > That will stop it from hogging the CPU whether it is tied to one
> > > > > CPU or free to roam.
> > > > > 
> > > > > Also that work is calling filp_close(), which primarily calls
> > > > > filp_flush().
> > > > > It also calls fput(), but that does minimal work. If there is much
> > > > > work to do then that is offloaded to another work-item. *That* is
> > > > > the work-item that I had problems with.
> > > > > 
> > > > > The problem I saw was with an older kernel which didn't have the
> > > > > nfsd file cache and so probably was calling filp_close more often.
> > > > > So maybe my patch isn't so important now. Particularly as nfsd now
> > > > > isn't closing most files in-task but instead offloads that to
> > > > > another task, so the final fput will not be handled by the nfsd
> > > > > task either.
> > > > > 
> > > > > But I think there is room for improvement. Gathering lots of files
> > > > > together into a list and closing them sequentially is not going to
> > > > > be as efficient as closing them in parallel.
> > > > > 
> > > > > > > For normal threads, the thread that closes the file also calls
> > > > > > > the final fput, so there is natural rate limiting preventing
> > > > > > > excessive growth in the list of delayed fputs. For kernel
> > > > > > > threads, and particularly for nfsd, delays in the final fput do
> > > > > > > not impose any throttling to prevent the thread from closing
> > > > > > > more files.
> > > > > > 
> > > > > > I don't think we want to block nfsd threads waiting for files to
> > > > > > close. Won't that be a potential denial of service?
> > > > > 
> > > > > Not as much as the denial of service caused by memory exhaustion
> > > > > due to an indefinitely growing list of files waiting to be closed
> > > > > by a single workqueue thread.
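Just to make sure I'm reading the cond_resched() suggestion above
correctly, I take it to mean something along these lines -- a rough
sketch only, not the actual nfsd-next code, and nfsd_file_free() here
is just a stand-in for whatever the dispose loop really calls:

static void nfsd_file_dispose_list(struct list_head *dispose)
{
        struct nfsd_file *nf;

        while (!list_empty(dispose)) {
                nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
                list_del_init(&nf->nf_lru);
                nfsd_file_free(nf);     /* may end up in filp_close() */
                cond_resched();         /* yield, bound to one CPU or not */
        }
}
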
> > > > 
> > > > It seems less likely that you run into memory exhaustion than a DOS
> > > > because nfsd() is busy closing fds. Especially because you default
> > > > to a single nfsd thread afaict.
> > > 
> > > An nfsd thread would not end up being busy closing fds any more than
> > > it can already be busy reading data or busy syncing out changes or
> > > busy renaming a file.
> > > Which is to say: of course it can be busy doing this, but doing this
> > > sort of thing is its whole purpose in life.
> > > 
> > > If an nfsd thread only completes the closes that it initiated (which
> > > is what I am currently proposing), then there would be at most one,
> > > or maybe two, fds to close after handling each request.
> > 
> > Closing files more aggressively would seem to entirely defeat the
> > purpose of the file cache, which is to avoid the overhead of opens
> > and closes on frequently-used files.
> > 
> > And usually Linux prefers to let the workload consume as many free
> > resources as possible before it applies back pressure or cache
> > eviction.
> > 
> > IMO the first step should be removing head-of-queue blocking from
> > the file cache's background closing mechanism. That might be enough
> > to avoid forming a backlog in most cases.
> 
> That's not quite what task_work does. Neil's patch wouldn't result in
> closes happening more aggressively. It would just make it so that we
> don't queue the delayed part of the fput process to a workqueue like
> we do today.
> 
> Instead, the nfsd threads would have to clean that part up themselves,
> like syscalls do before returning to userland. I think that idea makes
> sense overall since that mirrors what we already do in userland.
> 
> In the event that all of the nfsd threads are tied up in slow
> task_work jobs... tough luck. That at least makes it more of a
> self-limiting problem, since RPCs will start queueing rather than
> allowing dead files to just pile onto the list.

Thanks for helping me understand the proposal. task_work would cause
nfsd threads to wait for flush/close operations that others have
already started; it would not increase the rate of closing cached file
descriptors.

The thing that nfsd_filesystem_wq does is compartmentalize the
flush/close workload so that a heavy flush/close workload in one net
namespace does not negatively impact other namespaces.

IIUC, then, task_work does not discriminate between namespaces -- if
one namespace is creating a backlog of dirty files to close, all nfsd
threads would need to handle that backlog, and thus all namespaces
would bear (a part of) that backlog.

-- 
Chuck Lever
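
P.S. For completeness, a from-memory sketch of the dichotomy being
discussed: fput() lets a userland task finish the final fput itself
via task_work before it returns to userspace, which naturally
throttles that task, while a kernel thread such as nfsd pushes the
file onto the shared delayed_fput list that a single work item
drains. This is a paraphrase, not verbatim kernel source, and the
struct file field names are approximate:

void fput(struct file *file)
{
        if (!atomic_long_dec_and_test(&file->f_count))
                return;

        if (likely(!in_interrupt() && !(current->flags & PF_KTHREAD))) {
                /* Userland task: run ____fput() from task_work on the
                 * way back to userspace -- the closer pays the cost. */
                init_task_work(&file->f_task_work, ____fput);
                if (!task_work_add(current, &file->f_task_work, TWA_RESUME))
                        return;
                /* task_work_add() fails only for an exiting task;
                 * fall through to the delayed path below. */
        }

        /* Kernel thread (e.g. nfsd): punt to one global list drained by
         * a single delayed_fput() work item -- the list that grew to
         * ~20 million entries in Neil's report. */
        if (llist_add(&file->f_llist, &delayed_fput_list))
                schedule_delayed_work(&delayed_fput_work, 1);
}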