On Thu, 2023-11-30 at 13:07 -0500, Chuck Lever wrote:
> On Thu, Nov 30, 2023 at 12:47:58PM -0500, Jeff Layton wrote:
> > On Wed, 2023-11-29 at 09:04 -0500, Chuck Lever wrote:
> > > On Wed, Nov 29, 2023 at 10:20:23AM +1100, NeilBrown wrote:
> > > > On Wed, 29 Nov 2023, Christian Brauner wrote:
> > > > > [Reusing the trimmed Cc]
> > > > >
> > > > > On Tue, Nov 28, 2023 at 11:16:06AM +1100, NeilBrown wrote:
> > > > > > On Tue, 28 Nov 2023, Chuck Lever wrote:
> > > > > > > On Tue, Nov 28, 2023 at 09:05:21AM +1100, NeilBrown wrote:
> > > > > > > >
> > > > > > > > I have evidence from a customer site of 256 nfsd threads adding files to delayed_fput_lists nearly twice as fast as they are retired by a single work-queue thread running delayed_fput(). As you might imagine this does not end well (20 million files in the queue at the time a snapshot was taken for analysis).
> > > > > > > >
> > > > > > > > While this might point to a problem with the filesystem not handling the final close efficiently, such problems should only hurt throughput, not lead to memory exhaustion.
> > > > > > >
> > > > > > > I have this patch queued for v6.8:
> > > > > > >
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-next&id=c42661ffa58acfeaf73b932dec1e6f04ce8a98c0
> > > > > >
> > > > > > Thanks....
> > > > > > I think that change is good, but I don't think it addresses the problem mentioned in the description, and it is not directly relevant to the problem I saw ... though it is complicated.
> > > > > >
> > > > > > The problem "workqueue ... hogged cpu..." probably means that nfsd_file_dispose_list() needs a cond_resched() call in the loop. That will stop it from hogging the CPU whether it is tied to one CPU or free to roam.
> > > > > >
> > > > > > Also that work is calling filp_close(), which primarily calls filp_flush(). It also calls fput(), but that does minimal work. If there is much work to do then that is offloaded to another work-item. *That* is the work-item that I had problems with.
> > > > > >
> > > > > > The problem I saw was with an older kernel which didn't have the nfsd file cache and so probably was calling filp_close() more often. So maybe my patch isn't so important now, particularly as nfsd now isn't closing most files in-task but instead offloads that to another task. So the final fput will not be handled by the nfsd task either.
> > > > > >
> > > > > > But I think there is room for improvement. Gathering lots of files together into a list and closing them sequentially is not going to be as efficient as closing them in parallel.
> > > > > >
> > > > > > > > For normal threads, the thread that closes the file also calls the final fput, so there is natural rate limiting preventing excessive growth in the list of delayed fputs. For kernel threads, and particularly for nfsd, delays in the final fput do not impose any throttling to prevent the thread from closing more files.
> > > > > > >
> > > > > > > I don't think we want to block nfsd threads waiting for files to close. Won't that be a potential denial of service?
> > > > > >
> > > > > > Not as much as the denial of service caused by memory exhaustion due to an indefinitely growing list of files waiting to be closed by a single workqueue thread.
> > > > >
> > > > > It seems less likely that you run into memory exhaustion than a DOS because nfsd() is busy closing fds. Especially because you default to a single nfsd thread afaict.
> > > >
> > > > An nfsd thread would not end up being busy closing fds any more than it can already be busy reading data, busy syncing out changes, or busy renaming a file. Which is to say: of course it can be busy doing this, but doing this sort of thing is its whole purpose in life.
> > > >
> > > > If an nfsd thread only completes the closes that it initiated (which is what I am currently proposing) then there would be at most one, or maybe two, fds to close after handling each request.
> > >
> > > Closing files more aggressively would seem to entirely defeat the purpose of the file cache, which is to avoid the overhead of opens and closes on frequently-used files.
> > >
> > > And usually Linux prefers to let the workload consume as many free resources as possible before it applies back pressure or cache eviction.
> > >
> > > IMO the first step should be removing head-of-queue blocking from the file cache's background closing mechanism. That might be enough to avoid forming a backlog in most cases.
> >
> > That's not quite what task_work does. Neil's patch wouldn't result in closes happening more aggressively. It would just make it so that we don't queue the delayed part of the fput process to a workqueue like we do today.
> >
> > Instead, the nfsd threads would have to clean that part up themselves, like syscalls do before returning to userland. I think that idea makes sense overall, since it mirrors what we already do in userland.
> >
> > In the event that all of the nfsd threads are tied up in slow task_work jobs... tough luck. That at least makes it more of a self-limiting problem, since RPCs will start queueing rather than dead files just piling onto the list.
>
> Thanks for helping me understand the proposal. task_work would cause nfsd threads to wait for flush/close operations that others have already started; it would not increase the rate of closing cached file descriptors.
>

Note that task_work is completely a per-task thing. Each nfsd thread now becomes responsible for cleaning up the files that were closed as part of processing its last RPC. Closes that occur during the processing of an RPC would be finished before the thread picks up a new RPC to process, but I don't see how the thread would be responsible for closes that happen in a different task altogether.

> The thing that nfsd_filesystem_wq does is compartmentalize the flush/close workload so that a heavy flush/close workload in one net namespace does not negatively impact other namespaces. IIUC, then, task_work does not discriminate between namespaces -- if one namespace is creating a backlog of dirty files to close, all nfsd threads would need to handle that backlog, and thus all namespaces would bear (a part of) that backlog.
>

Closes that are queued to the nfsd_filesystem_wq won't be affected by this patch, since those files get closed in the context of a workqueue thread. In most cases, those are v2/3 files that have timed out.
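
To make the task_work idea above a bit more concrete, here is a rough, untested sketch of the sort of per-thread cleanup being discussed. This is not Neil's actual patch: example_svc_thread() and handle_one_rpc() are made-up placeholders, and it assumes fput() has also been changed so that nfsd threads queue their final fputs as task_work rather than onto the global delayed-fput list:

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/task_work.h>

/* Placeholder for svc_recv()-style request handling; may fput() files. */
static void handle_one_rpc(void *data);

/*
 * Illustrative only: an nfsd-style service thread that retires its own
 * deferred fputs after each RPC, rather than letting them pile up on a
 * list that a single workqueue worker has to drain.
 */
static int example_svc_thread(void *data)
{
	while (!kthread_should_stop()) {
		handle_one_rpc(data);

		/*
		 * Finish this thread's own pending closes before picking up
		 * the next request, mirroring what a userland task does on
		 * return to userspace.
		 */
		if (task_work_pending(current))
			task_work_run();
	}
	return 0;
}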
FWIW, I expect that this patch will mostly affect NFSv4, since those files get closed more directly by nfsd. I wonder if we might want to consider doing something like this with lockd as well.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>
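
For completeness, the cond_resched() that Neil suggests adding to nfsd_file_dispose_list() earlier in the thread would be along these lines. This is an untested sketch; the loop body is paraphrased from a recent fs/nfsd/filecache.c and may not match the current source exactly:

static void
nfsd_file_dispose_list(struct list_head *dispose)
{
	struct nfsd_file *nf;

	while (!list_empty(dispose)) {
		nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
		list_del_init(&nf->nf_lru);
		nfsd_file_free(nf);
		cond_resched();	/* yield so a long list can't hog the CPU */
	}
}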