Re: [PATCH 1/3] nfsd: use __fput_sync() to avoid delayed closing of files.

Chuck Lever <chuck.lever@xxxxxxxxxx> · Tue, 12 Dec 2023 11:17:13 -0500

On Tue, Dec 12, 2023 at 09:04:58AM +1100, NeilBrown wrote:
> On Tue, 12 Dec 2023, Chuck Lever wrote:
> > On Mon, Dec 11, 2023 at 09:47:35AM +1100, NeilBrown wrote:
> > > On Sat, 09 Dec 2023, Chuck Lever wrote:
> > > > On Fri, Dec 08, 2023 at 02:27:26PM +1100, NeilBrown wrote:
> > > > > Calling fput() directly or though filp_close() from a kernel thread like
> > > > > nfsd causes the final __fput() (if necessary) to be called from a
> > > > > workqueue.  This means that nfsd is not forced to wait for any work to
> > > > > complete.  If the ->release of ->destroy_inode function is slow for any
> > > > > reason, this can result in nfsd closing files more quickly than the
> > > > > workqueue can complete the close and the queue of pending closes can
> > > > > grow without bounces (30 million has been seen at one customer site,
> > > > > though this was in part due to a slowness in xfs which has since been
> > > > > fixed).
> > > > > 
> > > > > nfsd does not need this.
> > > > 
> > > > That is technically true, but IIUC, there is only one case where a
> > > > synchronous close matters for the backlog problem, and that's when
> > > > nfsd_file_free() is called from nfsd_file_put(). AFAICT all other
> > > > call sites (except rename) are error paths, so there aren't negative
> > > > consequences for the lack of synchronous wait there...
> > > 
> > > What you say is technically true but it isn't the way I see it.
> > > 
> > > Firstly I should clarify that __fput_sync() is *not* a flushing close as
> > > you describe it below.
> > > All it does, apart for some trivial book-keeping, is to call ->release
> > > and possibly ->destroy_inode immediately rather than shunting them off
> > > to another thread.
> > > Apparently ->release sometimes does something that can deadlock with
> > > some kernel threads or if some awkward locks are held, so the whole
> > > final __fput is delay by default.  But this does not apply to nfsd.
> > > Standard fput() is really the wrong interface for nfsd to use.  
> > > It should use __fput_sync() (which shouldn't have such a scary name).
> > > 
> > > The comment above flush_delayed_fput() seems to suggest that unmounting
> > > is a core issue.  Maybe the fact that __fput() can call
> > > dissolve_on_fput() is a reason why it is sometimes safer to leave the
> > > work to later.  But I don't see that applying to nfsd.
> > > 
> > > Of course a ->release function *could* do synchronous writes just like
> > > the XFS ->destroy_inode function used to do synchronous reads.
> > 
> > I had assumed ->release for NFS re-export would flush due to close-
> > to-open semantics. There seem to be numerous corner cases that
> > might result in pile-ups which would change the situation in your
> > problem statement but might not result in an overall improvement.
> 
> That's the ->flush call in filp_close().
> 
> > > I don't think we should ever try to hide that by putting it in
> > > a workqueue.  It's probably a bug and it is best if bugs are visible.
> > 
> > 
> > I'm not objecting, per se, to this change. I would simply like to
> > see a little more due diligence before moving forward until it is
> > clear how frequently ->release or ->destroy_inode will do I/O (or
> > "is slow for any reason" as you say above).
> > 
> > 
> > > Note that the XFS ->release function does call filemap_flush() in some
> > > cases, but that is an async flush, so __fput_sync doesn't wait for the
> > > flush to complete.
> > 
> > When Jeff was working on the file cache a year ago, I did some
> > performance analysis that shows even an async flush is costly when
> > there is a lot of dirty data in the file being closed. The VFS walks
> > through the whole file and starts I/O on every dirty page. This is
> > quite CPU intensive, and can take on the order of a millisecond
> > before the async flush request returns to its caller.
> > 
> > IME async flushes are not free.
> 
> True, they aren't free.  But some thread has to pay that price.
> I think nfsd should.

An async flush can be as expensive as a synchronous flush in some
cases. I'm not convinced that simply because a flush happens to be
asynchronous, that makes it harmless to do at scale in the
foreground.

Our original desire way back when was to insert tactical async
flushes during UNSTABLE WRITEs to get the server writing dirty data
to storage sooner. It had an immediate negative throughput and
latency impact.

I have no philosophical disagreement about nfsd threads doing more
work during a file close. I'm just saying that moving even an async
flush from a worker to an nfsd thread might be visible to a client
workload.

> You might argue that nfsd should wait to pay the price until after it
> has sent a reply to the client.  My patches already effectively do that
> for garbage-collected files.  Doing it for all files would probably be
> easy. But is it really worth the (small) complexity?  I don't know.
>
> > > The way I see this patch is that fput() is the wrong interface for nfsd
> > > to use, __fput_sync is the right interface.  So we should change.  1
> > > patch.
> > 
> > The practical matter is I see this as a change with a greater than
> > zero risk, and we need to mitigate that risk. Or rather, as a
> > maintainer of NFSD, /I/ need to see that the risk is as minimal as
> > is practical.
> > 
> > 
> > > The details about exhausting memory explain a particular symptom that
> > > motivated the examination which revealed that nfsd was using the wrong
> > > interface.
> > > 
> > > If we have nfsd sometimes using fput() and sometimes __fput_sync, then
> > > we need to have clear rules for when to use which.  It is much easier to
> > > have a simple rule: always use __fput_sync().
> > 
> > I don't agree that we should just flop all these over and hope for
> > the best. In particular:
> > 
> >  - the changes in fs/nfsd/filecache.c appear to revert a bug
> >    fix, so I need to see data that shows that change doesn't
> >    cause a re-regression
> 
> The bug fix you refer to is
>   "nfsd: don't fsync nfsd_files on last close"
> The patch doesn't change when fsync (or ->flush) is called, so
> it doesn't revert this bugfix.

If the async flush is being done directly by nfsd_file_free instead
of deferred to a work queue, that will slow down file closing, IMO,
in particular in the single-threaded GC case.

Again, not an objection, but we need to be aware of the impact of
this change, and if it is negative, try to mitigate it. If that
mitigation takes the form of patch 2/3, then maybe that patch needs
to be applied before this one.

> >  - the changes in fs/lockd/ can result in long waits while a
> >    global mutex is held (global as in all namespaces and all
> >    locked files on the server), so I need to see data that
> >    demonstrates there won't be a regression
> 
> It's probably impossible to provide any such data.
> The patch certainly moves work inside that mutex and so would increase
> the hold time, if only slightly.  Is that lock hot enough to notice?
> Conventional wisdom is that locking is only a tiny fraction of NFS
> traffic.  It might be possible to construct a workload that saturates
> lockd, but I doubt it would be relevant to the real world.

I have seen, in the past, customer workloads that are nothing but a
stream of NLM lock and unlock requests, and the business requirement
was that those need to go as fast as possible. Inserting I/O (even
occasional asynchronous I/O) would result in an unwanted regression
for that kind of workload.

> Maybe we should just break up that lock so that the problem becomes moot.

Let's drop this hunk for now until we (I) have a way to assess this
change properly. I'm just not convinced this hunk is going to be a
harmless change in some cases, and overall it doesn't appear to be
necessary to address the close back-pressure concern.

Replacing fput() here might be done after the file table data
structure becomes more parallel. Maybe an rhashtable would be
suitable.

> >  - the other changes don't appear to have motivation in terms
> >    of performance or behavior, and carry similar (if lesser)
> >    risks as the other two changes. My preferred solution to
> >    potential auditor confusion about the use of __fput_sync()
> >    in some places and fput() in others is to document, and
> >    leave call sites alone if there's no technical reason to
> >    change them at this time.
> 
> Sounds to me like a good way to grow technical debt, but I'll do it like
> that if you prefer.

Yes, I prefer a helper with an explanation of why nfsd uses
__fput_sync() for certain cases. I'm going to take the admonition
in __fput_sync()'s kdoc comment seriously.

-- 
Chuck Lever