On Mon, Dec 11, 2023 at 09:47:35AM +1100, NeilBrown wrote: > On Sat, 09 Dec 2023, Chuck Lever wrote: > > On Fri, Dec 08, 2023 at 02:27:26PM +1100, NeilBrown wrote: > > > Calling fput() directly or though filp_close() from a kernel thread like > > > nfsd causes the final __fput() (if necessary) to be called from a > > > workqueue. This means that nfsd is not forced to wait for any work to > > > complete. If the ->release of ->destroy_inode function is slow for any > > > reason, this can result in nfsd closing files more quickly than the > > > workqueue can complete the close and the queue of pending closes can > > > grow without bounces (30 million has been seen at one customer site, > > > though this was in part due to a slowness in xfs which has since been > > > fixed). > > > > > > nfsd does not need this. > > > > That is technically true, but IIUC, there is only one case where a > > synchronous close matters for the backlog problem, and that's when > > nfsd_file_free() is called from nfsd_file_put(). AFAICT all other > > call sites (except rename) are error paths, so there aren't negative > > consequences for the lack of synchronous wait there... > > What you say is technically true but it isn't the way I see it. > > Firstly I should clarify that __fput_sync() is *not* a flushing close as > you describe it below. > All it does, apart for some trivial book-keeping, is to call ->release > and possibly ->destroy_inode immediately rather than shunting them off > to another thread. > Apparently ->release sometimes does something that can deadlock with > some kernel threads or if some awkward locks are held, so the whole > final __fput is delay by default. But this does not apply to nfsd. > Standard fput() is really the wrong interface for nfsd to use. > It should use __fput_sync() (which shouldn't have such a scary name). > > The comment above flush_delayed_fput() seems to suggest that unmounting > is a core issue. Maybe the fact that __fput() can call > dissolve_on_fput() is a reason why it is sometimes safer to leave the > work to later. But I don't see that applying to nfsd. > > Of course a ->release function *could* do synchronous writes just like > the XFS ->destroy_inode function used to do synchronous reads. I had assumed ->release for NFS re-export would flush due to close- to-open semantics. There seem to be numerous corner cases that might result in pile-ups which would change the situation in your problem statement but might not result in an overall improvement. > I don't think we should ever try to hide that by putting it in > a workqueue. It's probably a bug and it is best if bugs are visible. I'm not objecting, per se, to this change. I would simply like to see a little more due diligence before moving forward until it is clear how frequently ->release or ->destroy_inode will do I/O (or "is slow for any reason" as you say above). > Note that the XFS ->release function does call filemap_flush() in some > cases, but that is an async flush, so __fput_sync doesn't wait for the > flush to complete. When Jeff was working on the file cache a year ago, I did some performance analysis that shows even an async flush is costly when there is a lot of dirty data in the file being closed. The VFS walks through the whole file and starts I/O on every dirty page. This is quite CPU intensive, and can take on the order of a millisecond before the async flush request returns to its caller. IME async flushes are not free. > The way I see this patch is that fput() is the wrong interface for nfsd > to use, __fput_sync is the right interface. So we should change. 1 > patch. The practical matter is I see this as a change with a greater than zero risk, and we need to mitigate that risk. Or rather, as a maintainer of NFSD, /I/ need to see that the risk is as minimal as is practical. > The details about exhausting memory explain a particular symptom that > motivated the examination which revealed that nfsd was using the wrong > interface. > > If we have nfsd sometimes using fput() and sometimes __fput_sync, then > we need to have clear rules for when to use which. It is much easier to > have a simple rule: always use __fput_sync(). I don't agree that we should just flop all these over and hope for the best. In particular: - the changes in fs/nfsd/filecache.c appear to revert a bug fix, so I need to see data that shows that change doesn't cause a re-regression - the changes in fs/lockd/ can result in long waits while a global mutex is held (global as in all namespaces and all locked files on the server), so I need to see data that demonstrates there won't be a regression - the other changes don't appear to have motivation in terms of performance or behavior, and carry similar (if lesser) risks as the other two changes. My preferred solution to potential auditor confusion about the use of __fput_sync() in some places and fput() in others is to document, and leave call sites alone if there's no technical reason to change them at this time. There is enough of a risk of regression that I need to see a clear rationale for each hunk /and/ I need to see data that there is no regression. I know that won't be perfect coverage, but it's better than not having any data at all. > I'm certainly happy to revise function documentation and provide > wrapper functions if needed. > > It might be good to have > > void filp_close_sync(struct file *f) > { > get_file(f); > filp_close(f); > __fput_sync(f); > } > > but as that would only be called once, it was hard to motivate. > Having it in linux/fs.h would be nice. > > Similarly we could wrap __fput_sync() in a more friendly name, but > that would be better if we actually renamed the function. > > void fput_now(struct file *f) > { > __fput_sync(f); > } > > ?? Since this is an issue strictly for nfsd, the place for this utility function is in fs/nfsd/vfs.c, IMO, along with a documenting comment that provides a rationale for why nfsd does not want plain fput() in specific cases. When other subsystems need a similar capability, then let's consider a common helper. -- Chuck Lever