> On Nov 2, 2022, at 2:07 PM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote: > > > >> On Nov 2, 2022, at 12:58 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote: >> >> On Tue, 2022-11-01 at 18:42 -0400, Jeff Layton wrote: >>> On Wed, 2022-11-02 at 09:05 +1100, NeilBrown wrote: >>>> On Wed, 02 Nov 2022, Jeff Layton wrote: >>>>> On Wed, 2022-11-02 at 08:23 +1100, NeilBrown wrote: >>>>>> On Wed, 02 Nov 2022, Jeff Layton wrote: >>>>>>> The filecache refcounting is a bit non-standard for something searchable >>>>>>> by RCU, in that we maintain a sentinel reference while it's hashed. This >>>>>>> in turn requires that we have to do things differently in the "put" >>>>>>> depending on whether its hashed, which we believe to have led to races. >>>>>>> >>>>>>> There are other problems in here too. nfsd_file_close_inode_sync can end >>>>>>> up freeing an nfsd_file while there are still outstanding references to >>>>>>> it, and there are a number of subtle ToC/ToU races. >>>>>>> >>>>>>> Rework the code so that the refcount is what drives the lifecycle. When >>>>>>> the refcount goes to zero, then unhash and rcu free the object. >>>>>>> >>>>>>> With this change, the LRU carries a reference. Take special care to >>>>>>> deal with it when removing an entry from the list. >>>>>> >>>>>> The refcounting and lru management all look sane here. >>>>>> >>>>>> You need to have moved the final put (and corresponding fsync) to >>>>>> different threads. I think I see you and Chuck discussing that and I >>>>>> have no sense of what is "right". >>>>>> >>>>> >>>>> Yeah, this is a tough call. I get Chuck's reticence. >>>>> >>>>> One thing we could consider is offloading the SYNC_NONE writeback >>>>> submission to a workqueue. I'm not sure though whether that's a win -- >>>>> it might just add needless context switches. OTOH, that would make it >>>>> fairly simple to kick off writeback when the REFERENCED flag is cleared, >>>>> which would probably be the best time to do it. >>>>> >>>>> An entry that ends up being harvested by the LRU scanner is going to be >>>>> touched by it at least twice: once to clear the REFERENCED flag, and >>>>> again ~2s later to reap it. >>>>> >>>>> If we schedule writeback when we clear the flag then we have a pretty >>>>> good indication that nothing else is going to be using it (though I >>>>> think we need to clear REFERENCED even when nfsd_file_check_writeback >>>>> returns true -- I'll fix that in the coming series). >>>>> >>>>> In any case, I'd probably like to do that sort of change in a separate >>>>> series after we get the first part sorted. >>>>> >>>>>> But it would be nice to explain in >>>>>> the comment what is being moved and why, so I could then confirm that >>>>>> the code matches the intent. >>>>>> >>>>> >>>>> I'm happy to add comments, but I'm a little unclear on what you're >>>>> confused by here. It's a bit too big of a patch for me to give a full >>>>> play-by-play description. Can you elaborate on what you'd like to see? >>>>> >>>> >>>> I don't need blow-by-blow, but all the behavioural changes should at >>>> least be flagged in the intro, and possibly explained. >>>> The one I particularly noticed is in nfsd_file_close_inode() which >>>> previously used nfsd_file_dispose_list() which hands the final close off >>>> to nfsd_filecache_wq. >>>> But this patch now does the final close in-line so an fsnotify event >>>> might now do the fsync. I was assuming that was deliberate and wanted >>>> it to be explained. But maybe it wasn't deliberate? >>>> >>> >>> Good catch! That wasn't a deliberate change, or at least I missed the >>> subtlety that the earlier code attempted to avoid it. fsnotify callbacks >>> are run under the srcu_read_lock. I don't think we want to run a fsync >>> under that if we can at all help it. >>> >>> What we can probably do is unhash it and dequeue it from the LRU, and >>> then do a refcount_dec_and_test. If that comes back true, we can then >>> queue it to the nfsd_fcache_disposal infrastructure to be closed and >>> freed. I'll have a look at that tomorrow. >>> >> >> Ok, I looked over the notification handling in here again and there is a >> bit of a dilemma: >> >> Neil is correct that we currently just put the reference directly in the >> notification callback, and if we put the last reference at that point >> then we could end up waiting on writeback. > > I expect that for an unlink, that is the common case. > > >> There are two notification callbacks: >> >> 1/ fanotify callback to tell us when the link count on a file has gone >> to 0. >> >> 2/ the setlease_notifier which is called when someone wants to do a >> setlease >> >> ...both are called under the srcu_read_lock(), and they are both fairly >> similar situations. We call different functions for them today, but we >> may be OK to unify them since their needs are similar. >> >> When I originally added these, I made them synchronous because it's best >> if nfsd can clean up and get out the way quickly when either of these >> events happen. At that time though, the code didn't call vfs_fsync at >> all, much less always on the last put. >> >> We have a couple of options: >> >> 1/ just continue to do them synchronously: We can sleep under the >> srcu_read_lock, so we can do both of those synchronously, but blocking >> it for a long period of time could cause latency issues elsewhere. >> >> 2/ move them to the delayed freeing infrastructure. That should be fine >> but we could end doing stuff like silly renaming when someone deletes an >> open file on an NFS reexport. > > Isn't the NFS re-export case handled already by nfsd_file_close_inode_sync() ? > In that case, the fsync / close is done synchronously before the unlink, but > the caller is an nfsd thread, so that should be safe. > > >> Thoughts? What's the best approach here? >> >> Maybe we should just leave them synchronous for now, and plan to address >> this in a later set? > > I need to collect some experimental evidence, but we shouldn't be adding > or removing notification calls with your patch set. That wasn't clear. I meant: as I read it, your current patch set doesn't add or remove notification calls. Sorry for the unnecessary subjunctive. > It ought to be safe > to leave it for a subsequent fix. > > >>>> The movement of flush_delayed_fput() threw me at first, but I think I >>>> understand it now - the new code for close_inode_sync is much cleaner, >>>> not needing dispose_list_sync. >>>> >>> >>> Yep, I think this is cleaner too. >>> >> >> -- >> Jeff Layton <jlayton@xxxxxxxxxx> > > -- > Chuck Lever -- Chuck Lever