> On Oct 28, 2022, at 5:03 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> 
> On Fri, 2022-10-28 at 20:39 +0000, Chuck Lever III wrote:
>> 
>>> On Oct 28, 2022, at 4:13 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>> 
>>> On Fri, 2022-10-28 at 19:49 +0000, Chuck Lever III wrote:
>>>> 
>>>>> On Oct 28, 2022, at 2:57 PM, Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>>>>> 
>>>>> The filecache refcounting is a bit non-standard for something searchable
>>>>> by RCU, in that we maintain a sentinel reference while it's hashed. This
>>>>> in turn requires that we have to do things differently in the "put"
>>>>> depending on whether it's hashed, which we believe has led to races.
>>>>> 
>>>>> There are other problems in here too. nfsd_file_close_inode_sync can end
>>>>> up freeing an nfsd_file while there are still outstanding references to
>>>>> it, and there are a number of subtle ToC/ToU races.
>>>>> 
>>>>> Rework the code so that the refcount is what drives the lifecycle. When
>>>>> the refcount goes to zero, then unhash and RCU free the object.
>>>>> 
>>>>> With this change, the LRU carries a reference. Take special care to
>>>>> deal with it when removing an entry from the list.
>>>> 
>>>> I can see a way of making this patch a lot cleaner. It looks like there's
>>>> a fair bit of renaming and moving of functions -- that can go in clean-up
>>>> patches before doing the heavy lifting.
>>>> 
>>> 
>>> Is this something that's really needed? I'm already basically rewriting
>>> this code. Reshuffling the old code around first will take a lot of time
>>> and we'll still end up with the same result.
>> 
>> I did exactly this for the nfs4_file rhash changes. It took just a couple
>> of hours. The outcome is that you can see exactly, in the final patch in
>> that series, how the file_hashtbl -> rhltable substitution is done.
>> 
>> Making sure each of the changes is more or less mechanical and obvious
>> is a good way to ensure no one is doing something incorrect.
>> That's why folks like to use Coccinelle.
>> 
>> Trust me, it will be much easier to figure out in a year, when we have
>> new bugs in this code, if we split up this commit just a little.
>> 
> 
> Sigh. It seems pointless to rearrange code that is going to be replaced,
> but I'll do it. It'll probably be next week though.
> 
>> 
>>>> I'm still not sold on the idea of a synchronous flush in nfsd_file_free().
>>> 
>>> I think that we need to call this there to ensure that writeback errors
>>> are handled. I worry that if we try to do this piecemeal, we could end up
>>> missing errors when they fall off the LRU.
>>> 
>>>> That feels like a deadlock waiting to happen and quite difficult to
>>>> reproduce because I/O there is rarely needed. It could help to put a
>>>> might_sleep() in nfsd_file_fsync(), at least, but I would prefer not to
>>>> drive I/O in that path at all.
>>> 
>>> I don't quite grok the potential for a deadlock here. nfsd_file_free
>>> already has to deal with blocking activities due to it effectively doing
>>> a close(). This is just another one. That's why nfsd_file_put has a
>>> might_sleep in it (to warn its callers).
>> 
>> Currently nfsd_file_put() calls nfsd_file_flush(), which calls
>> vfs_fsync(). That can't be called while holding a spinlock.
>> 
> 
> nfsd_file_free (and hence, nfsd_file_put) can never be called with a
> spinlock held. That's true even before this set. Both functions can
> block.

Dead horse: in the current code base, nfsd_file_free() can be called via
nfsd_file_close_inode_sync(), which is an API external to filecache.c.
But I agree now that both functions can block.

>>> What's the deadlock scenario you envision?
>> 
>> OK, filp_close() does call f_op->flush(). So we have this call
>> here and there aren't problems today. I still say this is a
>> problem waiting to occur, but I guess I can live with it.
>> 
>> If filp_close() already calls f_op->flush(), why do we need an
>> explicit vfs_fsync() there?
>> 
> 
> ->flush doesn't do anything on some filesystems, and while it does
> return an error code today, it should have been a void return function.
> The error from it can't be counted on.

OK. The goal is detecting writeback errors, and ->flush is not a
reliable way to do that.

> vfs_fsync is what ensures that everything gets written back and returns
> info about writeback errors. I don't see a way around calling it at
> least once before we close a file, if we want to keep up the "trick" of
> resetting the verifier when we see errors.

Fair enough, but maybe it's not worth the complexity and performance
impact. Writeback errors are supposed to be rare compared to I/O. If we
can find another way (especially, a more reliable way) to monitor for
writeback errors, that might be an improvement over using vfs_fsync,
which seems heavyweight no matter how you cut it.

> IMO, the goal ought to be to ensure that we don't end up having to do
> any writeback when we get to GC'ing it, and that's what patch 4/4 should
> do.

Understood. I don't disagree with the goal, since filp_close() won't be
able to avoid a potentially costly ->flush in some cases. The issue is
that even a SYNC_NONE flush has an appreciable cost.

--
Chuck Lever