On Sat, 27 Aug 2022, Linus Torvalds wrote: > On Thu, Aug 25, 2022 at 7:16 PM NeilBrown <neilb@xxxxxxx> wrote: > > > > If a filesystem supports parallel modification in directories, it sets > > FS_PAR_DIR_UPDATE on the file_system_type. lookup_open() and the new > > lookup_hash_update() notice the flag and take a shared lock on the > > directory, and rely on a lock-bit in d_flags, much like parallel lookup > > relies on DCACHE_PAR_LOOKUP. > > Ugh. Thanks :-) - no, really - thanks for the high-level review! > > I absolutely believe in the DCACHE_PAR_LOOKUP model, and in "parallel > updates" being important, but I *despise* locking code like this > > + if (wq && IS_PAR_UPDATE(dir)) > + inode_lock_shared_nested(dir, I_MUTEX_PARENT); > + else > + inode_lock_nested(dir, I_MUTEX_PARENT); > > and I really really hope there's some better model for this. > > That "wq" test in particular is just absolutely disgusting. So now it > doesn't just depend on whether the directory supports parallel > updates, now the caller can choose whether to do the parallel thing or > not, and that's how "create" is different from "rename". As you note, by the end of the series "create" is not more different from "rename" than it already is. I only broke up the patches to make review more manageable. The "wq" can be removed. There are two options. One is to change every kern_path_create() or user_path_create() caller to passed in a wq. Then we can assume that a wq is always available. There are about a dozen of these calls, so not an enormous change, but one that I didn't want to think about just yet. I could add a patch at the front of the series which did this. Alternate option is to never pass in a wq for create operation, and use var_waitqueue() (or something similar) to provide a global shared wait queue (which is essentially what I am using to wait for DCACHE_PAR_UPDATE to clear). The more I think about it, the more I think this is the best way forward. Maybe we'll want to increase WAIT_TABLE_BITS ... I wonder how to measure how much contention there is on these shared queues. > > And that last difference is, I think, the one that makes me go "No. HELL NO". > > Instead of it being up to the filesystem to say "I can do parallel > creates, but I need to serialize renames", this whole thing has been > set up to be about the caller making that decision. I think that is a misunderstanding. The caller isn't making a decision - except the IS_PAR_UPDATE() test which is simply acting on the fs request. What you are seeing is a misguided attempt to leave in place some existing interfaces which assumed exclusive locking and didn't provide wqs. > > That's just feels horribly horribly wrong. > > Yes, I realize that to you that's just a "later patches will do > renames", but what if it really is a filesystem issue where the > filesystem can easily handle new names, but needs something else for > renames because it has various cross-directory issues, perhaps? Obviously a filesystem can add its own locking - and they will have to, though at a finer grain that the VFS can do. > > So I feel this is fundamentally wrong, and this ugliness is a small > effect of that wrongness. > > I think we should strive for > > (a) make that 'wq' and DCACHE_PAR_LOOKUP bit be unconditional Agreed (in an earlier version DCACHE_PAR_LOOKUP was optional, but I realised that you wouldn't like that :-) > > (b) aim for the inode lock being taken *after* the _lookup_hash(), > since the VFS layer side has to be able to handle the concurrency on > the dcache side anyway I think you are suggesting that we change ->lookup call to NOT require i_rwsem be held. That is not a small change. I agree that it makes sense in the long term. Getting there .... won't be a quick as I'd hoped. > > (c) at that point, the serialization really ends up being about the > call into the filesystem, and aim to simply move the > inode_lock{_shared]_nested() into the filesystem so that there's no > need for a flag and related conditional locking at all. It might be nice to take a shared lock in VFS, and let the FS upgrade it to exclusive if needed, but we don't have upgrade_read() ... maybe it would be deadlock-prone. > > Because right now I think the main reason we cannot move the lock into > the filesystem is literally that we've made the lock cover not just > the filesystem part, but the "lookup and create dentry" part too. > > But once you have that "DCACHE_PAR_LOOKUP" bit and the > d_alloc_parallel() logic to serialize a _particular_ dentry being > created (as opposed to serializing all the sleeping ops to that > directly), I really think we should strive to move the locking - that > no longer helps the VFS dcache layer - closer to the filesystem call > and eventually into it. > > Please? I think these patches are "mostly going in the right > direction", but at the same time I feel like there's some serious > mis-design going on. Hmmm.... I'll dig more deeply into ->lookup and see if I can understand the locking well enough to feel safe removing i_rwsem from it. Thanks, NeilBrown