On Fri, Aug 26, 2022 at 4:07 PM NeilBrown <neilb@xxxxxxx> wrote:
>
> As you note, by the end of the series "create" is not more different
> from "rename" than it already is. I only broke up the patches to make
> review more manageable.

Yes, I understand. But I'm saying that maybe a filesystem actually might want to treat them differently.

That said, the really nasty part was that 'wq' thing that meant that different paths had different directory locking not because of low-level filesystem issues, but because of caller issues.

So that's the one I _really_ disliked, and that I don't think should exist even as a partial first step.

The "tie every operation together with one flag" I can live with, in case it turns out that yes, that one flag is all anybody ever really wants.

> Alternate option is to never pass in a wq for create operation, and use
> var_waitqueue() (or something similar) to provide a global shared wait
> queue (which is essentially what I am using to wait for
> DCACHE_PAR_UPDATE to clear).

I _think_ this is what I would prefer.

I say that I _think_ I prefer that, because maybe there are issues with it. But since you basically do that DCACHE_PAR_UPDATE thing anyway, and it's one of the main users of this var_waitqueue, it feels right to me.

But then if it just ends up not working well for some practical reason, at that point maybe I'd just say "I was wrong, I thought it would work, but it's better to spread it out to be a per-thread wait-queue on the stack".

IOW, my preference would be to simply just try it, knowing that you *can* do the "pass explicit wait-queue down" thing if we need to.

Hmm?

> > Instead of it being up to the filesystem to say "I can do parallel
> > creates, but I need to serialize renames", this whole thing has been
> > set up to be about the caller making that decision.
>
> I think that is a misunderstanding. The caller isn't making a decision
> - except the IS_PAR_UPDATE() test which is simply acting on the fs
> request.
> What you are seeing is a misguided attempt to leave in place
> some existing interfaces which assumed exclusive locking and didn't
> provide wqs.

Ok. I still would prefer to have unified locking, not that "do this for one filesystem, do that for another" conditional one.

> > (b) aim for the inode lock being taken *after* the _lookup_hash(),
> > since the VFS layer side has to be able to handle the concurrency on
> > the dcache side anyway
>
> I think you are suggesting that we change ->lookup call to NOT
> require i_rwsem be held.

Yes and no.

One issue for me is that with your change as-is, for 99% of all people who don't happen to use NFS, the inode lock gives all that VFS code mutual exclusion.

Take that lookup_hash_update() function as a practical case: all the *common* filesystems will be running with that function basically 100% serialized per directory, because they'll be doing that

        inode_lock_nested(dir);
        ...
        inode_unlock(dir);

around it all.

At the same time, all that code is supposed to work even *without* the lock, because once it's an IS_PAR_UPDATE() filesystem, there's effectively no locking at all. What exclusive directory locks even remain at that point?

IOW, to me it feels like you are trying to make the code go towards a future with basically no locking at all as far as the VFS layer is concerned (because once all the directory modifications take the inode lock as shared, *all* the inode locking is shared, and is basically a no-op).

BUT you are doing so while not having most people even *test* that situation.

See what I'm trying to say (but possibly expressing very badly)?

So I feel like if the VFS code cannot rely on locking *anyway* in the general case, and should work without it, then we really shouldn't have any locking around any of the VFS operations.

The logical conclusion of that would be to push it all down into the filesystem (possibly with the help of a coccinelle script).
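As a rough illustration of the two locking models under discussion, here is a userspace sketch using POSIX threads; it is not the patch series' code and not a kernel API. It assumes a directory lock modelled as a pthread rwlock that is taken shared when a hypothetical parallel_updates flag (standing in for IS_PAR_UPDATE()) is set and exclusive otherwise, plus a per-name par_update bit (standing in for DCACHE_PAR_UPDATE) whose waiters sleep on a small global table of wait queues hashed by address, in the spirit of the var_waitqueue() suggestion rather than a wait queue passed down by the caller. Every name in it (fake_dentry, fake_dir, bucket_for(), run_updates() and so on) is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#define NTHREADS 8
#define WAIT_TABLE_SIZE 16

/* Hypothetical shared wait-queue table, indexed by object address. */
struct wait_bucket {
	pthread_mutex_t lock;
	pthread_cond_t cond;
};
static struct wait_bucket wait_table[WAIT_TABLE_SIZE];
static pthread_once_t wait_table_once = PTHREAD_ONCE_INIT;
static void wait_table_init(void)
{
	for (int i = 0; i < WAIT_TABLE_SIZE; i++) {
		pthread_mutex_init(&wait_table[i].lock, NULL);
		pthread_cond_init(&wait_table[i].cond, NULL);
	}
}

static struct wait_bucket *bucket_for(const void *addr)
{
	return &wait_table[((uintptr_t)addr >> 4) % WAIT_TABLE_SIZE];
}

/* Hypothetical stand-in for a dentry carrying a "parallel update" bit. */
struct fake_dentry {
	bool par_update;	/* protected by its bucket's mutex */
	int in_progress;	/* updaters inside; must never exceed 1 */
};

static void begin_update(struct fake_dentry *d)
{
	struct wait_bucket *b = bucket_for(d);
	pthread_mutex_lock(&b->lock);
	while (d->par_update)		/* sleep until the bit clears */
		pthread_cond_wait(&b->cond, &b->lock);
	d->par_update = true;
	pthread_mutex_unlock(&b->lock);
}

static void end_update(struct fake_dentry *d)
{
	struct wait_bucket *b = bucket_for(d);
	pthread_mutex_lock(&b->lock);
	d->par_update = false;
	pthread_cond_broadcast(&b->cond);	/* bucket is shared: wake all, each re-checks */
	pthread_mutex_unlock(&b->lock);
}

struct fake_dir {
	pthread_rwlock_t rwsem;
	bool parallel_updates;	/* hypothetical analogue of IS_PAR_UPDATE() */
	struct fake_dentry names[4];
};
struct worker_arg { struct fake_dir *dir; int violations; };

static void *worker(void *p)
{
	struct worker_arg *a = p;
	for (int i = 0; i < 1000; i++) {
		struct fake_dentry *d = &a->dir->names[i % 4];
		if (a->dir->parallel_updates)
			pthread_rwlock_rdlock(&a->dir->rwsem);	/* shared */
		else
			pthread_rwlock_wrlock(&a->dir->rwsem);	/* exclusive */
		begin_update(d);
		if (++d->in_progress > 1)	/* two updaters on one name? */
			a->violations++;
		d->in_progress--;
		end_update(d);
		pthread_rwlock_unlock(&a->dir->rwsem);
	}
	return NULL;
}

/* Returns how often two workers were inside the same name at once. */
static int run_updates(bool parallel)
{
	pthread_t tids[NTHREADS];
	struct worker_arg args[NTHREADS];
	struct fake_dir dir = { .parallel_updates = parallel };
	int total = 0;
	pthread_once(&wait_table_once, wait_table_init);
	pthread_rwlock_init(&dir.rwsem, NULL);
	for (int t = 0; t < NTHREADS; t++) {
		args[t] = (struct worker_arg){ .dir = &dir, .violations = 0 };
		pthread_create(&tids[t], NULL, worker, &args[t]);
	}
	for (int t = 0; t < NTHREADS; t++) {
		pthread_join(tids[t], NULL);
		total += args[t].violations;
	}
	pthread_rwlock_destroy(&dir.rwsem);
	return total;
}
```

Both run_updates(false) and run_updates(true) are expected to return 0. In the exclusive mode, however, the loop in begin_update() never actually sleeps, because the directory lock already serializes every updater; only the shared mode ever exercises the per-name waiting path. That is a small-scale version of the concern above: if most filesystems keep the exclusive lock, the code that is supposed to work without it rarely gets tested.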
Now it doesn't have to go that far - at least not initially - but I do think we should at least make sure that as much as possible of the actual VFS code sees that "new world order" of no directory locking, so that that situation gets *tested* as widely as possible.

> That is not a small change.

Now, that I agree with. I guess we won't get there soon (or ever). But see above for what I dislike about the directory locking model change.

> It might be nice to take a shared lock in VFS, and let the FS upgrade it
> to exclusive if needed, but we don't have upgrade_read() ... maybe it
> would be deadlock-prone.

Yes, upgrading a read lock is fundamentally impossible and will deadlock trivially (think just two readers that both want to do the upgrade - they'll block each other from doing so). So it's not actually a possible operation.

          Linus
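The two-readers upgrade deadlock is easy to reproduce outside the kernel. The sketch below is a userspace analogue using a POSIX rwlock, not the kernel's rw_semaphore: two threads each take the lock shared and then try to take it exclusive while still holding their read lock. It uses pthread_rwlock_trywrlock() instead of a blocking write lock so that it reports the failure rather than hanging; the function and variable names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

/* Two threads stand in for two readers that both want to "upgrade".
 * pthread_rwlock_trywrlock() replaces a blocking write lock so the demo
 * reports the failure instead of hanging forever. */
static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
static pthread_barrier_t both_reading, both_tried;
static int upgrade_result[2];

static void *reader(void *arg)
{
	int id = *(int *)arg;

	pthread_rwlock_rdlock(&rw);
	pthread_barrier_wait(&both_reading);	/* both now hold it shared */
	/* The "upgrade" cannot succeed while the other reader holds rw. */
	upgrade_result[id] = pthread_rwlock_trywrlock(&rw);
	pthread_barrier_wait(&both_tried);	/* nobody drops its read lock early */
	pthread_rwlock_unlock(&rw);
	return NULL;
}

/* Returns how many of the two upgrade attempts failed (POSIX specifies EBUSY). */
static int count_failed_upgrades(void)
{
	pthread_t t[2];
	int ids[2] = { 0, 1 };
	int failed = 0;

	pthread_barrier_init(&both_reading, NULL, 2);
	pthread_barrier_init(&both_tried, NULL, 2);
	for (int i = 0; i < 2; i++)
		pthread_create(&t[i], NULL, reader, &ids[i]);
	for (int i = 0; i < 2; i++)
		pthread_join(t[i], NULL);
	pthread_barrier_destroy(&both_reading);
	pthread_barrier_destroy(&both_tried);
	for (int i = 0; i < 2; i++)
		if (upgrade_result[i] != 0)
			failed++;
	return failed;
}
```

count_failed_upgrades() is expected to return 2: each reader's attempt fails because the other reader still holds the lock shared. With a blocking pthread_rwlock_wrlock() in place of the try variant, both threads would wait on each other indefinitely, which is exactly the two-readers case described above.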