On Tue, Jan 21, 2025 at 08:04:46AM -0500, Jeff Layton wrote:
> On Tue, 2025-01-21 at 12:20 +1100, Dave Chinner wrote:
> > On Mon, Jan 20, 2025 at 09:25:37AM +1100, NeilBrown wrote:
> > > On Mon, 20 Jan 2025, Dave Chinner wrote:
> > > > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> > > > > My question to fs-devel is: is anyone willing to convert their
> > > > > fs (or advise me on converting?) to use the new interface and
> > > > > do some testing and be open to exploring any bugs that appear?
> > > >
> > > > tl;dr: You're asking for people to put in a *lot* of time to
> > > > convert complex filesystems to concurrent directory
> > > > modifications without clear indication that it will improve
> > > > performance. Hence I wouldn't expect widespread enthusiasm to
> > > > suddenly implement it...
> > >
> > > Thanks Dave!
> > > Your point as detailed below seems to be that, for xfs at least,
> > > it may be better to reduce hold times for exclusive locks rather
> > > than allow concurrent locks. That seems entirely credible for a
> > > local fs but doesn't apply for NFS, as we cannot get a success
> > > status before the operation is complete.
> >
> > How is that different from a local filesystem? A local filesystem
> > can't return from open(O_CREAT) with a struct file referencing a
> > newly allocated inode until the VFS inode is fully instantiated
> > (or failed), either...
> >
> > i.e. this sounds like you want concurrent share-locked dirent ops
> > so that synchronously processed operations can be issued
> > concurrently.
> >
> > Could the NFS client implement asynchronous directory ops, keeping
> > track of the operations in flight without needing to hold the
> > parent i_rwsem across each individual operation? This is basically
> > what I've been describing for XFS to minimise parent dir lock hold
> > times.
>
> Yes, basically. The protocol and NFS client have no requirement to
> serialize directory operations.
> We'd be happy to spray as many at the server in parallel as we can
> get away with. We currently don't do that today, largely because
> the VFS prohibits it.
>
> The NFS server, or exported filesystem, may have requirements that
> serialize these operations, though.

Sure.

> > What would VFS support for that look like? Is that of similar
> > complexity to implementing shared locking support so that
> > concurrent blocking directory operations can be issued? Is async
> > processing a better model to move the directory ops towards so we
> > can tie userspace directly into it via io_uring?
>
> Given that the VFS requires an exclusive lock today for directory
> morphing ops, moving to a model where we can take a shared lock on
> the directory instead seems like a nice, incremental approach to
> dealing with this problem.

I understand this, but it's not really "incremental" in that it
entrenches the "concurrency is only possible for synchronous
processing models" assumption that shared locking across the entire
operation entails. i.e. we can't hand the shared lock to another
thread to release on completion (e.g. an async processing pipeline)
because lockdep will get upset about that, {down,up}_read_non_owner()
is very much discouraged, and {down,up}_write_non_owner() doesn't
even exist.

> That said, I get your objection. Not being able to upgrade a rwsem
> makes that shared lock kind of nasty for filesystems that actually
> do rely on it for some parts of their work today.

It's more than that - the use of a rwsem for exclusion basically
forces us into a "task that takes the lock must release the lock"
model, and so if the VFS expects the entire operation to be done
under a down_read() context then we must complete the entire
operation before handing the completion back to the original task.

That's why I'm pushing back against a "whole task" centric shared
locking model: it is ultimately unfriendly to efficient async
dispatch and completion of background tasks.
> The usual method of dealing with that would be to create a new
> XFS-only per-inode lock that would take over that serialization.
> The nice thing there is that you could (over time) reduce its
> scope.

As I've already said: we already have an internal per-inode rwsem in
XFS for protecting physical directory operations against races.
Suggesting this as the solution misses the point I was trying to
make: it doesn't actually solve anything, and the better solution
for filesystems like XFS is to decouple the front end VFS
serialisation requirements from the back end filesystem
implementation.

That's kinda my point: this isn't an "XFS-only" problem - it's
something that almost every filesystem we have is going to have
problems with. I very much doubt btrfs will be able to do concurrent
directory mods due to its namespace btree exclusion model, and I
suspect that bcachefs is going to have the same issues, too.

Hence I think shared locking is fundamentally the wrong model here -
the problem we need to address is the excessive latency of
synchronous back end FS ops, not the lack of concurrency in
processing directory modifications. Yes, a lack of back end
filesystem concurrency contributes to the excessive latency of
synchronous directory modification, but that doesn't mean that the
best solution to the problem is to change the concurrency model.

i.e. if we have to make the same mods to every filesystem to do
background async processing to take any sort of advantage of shared
locking, and those background processing mods don't actually require
a shared locking model to realise the performance benefits, then why
add the complexity of a "shared lock for modification" model in the
first place?
[snip stuff about how to decouple VFS/fs serialisation]

> So maybe we'd have something like:
>
> struct mkdir_context {
>         struct mnt_idmap *idmap;        // current args to mkdir op
>         struct inode *dir;
>         struct dentry *dentry;
>         umode_t mode;
>         int status;                     // return status
>         struct completion complete;     // when done -- maybe this
>                                         // would be a completion
>                                         // callback function?
> };
>
> ...and an inode operation like:
>
>         int (*mkdir_begin)(struct mkdir_context *ctx);
>
> Then the fs could just pass a pointer to that around, and when the
> operation is done, complete the variable, which would make the
> original task that called mkdir() finalize the results of the op?

That's still serial/synchronous task processing of the entire
operation, and it does nothing to hide the latency of the operation
from the user. As I've already explained, this is not the model I've
been talking about.

Yes, we'd need a dynamically allocated control structure that
defines the inode instantiation task that needs completing, but
there's no need for completions within it, as the caller can call
wait_on_inode() to wait for the async inode instantiation to
complete. If the instantiation fails, then we mark the inode bad,
stash the errno somewhere in the inode, and the waiter then grabs
the errno and tears down the inode to clean up the mess.

> My worry here would be that dealing with the workqueue and context
> switching would be a performance killer in the simple cases that do
> it all in a single thread today.

Catch-22. The premise behind shared locking is that mkdir operations
block (i.e. context switch multiple times) and so have high latency
and long lock hold times; therefore we need shared locking to be
able to issue lots of them concurrently to get bulk directory
performance. The argument now being made is that adding context
switches to mkdir operations that -don't block- (i.e. the opposite
behaviour to the one shared locking is trying to address) will cause
performance problems.
The "simple cases that do it all in a single thread today" are
blocking and taking multiple context switches on every operation.
e.g. the NFS client submits the op to the server and has to wait for
the reply, so there are two context switches per operation.

This misses the point of doing async background processing: it
removes the blocking from the foreground task until it is absolutely
necessary, hence helping the foreground process *not block* whilst
holding the parent directory lock, and so reducing lock hold times
and increasing the number of ops that can be done under that
exclusive lock.

Using shared locking doesn't change the context switch overhead.
Using async processing doesn't make the context switch overhead
worse. What changes is where those context switches occur and how
much foreground task and lock latency they result in. And in some
cases, async processing drastically reduces context switch overhead
because it allows subsystems to batch process pending operations.

Anyone who has been following io_uring development should know all
these things about async processing already. There's a reason that
infrastructure exists: async processing is more efficient and faster
than the concurrent synchronous processing model being proposed
here....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx