On Tue, Jan 21, 2025 at 08:04:46AM -0500, Jeff Layton wrote:
> On Tue, 2025-01-21 at 12:20 +1100, Dave Chinner wrote:
> > On Mon, Jan 20, 2025 at 09:25:37AM +1100, NeilBrown wrote:
> > > On Mon, 20 Jan 2025, Dave Chinner wrote:
> > > > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> > > > > My question to fs-devel is: is anyone willing to convert their
> > > > > fs (or advise me on converting?) to use the new interface and
> > > > > do some testing and be open to exploring any bugs that appear?
> > > >
> > > > tl;dr: You're asking for people to put in a *lot* of time to
> > > > convert complex filesystems to concurrent directory
> > > > modifications without clear indication that it will improve
> > > > performance. Hence I wouldn't expect widespread enthusiasm to
> > > > suddenly implement it...
> > >
> > > Thanks Dave!
> > > Your point as detailed below seems to be that, for xfs at least,
> > > it may be better to reduce hold times for exclusive locks rather
> > > than allow concurrent locks. That seems entirely credible for a
> > > local fs but doesn't apply for NFS, as we cannot get a success
> > > status before the operation is complete.
> >
> > How is that different from a local filesystem? A local filesystem
> > can't return from open(O_CREAT) with a struct file referencing a
> > newly allocated inode until the VFS inode is fully instantiated
> > (or failed), either...
> >
> > i.e. this sounds like you want concurrent share-locked dirent ops
> > so that synchronously processed operations can be issued
> > concurrently.
> >
> > Could the NFS client implement asynchronous directory ops, keeping
> > track of the operations in flight without needing to hold the
> > parent i_rwsem across each individual operation? This is basically
> > what I've been describing for XFS to minimise parent dir lock hold
> > times.
>
> Yes, basically. The protocol and NFS client have no requirement to
> serialize directory operations.
> We'd be happy to spray as many at the server in parallel as we can
> get away with. We currently don't do that today, largely because
> the VFS prohibits it.
>
> The NFS server, or exported filesystem, may have requirements that
> serialize these operations, though.

Sure.

> > What would VFS support for that look like? Is that of similar
> > complexity to implementing shared locking support so that
> > concurrent blocking directory operations can be issued? Is async
> > processing a better model to move the directory ops towards so we
> > can tie userspace directly into it via io_uring?
>
> Given that the VFS requires an exclusive lock today for directory
> morphing ops, moving to a model where we can take a shared lock on
> the directory instead seems like a nice, incremental approach to
> dealing with this problem.

I understand this, but it's not really "incremental" in that it
entrenches the "concurrency is only possible for synchronous
processing models" assumption that shared locking across the entire
operation entails. i.e. we can't hand the shared lock to another
thread to release on completion (e.g. an async processing pipeline)
because lockdep will get upset about that, {down,up}_read_non_owner()
is very much discouraged, and {down,up}_write_non_owner() doesn't
even exist.

> That said, I get your objection. Not being able to upgrade a rwsem
> makes that shared lock kind of nasty for filesystems that actually
> do rely on it for some parts of their work today.

It's more than that - the use of a rwsem for exclusion basically
forces us into a "task that takes the lock must release the lock"
model, and so if the VFS expects the entire operation to be done
under a down_read() context then we must complete the entire
operation before handing the completion back to the original task.

That's why I'm pushing back against a "whole task" centric shared
locking model: it is ultimately unfriendly to efficient async
dispatch and completion of background tasks.
> The usual method of dealing with that would be to create a new
> XFS-only per-inode lock that would take over that serialization.
> The nice thing there is that you could (over time) reduce its
> scope.

As I've already said: we already have an internal per-inode rwsem in
XFS for protecting physical directory operations against races.
Suggesting this as the solution misses the point I was trying to
make: it doesn't actually solve anything, and the better solution
for filesystems like XFS is to decouple the front end VFS
serialisation requirements from the back end filesystem
implementation.

That's kinda my point: this isn't an "XFS-only" problem - it's
something that almost every filesystem we have is going to have
problems with. I very much doubt btrfs will be able to do concurrent
directory mods due to its namespace btree exclusion model, and I
suspect that bcachefs is going to have the same issues, too.

Hence I think shared locking is fundamentally the wrong model here -
the problem we need to address is the excessive latency of
synchronous back end FS ops, not the lack of concurrency in
processing directory modifications. Yes, a lack of back end
filesystem concurrency contributes to the excessive latency of
synchronous directory modification, but that doesn't mean that the
best solution to the problem is to change the concurrency model.

i.e. if we have to make the same mods to every filesystem to do
background async processing to take any sort of advantage of shared
locking, and those background processing mods don't actually require
a shared locking model to realise the performance benefits, then why
add the complexity of a "shared lock for modification" model in the
first place?
[snip stuff about how to decouple VFS/fs serialisation]

> So maybe we'd have something like:
>
> struct mkdir_context {
>         struct mnt_idmap *idmap;        // current args to mkdir op
>         struct inode *dir;
>         struct dentry *dentry;
>         umode_t mode;
>         int status;                     // return status
>         struct completion complete;     // when done -- maybe this
>                                         // would be a completion
>                                         // callback function?
> };
>
> ...and an inode operation like:
>
>         int (*mkdir_begin)(struct mkdir_context *ctx);
>
> Then the fs could just pass a pointer to that around, and when the
> operation is done, complete the variable, which would make the
> original task that called mkdir() finalize the results of the op?

That's still serial/synchronous task processing of the entire
operation, and it does nothing to hide the latency of the operation
from the user. As I've already explained, this is not the model I've
been talking about.

Yes, we'd need a dynamically allocated control structure that
defines the inode instantiation task that needs completing, but
there's no need for completions within it, as the caller can call
wait_on_inode() to wait for the async inode instantiation to
complete. If the instantiation fails, then we mark the inode bad,
stash the errno somewhere in the inode, and the waiter then grabs
the errno and tears down the inode to clean up the mess.

> My worry here would be that dealing with the workqueue and context
> switching would be a performance killer in the simple cases that do
> it all in a single thread today.

Catch-22. The premise behind shared locking is that mkdir operations
block (i.e. context switch multiple times) and so have high latency
and long lock hold times; therefore we need shared locking to be
able to issue lots of them concurrently to get bulk directory
performance. The argument now being made is that adding context
switches to mkdir operations that -don't block- (i.e. the opposite
behaviour to the one shared locking is trying to address) will cause
performance problems.
The "simple cases that do it all in a single thread today" are
blocking and taking multiple context switches on every operation.
e.g. the NFS client submits the op to the server and has to wait for
the reply, so there are two context switches per operation.

This misses the point of doing async background processing: it
removes the blocking from the foreground task until it is absolutely
necessary, hence helping the foreground process *not block* whilst
holding the parent directory lock, and so reducing lock hold times
and increasing the number of ops that can be done under that
exclusive lock.

Using shared locking doesn't change the context switch overhead.
Using async processing doesn't make the context switch overhead
worse. What changes is where those context switches occur and how
much foreground task and lock latency they result in. And in some
cases, async processing drastically reduces context switch overhead
because it allows subsystems to batch process pending operations.

Anyone who has been following io_uring development should know all
these things about async processing already. There's a reason that
infrastructure exists: async processing is more efficient and faster
than the concurrent synchronous processing model being proposed
here....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx