On Tue, 2025-01-21 at 12:20 +1100, Dave Chinner wrote:
> On Mon, Jan 20, 2025 at 09:25:37AM +1100, NeilBrown wrote:
> > On Mon, 20 Jan 2025, Dave Chinner wrote:
> > > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> > > >
> > > > My question to fs-devel is: is anyone willing to convert their
> > > > fs (or advise me on converting?) to use the new interface and
> > > > do some testing and be open to exploring any bugs that appear?
> > >
> > > tl;dr: You're asking for people to put in a *lot* of time to
> > > convert complex filesystems to concurrent directory
> > > modifications without a clear indication that it will improve
> > > performance. Hence I wouldn't expect widespread enthusiasm to
> > > suddenly implement it...
> >
> > Thanks Dave!
> > Your point, as detailed below, seems to be that, for xfs at least,
> > it may be better to reduce hold times for exclusive locks rather
> > than allow concurrent locks. That seems entirely credible for a
> > local fs, but it doesn't apply to NFS, as we cannot get a success
> > status before the operation is complete.
>
> How is that different from a local filesystem? A local filesystem
> can't return from open(O_CREAT) with a struct file referencing a
> newly allocated inode until the VFS inode is fully instantiated (or
> failed), either...
>
> i.e. this sounds like you want concurrent share-locked dirent ops so
> that synchronously processed operations can be issued concurrently.
>
> Could the NFS client implement asynchronous directory ops, keeping
> track of the operations in flight without needing to hold the parent
> i_rwsem across each individual operation? This is basically what
> I've been describing for XFS to minimise parent dir lock hold times.
>

Yes, basically. The protocol and the NFS client have no requirement to
serialize directory operations. We'd be happy to spray as many at the
server in parallel as we can get away with. We don't do that today,
largely because the VFS prohibits it. The NFS server, or the exported
filesystem, may have requirements that serialize these operations,
though.

> What would VFS support for that look like? Is that of similar
> complexity to implementing shared locking support so that concurrent
> blocking directory operations can be issued? Is async processing a
> better model to move the directory ops towards, so we can tie
> userspace directly into it via io_uring?
>

Given that the VFS requires an exclusive lock today for directory
morphing ops, moving to a model where we can take a shared lock on the
directory instead seems like a nice, incremental approach to dealing
with this problem.

That said, I get your objection. Not being able to upgrade a rwsem
makes that shared lock kind of nasty for filesystems that actually do
rely on it for some parts of their work today. The usual method of
dealing with that would be to create a new XFS-only per-inode lock
that would take over that serialization (rough sketch below). The nice
thing there is that you could (over time) reduce its scope.

> > So it seems likely that different filesystems will want different
> > approaches. No surprise.
> >
> > There is some evidence that ext4 can be converted to concurrent
> > modification without a lot of work, and with measurable benefits.
> > I guess I should focus there for local filesystems.
> >
> > But I don't want to assume what is best for each filesystem, which
> > is why I asked for input from the developers of the various
> > filesystems.
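To make that per-inode lock idea concrete, here's a rough sketch of
what a directory op could look like if the VFS took the parent's
i_rwsem shared and the filesystem provided its own serialization. This
is purely illustrative: the ->mkdir_shared name comes from Neil's
proposal, while MYFS_I(), dir_mutex and myfs_do_mkdir() are made-up
placeholders:

	/* Hypothetical: VFS holds inode_lock_shared(dir) around this. */
	static int myfs_mkdir_shared(struct mnt_idmap *idmap,
				     struct inode *dir,
				     struct dentry *dentry, umode_t mode)
	{
		struct myfs_inode *mdir = MYFS_I(dir);
		int error;

		/*
		 * New fs-private lock takes over the serialization that
		 * the exclusive i_rwsem used to provide; its scope can
		 * then be narrowed over time.
		 */
		mutex_lock(&mdir->dir_mutex);
		error = myfs_do_mkdir(idmap, dir, dentry, mode);
		mutex_unlock(&mdir->dir_mutex);
		return error;
	}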
> > But even for xfs, I think that to provide a successful return from
> > mkdir would require waiting for some IO to complete, and that
> > other operations
>
> I don't see where IO enters the picture, to be honest. File creation
> does not typically require foreground IO on XFS at all (ignoring
> dirsync mode). How do you think we scale XFS to near a million file
> creates a second? :)
>
> In the case of mkdir, it does not take a direct reference to the
> inode being created, so it potentially doesn't even need to wait for
> the completion of the operation. i.e. to use the new subdir it has
> to be open()ed; that means going through the ->lookup path, which
> will block on I_NEW until the background creation is completed...
>
> That said, open(O_CREAT) would need to call wait_on_inode()
> somewhere to wait for I_NEW to clear so that operations on the inode
> can proceed immediately via the persistent struct file reference it
> creates. With the right support, that waiting can be done without
> holding the parent directory locked, as any new lookup on that
> dirent/inode pair will block until I_NEW is cleared...
>
> Hence my question above about what VFS support for async dirops
> actually looks like, and whether something like this:
>
> > might benefit from starting before that IO completes.
> > So maybe an xfs implementation of mkdir_shared would be:
> > - take an internal exclusive lock on the directory
> > - run the fast foreground part of mkdir
> > - drop the lock
> > - wait for the background stuff, which could affect the error
> >   return, to complete
> > - return the appropriate error, or success
>
> as natively supported functionality might be a better solution to
> the problem....
>
> From the XFS perspective, we already have internal exclusive locking
> for dirent mods, but that is needed for serialising the physical
> directory mods against other (shared access) readers (e.g. lookup
> and readdir). We would not want to share such an internal lock
> across the unbounded fast path concurrency of lookup, create,
> unlink, readdir -and- multiple background processing threads.
>
> Logging a modification intent against an inode wouldn't require a
> new internal inode lock; AFAICT all it requires is exclusivity
> against lookup, so that lookup can find new/unlinked dirents that
> have been logged but not yet added to or removed from the physical
> directory blocks.
>
> We can do that by inserting the VFS inode into the cache in the
> foreground operation and leaving I_NEW set on create. We can do a
> similar thing with unlink (I_WILL_FREE?) to make the VFS inode
> invisible to new lookups before the unlink has actually been
> processed. In this way, we can push the actual processing into
> background workqueues without actually changing how the namespace
> behaves from a user perspective...
>
> We -might- be able to do all these operations under a shared VFS
> lock, but it is not necessary to have a shared VFS lock to enable
> such an async processing model. If the performance gains for the NFS
> client come from allowing concurrent processing of individual
> synchronous operations, then we really need to consider whether
> there are other methods of hiding synchronous operation latency that
> might be more applicable to more filesystems and to high performance
> user interfaces...
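If I'm following, the create side of the scheme you're describing
would look roughly like the sketch below. This is just my reading of
it, not code from any posted patch, and the myfs_* names, the work
item and the workqueue are all invented:

	struct myfs_mkdir_work {
		struct work_struct	work;
		struct inode		*dir;
		struct inode		*inode;	/* new dir, still I_NEW */
	};

	static void myfs_mkdir_worker(struct work_struct *work)
	{
		struct myfs_mkdir_work *mw =
			container_of(work, struct myfs_mkdir_work, work);

		/* The slow part: the physical directory modifications. */
		myfs_finish_mkdir(mw->dir, mw->inode);

		/* Clears I_NEW and wakes anyone in wait_on_inode(). */
		unlock_new_inode(mw->inode);
		kfree(mw);
	}

	/*
	 * Foreground part of mkdir: log the intent, hash the new inode
	 * with I_NEW set so that concurrent lookups block on it, then
	 * defer the rest to the workqueue and return.
	 */
	inode->i_ino = myfs_new_ino(dir);	/* made-up allocator */
	error = insert_inode_locked(inode);	/* hashes it, sets I_NEW */
	if (!error) {
		INIT_WORK(&mw->work, myfs_mkdir_worker);
		queue_work(myfs_wq, &mw->work);
	}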
So maybe we'd have something like:

struct mkdir_context {
	struct mnt_idmap	*idmap;		// current args to mkdir op
	struct inode		*dir;
	struct dentry		*dentry;
	umode_t			mode;

	int			status;		// return status
	struct completion	complete;	// when done -- maybe this
						// would be a completion
						// callback function?
};

...and an inode operation like:

	int (*mkdir_begin)(struct mkdir_context *ctx);

Then the fs could just pass a pointer to that around and, when the
operation is done, complete the completion, which would make the
original task that called mkdir() finalize the results of the op.

My worry here would be that dealing with the workqueue and the context
switching it implies would be a performance killer in the simple cases
that do it all in a single thread today.
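FWIW, the caller's side of that could stay simple, and a filesystem
that wants to remain synchronous could do all the work and call
complete() directly from ->mkdir_begin, so the wait below would never
sleep in those single-threaded cases. A hypothetical sketch, to go
with the struct above:

	static int vfs_mkdir_async(struct mkdir_context *ctx)
	{
		int error;

		init_completion(&ctx->complete);

		/* Fs either completes inline or queues background work. */
		error = ctx->dir->i_op->mkdir_begin(ctx);
		if (error)
			return error;

		/* Returns immediately if already completed inline. */
		wait_for_completion(&ctx->complete);
		return ctx->status;
	}

-- 
Jeff Layton <jlayton@xxxxxxxxxx>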