Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer

On Mon, Jan 20, 2025 at 09:25:37AM +1100, NeilBrown wrote:
> On Mon, 20 Jan 2025, Dave Chinner wrote:
> > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> > > 
> > > My question to fs-devel is: is anyone willing to convert their fs (or
> > > advise me on converting?) to use the new interface and do some testing
> > > and be open to exploring any bugs that appear?
> > 
> > tl;dr: You're asking for people to put in a *lot* of time to convert
> > complex filesystems to concurrent directory modifications without
> > clear indication that it will improve performance. Hence I wouldn't
> > expect widespread enthusiasm to suddenly implement it...
> 
> Thanks Dave!
> Your point as detailed below seems to be that, for xfs at least, it may
> be better to reduce hold times for exclusive locks rather than allow
> concurrent locks.  That seems entirely credible for a local fs but
> doesn't apply for NFS as we cannot get a success status before the
> operation is complete.

How is that different from a local filesystem? A local filesystem
can't return from open(O_CREAT) with a struct file referencing a
newly allocated inode until the VFS inode is fully instantiated (or
failed), either...

i.e. this sounds like you want concurrent share-locked dirent ops so
that synchronously processed operations can be issued concurrently.

Could the NFS client implement asynchronous directory ops, keeping
track of the operations in flight without needing to hold the parent
i_rwsem across each individual operation? This is basically what I've
been describing for XFS to minimise parent dir lock hold times.
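
Very roughly, I'd imagine something like the sketch below: a
per-directory list of in-flight ops, so same-name operations
serialise against each other while the parent i_rwsem is not held
across the RPC. Everything here (nfs_dirop, nfs_dir_private,
nfs_dirop_start) is an invented name for illustration, not existing
NFS client code:

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/completion.h>
#include <linux/dcache.h>       /* struct qstr */
#include <linux/string.h>

struct nfs_dirop {
        struct list_head  list;   /* on the parent's in-flight list */
        struct qstr       name;   /* dirent this op touches */
        struct completion done;   /* completed when the reply arrives */
        int               error;  /* final status for any waiter */
};

struct nfs_dir_private {
        spinlock_t        lock;
        struct list_head  dirops; /* in-flight nfs_dirop entries */
};

/*
 * Claim 'name' in 'dir' for a new directory operation. Returns the
 * conflicting op if the same name is already in flight (refcount and
 * lifetime handling elided); the caller then waits on that op's
 * 'done' completion instead of issuing a conflicting RPC.
 */
static struct nfs_dirop *nfs_dirop_start(struct nfs_dir_private *dir,
                                         struct nfs_dirop *op)
{
        struct nfs_dirop *cur;

        init_completion(&op->done);
        spin_lock(&dir->lock);
        list_for_each_entry(cur, &dir->dirops, list) {
                if (cur->name.len == op->name.len &&
                    !memcmp(cur->name.name, op->name.name, op->name.len)) {
                        spin_unlock(&dir->lock);
                        return cur;
                }
        }
        list_add_tail(&op->list, &dir->dirops);
        spin_unlock(&dir->lock);
        return NULL;    /* caller owns the name, issue the RPC */
}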

What would VFS support for that look like? Is that of similar
complexity to implementing shared locking support so that concurrent
blocking directory operations can be issued? Is async processing a
better model to move the directory ops towards so we can tie
userspace directly into it via io_uring?

> So it seems likely that different filesystems
> will want different approaches.  No surprise.
> 
> There is some evidence that ext4 can be converted to concurrent
> modification without a lot of work, and with measurable benefits.  I
> guess I should focus there for local filesystems.
> 
> But I don't want to assume what is best for each file system which is
> why I asked for input from developers of the various filesystems.
> 
> But even for xfs, I think that to provide a successful return from mkdir
> would require waiting for some IO to complete, and that other operations

I don't see where IO enters the picture, to be honest. File creation
does not typically require foreground IO on XFS at all (ignoring
dirsync mode). How do you think we scale XFS to near a million file
creates a second? :)

In the case of mkdir, the caller does not take a direct reference to
the inode being created, so it potentially doesn't even need to wait
for the completion of the operation. i.e. to use the new subdir it
has to be open()ed; that means going through the ->lookup path,
which will block on I_NEW until the background creation is completed...
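
For reference, that's just the existing icache I_NEW protocol. A
sketch of how a deferred creation might drive it is below;
iget_locked(), unlock_new_inode() and iget_failed() are the real
icache helpers, while the foreground/background split around them
(and both function names) is the hypothetical part:

#include <linux/fs.h>
#include <linux/err.h>

/* Foreground: pin the new inode number with I_NEW set. Assumes a
 * freshly allocated ino, so no already-cached inode comes back. */
struct inode *example_create_begin(struct super_block *sb,
                                   unsigned long ino)
{
        struct inode *inode = iget_locked(sb, ino);

        if (!inode)
                return ERR_PTR(-ENOMEM);
        /* I_NEW is now set: concurrent lookups of this inode
         * sleep until it is cleared below. */
        return inode;
}

/* Background: complete (or fail) the creation and wake waiters. */
void example_create_end(struct inode *inode, int error)
{
        if (error) {
                iget_failed(inode);     /* marks bad, clears I_NEW, iput()s */
                return;
        }
        unlock_new_inode(inode);        /* clears I_NEW, wakes waiters */
}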

That said, open(O_CREAT) would need to call wait_on_inode()
somewhere to wait for I_NEW to clear so operations on the inode can
proceed immediately via the persistent struct file reference it
creates.  With the right support, that waiting can be done without
holding the parent directory locked, as any new lookup on that
dirent/inode pair will block until I_NEW is cleared...
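
i.e. something like this ordering in the open path. wait_on_inode()
(which sleeps until I_NEW clears), inode_unlock() and is_bad_inode()
are existing helpers; the function wrapping them is purely
illustrative:

#include <linux/fs.h>
#include <linux/writeback.h>    /* wait_on_inode() */

static int example_open_create_wait(struct inode *dir,
                                    struct inode *inode)
{
        /* parent lock was needed only around the dirent insertion */
        inode_unlock(dir);

        /* wait for instantiation without blocking other ops on dir */
        wait_on_inode(inode);

        if (is_bad_inode(inode))
                return -EIO;    /* background creation failed */
        return 0;
}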

Hence my question above about what VFS support for async dirops
actually looks like, and whether something like this:

> might benefit from starting before that IO completes.
> So maybe an xfs implementation of mkdir_shared would be:
>  - take internal exclusive lock on directory
>  - run fast foreground part of mkdir
>  - drop the lock
>  - wait for background stuff, which could affect error return, to
>   complete
>  - return appropriate error, or success

as natively supported functionality might be a better solution to
the problem....
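
For concreteness, the quoted sequence would have roughly this shape.
There is no mkdir_shared in the tree; every name below is invented
for illustration:

#include <linux/fs.h>

struct xfs_mkdir_ctx;                   /* opaque: tracks the bg work */
void xfs_dir_ilock(struct inode *dir);  /* internal exclusive lock */
void xfs_dir_iunlock(struct inode *dir);
int xfs_mkdir_fast(struct inode *dir, struct dentry *dentry,
                   umode_t mode, struct xfs_mkdir_ctx **ctxp);
int xfs_mkdir_wait(struct xfs_mkdir_ctx *ctx);  /* reap bg completion */

static int xfs_mkdir_shared(struct inode *dir, struct dentry *dentry,
                            umode_t mode)
{
        struct xfs_mkdir_ctx *ctx;
        int error;

        xfs_dir_ilock(dir);             /* exclusive, but only briefly */
        error = xfs_mkdir_fast(dir, dentry, mode, &ctx);
        xfs_dir_iunlock(dir);           /* dropped before any waiting */
        if (error)
                return error;

        /* the background work can still fail, so the final status
         * isn't known until it completes */
        return xfs_mkdir_wait(ctx);
}

Whether that final wait happens inside the syscall, as above, or is
handed back to the caller as an async completion is exactly the
VFS-interface question.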

