Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer

"NeilBrown" <neilb@xxxxxxx> · Mon, 20 Jan 2025 09:25:37 +1100

On Mon, 20 Jan 2025, Dave Chinner wrote:
> On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> > 
> > My question to fs-devel is: is anyone willing to convert their fs (or
> > advise me on converting?) to use the new interface and do some testing
> > and be open to exploring any bugs that appear?
> 
> tl;dr: You're asking for people to put in a *lot* of time to convert
> complex filesystems to concurrent directory modifications without
> clear indication that it will improve performance. Hence I wouldn't
> expect widespread enthusiasm to suddenly implement it...

Thanks Dave!
Your point as detailed below seems to be that, for xfs at least, it may
be better to reduce hold times for exclusive locks rather than allow
concurrent locks.  That seems entirely credible for a local fs but
doesn't apply for NFS as we cannot get a success status before the
operation is complete.  So it seems likely that different filesystems
will want different approaches.  No surprise.

There is some evidence that ext4 can be converted to concurrent
modification without a lot of work, and with measurable benefits.  I
guess I should focus there for local filesystems.

But I don't want to assume what is best for each filesystem, which is
why I asked for input from developers of the various filesystems.

But even for xfs, I think that to provide a successful return from mkdir
would require waiting for some IO to complete, and that other operations
might benefit from starting before that IO completes.
So maybe an xfs implementation of mkdir_shared would be:
 - take internal exclusive lock on directory
 - run fast foreground part of mkdir
 - drop the lock
 - wait for background stuff, which could affect error return, to
   complete
 - return appropriate error, or success
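
To make that sequence concrete, here is a rough userspace sketch with a
pthread mutex standing in for the filesystem's internal lock.  Every
name in it (toy_dir, toy_mkdir_shared, and so on) is made up for
illustration; it is not xfs code and not the proposed VFS interface,
just the shape of the steps above.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a directory; not a real xfs structure. */
struct toy_dir {
	pthread_mutex_t ilock;	/* fs-internal exclusive lock */
	int nentries;		/* state updated by the foreground part */
	bool io_fails;		/* test knob: pretend the background IO fails */
};

/* Fast foreground part; the caller must hold d->ilock. */
static void toy_mkdir_foreground(struct toy_dir *d)
{
	d->nentries++;
}

/*
 * Stand-in for waiting on IO that can still change the result.
 * Called with d->ilock dropped, so other operations on the
 * directory can start while this one waits.
 */
static int toy_wait_for_background(const struct toy_dir *d)
{
	return d->io_fails ? -EIO : 0;
}

/* Assumed shape of a mkdir_shared style operation; returns 0 or -errno. */
int toy_mkdir_shared(struct toy_dir *d)
{
	assert(d != NULL);

	pthread_mutex_lock(&d->ilock);		/* take internal exclusive lock */
	toy_mkdir_foreground(d);		/* run fast foreground part */
	pthread_mutex_unlock(&d->ilock);	/* drop the lock */

	/* wait for background work, then return error or success */
	return toy_wait_for_background(d);
}
```

The sketch deliberately leaves out what happens to the foreground
change when the background part fails; a real filesystem would have to
decide how to undo it, or hide it from other callers until the result
is known.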

So xfs could clearly use exclusive locking where that is the best
choice, but not have exclusive locking imposed for the entire operation.
That is my core goal: don't impose a particular locking style - allow
the filesystem to manage locking within an umbrella that ensures the
guarantees that the vfs needs (like no creation inside a directory
during rmdir).

Thanks,
NeilBrown

> 
> In more detail....
> 
> It's not exactly simple to take a directory tree structure that is
> exclusively locked for modification and make it safe for concurrent
> updates. It -might- be possible to make the directory updates in XFS
> more concurrent, but it still has an internal name hash btree index
> that would have to be completely re-written to support concurrent
> updates.
> 
> That's also ignoring all the other bits of the filesystem that will
> single thread outside the directory. e.g. during create we have to
> allocate an inode, and locality algorithms will cluster new inodes
> in the same directory close together. That means they are covered by
> the same exclusive lock (e.g. the AGI and AGF header blocks in XFS).
> Unlink has the same problem.
> 
> IOWs, it's not just directory ops and structures that need locking
> changes; the way filesystems do inode and block allocation and
> freeing also needs to change to support increased concurrency in
> directory operations.
> 
> Hence I suspect that concurrent directory mods for filesystems like
> XFS will need a new processing model - possibly a low overhead
> intent-based modification model using in-memory whiteouts and async
> background batching of intents. We kinda already do this with unlink
> - we do the directory removal in the foreground, and defer the rest
> of the unlink (i.e. inode freeing) to async background worker
> threads.
> 
> e.g. doing background batching of namespace ops means things like
> "rm *" in a directory doesn't need to transactionally modify the
> directory as it runs. We could track all the inodes we are unlinking
> via the intents and then simply truncate away the entire directory
> when it becomes empty and rmdir() is called. We still have to clean
> up and mark all the inodes free, but that can be done in the
> background.
> 
> As such, I suspect that moving XFS to a more async processing model
> for directory namespace ops to minimise lock hold times will be
> simpler (and potentially faster!) than rewriting large chunks of the
> XFS directory and inode management operations to allow for
> i_rwsem/ILOCK/AGI/AGF locking concurrency...
> 
> -Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
>