Re: [LSF/MM/BPF TOPIC] allowing parallel directory modifications at the VFS layer

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 20 Jan 2025 08:51:50 +1100

On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> On Sat, 18 Jan 2025, Jeff Layton wrote:
> > We've hit a number of cases in testing recently where the parent's
> > i_rwsem ends up being the bottleneck in heavy parallel create
> > workloads. Currently we have to take the parent's inode->i_rwsem
> > exclusively when altering a directory, which means that any directory-
> > morphing operations in the same directory are serialized.
> > 
> > This is particularly onerous in the ->create codepath, since a
> > filesystem may have to do a number of blocking operations to create a
> > new file (allocate memory, start a transaction, etc.)
> > 
> > Neil recently posted this RFC series, which allows parallel directory
> > modifying operations:
> > 
> >     https://lore.kernel.org/linux-fsdevel/20241220030830.272429-1-neilb@xxxxxxx/
> > 
> > Al pointed out a number of problems in it, but the basic approach seems
> > sound. I'd like to have a discussion at LSF/MM about this.
> > 
> > Are there any problems with the basic approach? Are there other
> > approaches that might be better? Are there incremental steps we could
> > do to pave the way for this to be a reality?
> 
> Thanks for raising this!
> There was at least one problem with the approach but I have a plan to
> address that.  I won't go into detail here.  I hope to get a new
> patch set out sometime in the coming week.
> 
> My question to fs-devel is: is anyone willing to convert their fs (or
> advise me on converting?) to use the new interface and do some testing
> and be open to exploring any bugs that appear?

tl;dr: You're asking for people to put in a *lot* of time to convert
complex filesystems to concurrent directory modifications without
clear indication that it will improve performance. Hence I wouldn't
expect widespread enthusiasm to suddenly implement it...

In more detail:

It's not exactly simple to take a directory tree structure that is
exclusively locked for modification and make it safe for concurrent
updates. It -might- be possible to make the directory updates in XFS
more concurrent, but it still has an internal name hash btree index
that would have to be completely re-written to support concurrent
updates.

That's also ignoring all the other bits of the filesystem that will
single-thread outside the directory. e.g. during create we have to
allocate an inode, and locality algorithms will cluster new inodes
in the same directory close together. That means they are covered by
the same exclusive locks (e.g. the locks on the AGI and AGF header
blocks in XFS). Unlink has the same problem.

IOWs, it's not just directory ops and structures that need locking
changes; the way filesystems do inode and block allocation and
freeing also needs to change to support increased concurrency in
directory operations.

Hence I suspect that concurrent directory mods for filesystems like
XFS will need a new processing model - possibly a low overhead
intent-based modification model using in-memory whiteouts and async
background batching of intents. We kinda already do this with unlink
- we do the directory removal in the foreground, and defer the rest
of the unlink (i.e. inode freeing) to async background worker
threads.
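
To make the intent idea a bit more concrete, here is a rough userspace
sketch in Python. It is a toy model only, not XFS code and not a
proposed design: the class name, the WHITEOUT sentinel and the flush()
method are all made up for illustration. Foreground create/unlink only
record an in-memory intent under a short lock; a background pass then
applies the whole batch to the "committed" directory in one go.

```python
import threading

WHITEOUT = object()  # hypothetical marker: name removed, not yet applied


class IntentDirectory:
    """Toy model of intent-based directory updates (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()  # protects in-memory state only
        self.committed = {}            # name -> inode number ("on disk")
        self.pending = {}              # name -> inode number or WHITEOUT

    def _visible(self, name):
        # Pending intents override the committed directory contents.
        if name in self.pending:
            ino = self.pending[name]
            return None if ino is WHITEOUT else ino
        return self.committed.get(name)

    def lookup(self, name):
        with self._lock:
            return self._visible(name)

    def create(self, name, ino):
        # Foreground work: record an intent, do not touch the directory.
        with self._lock:
            if self._visible(name) is not None:
                raise FileExistsError(name)
            self.pending[name] = ino

    def unlink(self, name):
        # Foreground work: leave an in-memory whiteout for the name.
        with self._lock:
            if self._visible(name) is None:
                raise FileNotFoundError(name)
            self.pending[name] = WHITEOUT

    def flush(self):
        # Background work: apply every pending intent as a single batch.
        # The toy applies it under the lock; a real implementation would
        # have to keep the batch visible to lookups while it is written.
        with self._lock:
            batch, self.pending = self.pending, {}
            for name, ino in batch.items():
                if ino is WHITEOUT:
                    self.committed.pop(name, None)
                else:
                    self.committed[name] = ino
            return len(batch)
```

Note that a create followed by an unlink of the same name before the
flush collapses into a single whiteout, so the committed directory is
never modified for that file at all.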

e.g. doing background batching of namespace ops means things like
"rm *" in a directory doesn't need to transactionally modify the
directory as it runs. We could track all the inodes we are unlinking
via the intents and then simply truncate away the entire directory
when it becomes empty and rmdir() is called. We still have to clean
up and mark all the inodes free, but that can be done in the
background.
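
As a rough illustration of the "rm *" case, the following Python toy
model counts how many times the directory itself gets modified. All
names here (BatchedUnlinkDir, dir_updates, free_inodes_in_background)
are hypothetical and exist only for this sketch; it ignores
crash recovery, readdir cursors and everything else that makes this
hard in a real filesystem.

```python
import errno
from collections import deque


class BatchedUnlinkDir:
    """Toy model: unlinks are tracked as intents, not directory updates."""

    def __init__(self, entries):
        self.entries = dict(entries)  # name -> inode number ("on disk")
        self.unlinked = set()         # in-memory unlink intents
        self.to_free = deque()        # inodes awaiting background freeing
        self.dir_updates = 0          # directory modifications performed

    def listdir(self):
        # Names with a pending unlink intent are hidden from readers.
        return sorted(n for n in self.entries if n not in self.unlinked)

    def unlink(self, name):
        if name not in self.entries or name in self.unlinked:
            raise FileNotFoundError(name)
        # Record the intent only; the directory blocks are not modified.
        self.unlinked.add(name)
        self.to_free.append(self.entries[name])

    def rmdir(self):
        if len(self.unlinked) != len(self.entries):
            raise OSError(errno.ENOTEMPTY, "Directory not empty")
        # Every entry is gone: throw the whole directory away at once.
        self.entries.clear()
        self.unlinked.clear()
        self.dir_updates += 1


def free_inodes_in_background(d):
    # Stand-in for the async worker that marks the inodes free.
    freed = []
    while d.to_free:
        freed.append(d.to_free.popleft())
    return freed
```

With 1000 files in the directory, unlinking all of them and then
calling rmdir() leaves dir_updates at 1 in this model, rather than one
directory modification per file.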

As such, I suspect that moving XFS to a more async processing model
for directory namespace ops to minimise lock hold times will be
simpler (and potentially faster!) than rewriting large chunks of the
XFS directory and inode management operations to allow for
i_rwsem/ILOCK/AGI/AGF locking concurrency...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx