On Mon, 2025-01-20 at 09:25 +1100, NeilBrown wrote:
> On Mon, 20 Jan 2025, Dave Chinner wrote:
> > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> > >
> > > My question to fs-devel is: is anyone willing to convert their fs (or
> > > advise me on converting?) to use the new interface and do some testing
> > > and be open to exploring any bugs that appear?
> >
> > tl;dr: You're asking for people to put in a *lot* of time to convert
> > complex filesystems to concurrent directory modifications without
> > clear indication that it will improve performance. Hence I wouldn't
> > expect widespread enthusiasm to suddenly implement it...
>
> Thanks Dave!
> Your point as detailed below seems to be that, for xfs at least, it may
> be better to reduce hold times for exclusive locks rather than allow
> concurrent locks. That seems entirely credible for a local fs but
> doesn't apply for NFS as we cannot get a success status before the
> operation is complete. So it seems likely that different filesystems
> will want different approaches. No surprise.
>
> There is some evidence that ext4 can be converted to concurrent
> modification without a lot of work, and with measurable benefits. I
> guess I should focus there for local filesystems.
>
> But I don't want to assume what is best for each filesystem, which is
> why I asked for input from developers of the various filesystems.
>
> But even for xfs, I think that providing a successful return from mkdir
> would require waiting for some IO to complete, and that other operations
> might benefit from starting before that IO completes.
> So maybe an xfs implementation of mkdir_shared would be:
>  - take internal exclusive lock on directory
>  - run fast foreground part of mkdir
>  - drop the lock
>  - wait for background stuff, which could affect error return, to
>    complete
>  - return appropriate error, or success
>
> So xfs could clearly use exclusive locking where that is the best
> choice, but not have exclusive locking imposed for the entire operation.
> That is my core goal: don't impose a particular locking style - allow
> the filesystem to manage locking within an umbrella that ensures the
> guarantees that the vfs needs (like no creation inside a directory
> during rmdir).
>
>

I too don't think this approach is necessarily incompatible with XFS,
but we may very well find that once we remove the bottleneck of the
exclusive i_rwsem around directory modifications, the bottleneck just
moves down into the filesystem driver. We have to start somewhere
though!

Also, I should mention that it looks like Al won't be able to attend
LSF/MM this year. We may not want to schedule a time slot for this
after all, as I think his involvement will be key here.

>
> >
> > In more detail....
> >
> > It's not exactly simple to take a directory tree structure that is
> > exclusively locked for modification and make it safe for concurrent
> > updates. It -might- be possible to make the directory updates in XFS
> > more concurrent, but it still has an internal name hash btree index
> > that would have to be completely re-written to support concurrent
> > updates.
> >
> > That's also ignoring all the other bits of the filesystem that will
> > single thread outside the directory. e.g. during create we have to
> > allocate an inode, and locality algorithms will cluster new inodes
> > in the same directory close together. That means they are covered by
> > the same exclusive lock (e.g. the AGI and AGF header blocks in XFS).
> > Unlink has the same problem.
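
As an aside: to make the shape of the mkdir_shared split Neil sketches
above a little more concrete, here is a minimal, purely illustrative
sketch in kernel-style C. It is not real XFS (or VFS) code, and every
example_* name is invented for illustration: example_dir_lock()/unlock()
stand in for a filesystem's internal directory lock,
example_mkdir_fast_path() for the quick foreground work, and the
completion for whatever deferred IO can still change the final error.

#include <linux/fs.h>
#include <linux/completion.h>

struct example_mkdir_ctx {
        struct completion done;         /* completed by background work */
        int error;                      /* final status of that work */
};

/* hypothetical filesystem-internal helpers (declarations only) */
void example_dir_lock(struct inode *dir);
void example_dir_unlock(struct inode *dir);
int example_mkdir_fast_path(struct inode *dir, struct dentry *dentry,
                            umode_t mode, struct example_mkdir_ctx *ctx);

static int example_mkdir_shared(struct inode *dir, struct dentry *dentry,
                                umode_t mode)
{
        struct example_mkdir_ctx ctx;
        int error;

        init_completion(&ctx.done);
        ctx.error = 0;

        /* 1. take the filesystem's own exclusive lock on the directory */
        example_dir_lock(dir);

        /*
         * 2. fast foreground part: create the entry and queue the IO,
         *    arranging for complete(&ctx.done) when that IO finishes.
         */
        error = example_mkdir_fast_path(dir, dentry, mode, &ctx);

        /* 3. drop the lock so other operations on 'dir' can proceed */
        example_dir_unlock(dir);

        if (error)
                return error;

        /*
         * 4. wait, without holding the directory lock, for the background
         *    work whose outcome can still change the error we return.
         */
        wait_for_completion(&ctx.done);

        /* 5. return the final status (success, or e.g. -ENOSPC/-EIO) */
        return ctx.error;
}

The only point of the sketch is that the exclusive section covers step 2
alone, while the potentially slow wait in step 4 happens with the
directory unlocked - which is the property Neil is after.
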
> >
> > IOWs, it's not just directory ops and structures that need locking
> > changes; the way filesystems do inode and block allocation and
> > freeing also needs to change to support increased concurrency in
> > directory operations.
> >
> > Hence I suspect that concurrent directory mods for filesystems like
> > XFS will need a new processing model - possibly a low overhead
> > intent-based modification model using in-memory whiteouts and async
> > background batching of intents. We kinda already do this with unlink
> > - we do the directory removal in the foreground, and defer the rest
> > of the unlink (i.e. inode freeing) to async background worker
> > threads.
> >
> > e.g. doing background batching of namespace ops means things like
> > "rm *" in a directory doesn't need to transactionally modify the
> > directory as it runs. We could track all the inodes we are unlinking
> > via the intents and then simply truncate away the entire directory
> > when it becomes empty and rmdir() is called. We still have to clean
> > up and mark all the inodes free, but that can be done in the
> > background.
> >
> > As such, I suspect that moving XFS to a more async processing model
> > for directory namespace ops to minimise lock hold times will be
> > simpler (and potentially faster!) than rewriting large chunks of the
> > XFS directory and inode management operations to allow for
> > i_rwsem/ILOCK/AGI/AGF locking concurrency...
> >
> > -Dave.
> > --
> > Dave Chinner
> > david@xxxxxxxxxxxxx
> >
>

--
Jeff Layton <jlayton@xxxxxxxxxx>
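
[Editorial note: the foreground/background split Dave describes for
unlink above (namespace change now, inode freeing later) is roughly the
usual deferred-work pattern. The following is a heavily simplified,
hypothetical sketch - not the actual XFS implementation, and every
example_* helper is invented - showing the fast directory update done in
the foreground while the expensive inode cleanup is handed to a
workqueue.]

#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct example_unlink_intent {
        struct work_struct work;
        struct inode *inode;            /* inode to clean up later */
};

/* hypothetical filesystem-internal helpers (declarations only) */
int example_remove_dirent(struct inode *dir, struct dentry *dentry);
void example_free_inode(struct inode *inode);

/* slow part, run later from a workqueue: free blocks, mark inode free */
static void example_unlink_background(struct work_struct *work)
{
        struct example_unlink_intent *intent =
                container_of(work, struct example_unlink_intent, work);

        example_free_inode(intent->inode);
        iput(intent->inode);
        kfree(intent);
}

static int example_unlink(struct inode *dir, struct dentry *dentry)
{
        struct example_unlink_intent *intent;
        int error;

        intent = kmalloc(sizeof(*intent), GFP_KERNEL);
        if (!intent)
                return -ENOMEM;

        /* fast foreground part: just remove the name from the directory */
        error = example_remove_dirent(dir, dentry);
        if (error) {
                kfree(intent);
                return error;
        }

        /* defer the rest of the unlink to a background worker */
        intent->inode = d_inode(dentry);
        ihold(intent->inode);
        INIT_WORK(&intent->work, example_unlink_background);
        queue_work(system_long_wq, &intent->work);

        return 0;
}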