On Mon, 2025-01-20 at 09:25 +1100, NeilBrown wrote:
> On Mon, 20 Jan 2025, Dave Chinner wrote:
> > On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> > >
> > > My question to fs-devel is: is anyone willing to convert their fs (or
> > > advise me on converting?) to use the new interface and do some testing
> > > and be open to exploring any bugs that appear?
> >
> > tl;dr: You're asking for people to put in a *lot* of time to convert
> > complex filesystems to concurrent directory modifications without
> > clear indication that it will improve performance. Hence I wouldn't
> > expect widespread enthusiasm to suddenly implement it...
>
> Thanks Dave!
> Your point as detailed below seems to be that, for xfs at least, it may
> be better to reduce hold times for exclusive locks rather than allow
> concurrent locks. That seems entirely credible for a local fs but
> doesn't apply for NFS as we cannot get a success status before the
> operation is complete. So it seems likely that different filesystems
> will want different approaches. No surprise.
>
> There is some evidence that ext4 can be converted to concurrent
> modification without a lot of work, and with measurable benefits. I
> guess I should focus there for local filesystems.
>
> But I don't want to assume what is best for each filesystem, which is
> why I asked for input from developers of the various filesystems.
>
> But even for xfs, I think that providing a successful return from mkdir
> would require waiting for some IO to complete, and that other operations
> might benefit from starting before that IO completes.
> So maybe an xfs implementation of mkdir_shared would be:
>  - take internal exclusive lock on directory
>  - run fast foreground part of mkdir
>  - drop the lock
>  - wait for background stuff, which could affect error return, to
>    complete
>  - return appropriate error, or success
>
> So xfs could clearly use exclusive locking where that is the best
> choice, but not have exclusive locking imposed for the entire operation.
> That is my core goal: don't impose a particular locking style - allow
> the filesystem to manage locking within an umbrella that ensures the
> guarantees that the vfs needs (like no creation inside a directory
> during rmdir).
>
>

I too don't think this approach is necessarily incompatible with XFS,
but we may very well find that once we remove the bottleneck of the
exclusive i_rwsem around directory modifications, the bottleneck just
moves down into the filesystem driver. We have to start somewhere
though!

Also, I should mention that it looks like Al won't be able to attend
LSF/MM this year. We may not want to schedule a time slot for this
after all, as I think his involvement will be key here.

>
> >
> > In more detail....
> >
> > It's not exactly simple to take a directory tree structure that is
> > exclusively locked for modification and make it safe for concurrent
> > updates. It -might- be possible to make the directory updates in XFS
> > more concurrent, but it still has an internal name hash btree index
> > that would have to be completely re-written to support concurrent
> > updates.
> >
> > That's also ignoring all the other bits of the filesystem that will
> > single thread outside the directory. e.g. during create we have to
> > allocate an inode, and locality algorithms will cluster new inodes
> > in the same directory close together. That means they are covered by
> > the same exclusive lock (e.g. the AGI and AGF header blocks in XFS).
> > Unlink has the same problem.
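
As an aside: to make the shape of the mkdir_shared split Neil sketches
above a little more concrete, here is a minimal, purely illustrative
sketch in kernel-style C. It is not real XFS (or VFS) code, and every
example_* name is invented for illustration: example_dir_lock()/unlock()
stand in for a filesystem's internal directory lock,
example_mkdir_fast_path() for the quick foreground work, and the
completion for whatever deferred IO can still change the final error.

#include <linux/fs.h>
#include <linux/completion.h>

struct example_mkdir_ctx {
        struct completion done;         /* completed by background work */
        int error;                      /* final status of that work */
};

/* hypothetical filesystem-internal helpers (declarations only) */
void example_dir_lock(struct inode *dir);
void example_dir_unlock(struct inode *dir);
int example_mkdir_fast_path(struct inode *dir, struct dentry *dentry,
                            umode_t mode, struct example_mkdir_ctx *ctx);

static int example_mkdir_shared(struct inode *dir, struct dentry *dentry,
                                umode_t mode)
{
        struct example_mkdir_ctx ctx;
        int error;

        init_completion(&ctx.done);
        ctx.error = 0;

        /* 1. take the filesystem's own exclusive lock on the directory */
        example_dir_lock(dir);

        /*
         * 2. fast foreground part: create the entry and queue the IO,
         *    arranging for complete(&ctx.done) when that IO finishes.
         */
        error = example_mkdir_fast_path(dir, dentry, mode, &ctx);

        /* 3. drop the lock so other operations on 'dir' can proceed */
        example_dir_unlock(dir);

        if (error)
                return error;

        /*
         * 4. wait, without holding the directory lock, for the background
         *    work whose outcome can still change the error we return.
         */
        wait_for_completion(&ctx.done);

        /* 5. return the final status (success, or e.g. -ENOSPC/-EIO) */
        return ctx.error;
}

The only point of the sketch is that the exclusive section covers step 2
alone, while the potentially slow wait in step 4 happens with the
directory unlocked - which is the property Neil is after.
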
> >
> > IOWs, it's not just directory ops and structures that need locking
> > changes; the way filesystems do inode and block allocation and
> > freeing also needs to change to support increased concurrency in
> > directory operations.
> >
> > Hence I suspect that concurrent directory mods for filesystems like
> > XFS will need a new processing model - possibly a low overhead
> > intent-based modification model using in-memory whiteouts and async
> > background batching of intents. We kinda already do this with unlink
> > - we do the directory removal in the foreground, and defer the rest
> > of the unlink (i.e. inode freeing) to async background worker
> > threads.
> >
> > e.g. doing background batching of namespace ops means things like
> > "rm *" in a directory doesn't need to transactionally modify the
> > directory as it runs. We could track all the inodes we are unlinking
> > via the intents and then simply truncate away the entire directory
> > when it becomes empty and rmdir() is called. We still have to clean
> > up and mark all the inodes free, but that can be done in the
> > background.
> >
> > As such, I suspect that moving XFS to a more async processing model
> > for directory namespace ops to minimise lock hold times will be
> > simpler (and potentially faster!) than rewriting large chunks of the
> > XFS directory and inode management operations to allow for
> > i_rwsem/ILOCK/AGI/AGF locking concurrency...
> >
> > -Dave.
> > --
> > Dave Chinner
> > david@xxxxxxxxxxxxx
> >
>

--
Jeff Layton <jlayton@xxxxxxxxxx>
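
[Editorial note: the foreground/background split Dave describes for
unlink above (namespace change now, inode freeing later) is roughly the
usual deferred-work pattern. The following is a heavily simplified,
hypothetical sketch - not the actual XFS implementation, and every
example_* helper is invented - showing the fast directory update done in
the foreground while the expensive inode cleanup is handed to a
workqueue.]

#include <linux/fs.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct example_unlink_intent {
        struct work_struct work;
        struct inode *inode;            /* inode to clean up later */
};

/* hypothetical filesystem-internal helpers (declarations only) */
int example_remove_dirent(struct inode *dir, struct dentry *dentry);
void example_free_inode(struct inode *inode);

/* slow part, run later from a workqueue: free blocks, mark inode free */
static void example_unlink_background(struct work_struct *work)
{
        struct example_unlink_intent *intent =
                container_of(work, struct example_unlink_intent, work);

        example_free_inode(intent->inode);
        iput(intent->inode);
        kfree(intent);
}

static int example_unlink(struct inode *dir, struct dentry *dentry)
{
        struct example_unlink_intent *intent;
        int error;

        intent = kmalloc(sizeof(*intent), GFP_KERNEL);
        if (!intent)
                return -ENOMEM;

        /* fast foreground part: just remove the name from the directory */
        error = example_remove_dirent(dir, dentry);
        if (error) {
                kfree(intent);
                return error;
        }

        /* defer the rest of the unlink to a background worker */
        intent->inode = d_inode(dentry);
        ihold(intent->inode);
        INIT_WORK(&intent->work, example_unlink_background);
        queue_work(system_long_wq, &intent->work);

        return 0;
}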