On Sat, Jan 18, 2025 at 12:06:30PM +1100, NeilBrown wrote:
> On Sat, 18 Jan 2025, Jeff Layton wrote:
> > We've hit a number of cases in testing recently where the parent's
> > i_rwsem ends up being the bottleneck in heavy parallel create
> > workloads. Currently we have to take the parent's inode->i_rwsem
> > exclusively when altering a directory, which means that any directory-
> > morphing operations in the same directory are serialized.
> >
> > This is particularly onerous in the ->create codepath, since a
> > filesystem may have to do a number of blocking operations to create a
> > new file (allocate memory, start a transaction, etc.)
> >
> > Neil recently posted this RFC series, which allows parallel directory
> > modifying operations:
> >
> > https://lore.kernel.org/linux-fsdevel/20241220030830.272429-1-neilb@xxxxxxx/
> >
> > Al pointed out a number of problems in it, but the basic approach seems
> > sound. I'd like to have a discussion at LSF/MM about this.
> >
> > Are there any problems with the basic approach? Are there other
> > approaches that might be better? Are there incremental steps we could
> > do pave the way for this to be a reality?
>
> Thanks for raising this!
> There was at least one problem with the approach but I have a plan to
> address that. I won't go into detail here. I hope to get a new
> patch set out sometime in the coming week.
>
> My question to fs-devel is: is anyone willing to convert their fs (or
> advice me on converting?) to use the new interface and do some testing
> and be open to exploring any bugs that appear?

tl;dr: You're asking for people to put in a *lot* of time to convert
complex filesystems to concurrent directory modifications without
clear indication that it will improve performance. Hence I wouldn't
expect widespread enthusiasm to suddenly implement it...

In more detail....

It's not exactly simple to take a directory tree structure that is
exclusively locked for modification and make it safe for concurrent
updates. It -might- be possible to make the directory updates in XFS
more concurrent, but it still has an internal name hash btree index
that would have to be completely re-written to support concurrent
updates.

That's also ignoring all the other bits of the filesystem that will
single thread outside the directory. e.g. during create we have to
allocate an inode, and locality algorithms will cluster new inodes
in the same directory close together. That means they are covered by
the same exclusive lock (e.g. the AGI and AGF header blocks in XFS).
Unlink has the same problem.

IOWs, it's not just directory ops and structures that need locking
changes; the way filesystems do inode and block allocation and
freeing also needs to change to support increased concurrency in
directory operations.

Hence I suspect that concurrent directory mods for filesystems like
XFS will need a new processing model - possibly a low overhead
intent-based modification model using in-memory whiteouts and async
background batching of intents. We kinda already do this with unlink
- we do the directory removal in the foreground, and defer the rest
of the unlink (i.e. inode freeing) to async background worker
threads.

e.g. doing background batching of namespace ops means things like
"rm *" in a directory doesn't need to transactionally modify the
directory as it runs. We could track all the inodes we are unlinking
via the intents and then simply truncate away the entire directory
when it becomes empty and rmdir() is called. We still have to clean
up and mark all the inodes free, but that can be done in the
background.

As such, I suspect that moving XFS to a more async processing model
for directory namespace ops to minimise lock hold times will be
simpler (and potentially faster!) than rewriting large chunks of the
XFS directory and inode management operations to allow for
i_rwsem/ILOCK/AGI/AGF locking concurrency...

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx