On Mon, May 13, 2024 at 05:33:32PM +0100, Al Viro wrote:
> On Mon, May 13, 2024 at 08:58:33AM -0700, Linus Torvalds wrote:
>
> > We *could* strive for a hybrid approach, where we handle the common
> > case ("not a ton of child dentries") differently, and just get rid of
> > them synchronously, and handle the "millions of children" case by
> > unhashing the directory and dealing with shrinking the children async.
>
> try_to_shrink_children()? Doable, and not even that hard to do, but
> as for shrinking async... We can easily move it out of inode_lock
> on parent, but doing that really async would either need to be
> tied into e.g. remount r/o logics or we'd get userland regressions.
>
> I mean, "I have an opened unlinked file, can't remount r/o" is one
> thing, but "I've done rm -rf ~luser, can't remount r/o for a while"
> when all luser's processes had been killed and nothing is holding
> any of that opened... ouch.

There is no ouch for the vast majority of users: XFS has been doing
background async inode unlink processing since 5.14 (i.e. for almost
3 years now). See commit ab23a7768739 ("xfs: per-cpu deferred inode
inactivation queues") for more of the background on this change - it
was implemented because we needed to allow the scrub code to drop
inode references from within transaction contexts, and evict()
processing could run a nested transaction which could then deadlock
the filesystem.

Hence XFS offloads the inode freeing half of the unlink operation
(i.e. the bit that happens in evict() context) to per-cpu workqueues
instead of doing the work directly in evict() context. We allow
evict() to completely tear down the VFS inode context, but don't free
it in ->destroy_inode() because we still have work to do on it. XFS
doesn't need an active VFS inode context to do the internal metadata
updates needed to free an inode, so it's trivial to defer this work
to a background context outside the VFS inode life cycle.

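The split described above can be sketched as a user-space toy model.
This is not the XFS code: every toy_* name is hypothetical, a single
queue stands in for the per-cpu queues, and a synchronous drain stands
in for the kworker. It only illustrates the shape of the idea: the
evict side tears down the VFS-visible state and queues the inode, the
remaining work runs later, and anything that needs the work finished
has to drain the queue first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a filesystem inode; not struct xfs_inode. */
struct toy_inode {
	unsigned long ino;
	bool vfs_torn_down;	/* set by the foreground (evict) half */
	bool freed_on_disk;	/* set by the deferred (background) half */
	struct toy_inode *next;
};

/* One deferred-work queue; assumed here in place of per-cpu queues. */
struct toy_gc_queue {
	struct toy_inode *head;
	size_t pending;
	size_t completed;
};

/* Foreground half: tear down VFS state, queue the rest, return at once. */
void toy_evict(struct toy_gc_queue *q, struct toy_inode *ip)
{
	ip->vfs_torn_down = true;
	ip->next = q->head;
	q->head = ip;
	q->pending++;
}

/* Background half: the work a worker thread would do per queued inode. */
void toy_gc_worker(struct toy_gc_queue *q)
{
	while (q->head) {
		struct toy_inode *ip = q->head;

		q->head = ip->next;
		ip->next = NULL;
		ip->freed_on_disk = true;	/* metadata updates would go here */
		q->pending--;
		q->completed++;
	}
}

/* Barrier: callers that need all deferred work done drain the queue. */
void toy_gc_flush(struct toy_gc_queue *q)
{
	toy_gc_worker(q);
	assert(q->pending == 0 && q->head == NULL);
}
```

In this model, calling toy_evict() on two inodes leaves both with
vfs_torn_down set and freed_on_disk clear until toy_gc_flush() runs,
which is the window in which the real work sits in a background
context.
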
Hence over half the processing work of every unlink() operation on
XFS is now done in kworker threads rather than via the unlink()
syscall context.

Yes, that means freeze, remount r/o and unmount will block in
xfs_inodegc_stop() waiting for these async inode freeing operations
to be flushed and completed.

However, there have been very few reported issues with freeze,
remount r/o or unmount being significantly delayed - there's an
occasional report of an inode with tens of millions of extents to
free delaying an operation, but that's no change from unlink() taking
minutes to run and delaying the operation that way, anyway....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx