The management of the sb->s_inodes list is a scalability limitation; it is protected by a single lock and every inode that is instantiated has to be added to the list. Those inodes then need to be removed from the list when evicted from the cache. Hence every inode that moves through the VFS inode cache must take this global-scope lock twice.

This proves to be a significant limiting factor for concurrent file access workloads that repeatedly miss the dentry cache on lookup. Directory search and traversal workloads are particularly prone to these issues, though on XFS we have enough concurrency capability in file creation and unlink for the sb->s_inodes list to be a limitation there as well.

Previous efforts to solve this problem have largely centered around reworking the sb->s_inodes list into something more scalable, such as this longstanding patchset does:

https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@xxxxxxxxxxxxx/

However, a recent discussion about inode cache behaviour that arose from the bcachefs 6.12-rc1 pull request opened a new direction for us to explore. With both XFS and bcachefs now providing their own per-superblock inode cache implementations, we should try to make use of these inode caches as first class citizens.

With that new direction in mind, it became obvious that XFS could elide the sb->s_inodes list completely - "the best part is no part" - if iteration was not reliant on open-coded sb->s_inodes list walks. We already use the internal inode cache for iteration, and we have filters for selecting specific inodes to operate on with specific callback operations. If we had an abstraction for iterating all VFS inodes, we could easily implement it directly on the XFS inode cache. That is what this patchset aims to implement.

There are two superblock inode iterator functions provided. The first is a generic iterator that provides safe, reference-counted inodes for the callback to operate on. This is generally what most sb->s_inodes iterators use, and it allows the iterator to drop locks and perform blocking operations on the inode before moving to the next inode in the sb->s_inodes list.

There is one quirk to this interface - INO_ITER_REFERENCE - because fsnotify iterates the inode cache -after- evict_inodes() has been called during superblock shutdown to evict all non-referenced inodes. Hence it should only find referenced inodes, and the existing code has a check to skip unreferenced inodes; the INO_ITER_REFERENCE flag makes the iterator perform the same check.

However, I suspect this is now somewhat sub-optimal, because LSMs can hold references to inodes beyond evict_inodes(), and those don't get torn down until after fsnotify evicts the referenced inodes it holds. Moreover, the Landlock LSM doesn't have checks for unreferenced inodes (i.e. doesn't use INO_ITER_REFERENCE), so this guard is not consistently applied.

I'm undecided on how best to handle this, but it does not need to be solved for this patchset to work. fsnotify and Landlock don't need to run -after- evict_inodes(), but moving them to before evict_inodes() means we now do three full inode cache iterations to evict all the inodes from the cache. That doesn't seem like a good idea when there might be hundreds of millions of cached inodes at unmount. Similarly, requiring the iterators to be aware that there should be no unreferenced inodes left when they run doesn't seem like a good idea, either.

So perhaps the answer is that the iterator checks for SB_ACTIVE (or some other similar flag) that indicates the superblock is being torn down, and so skips zero-referenced inodes automatically in that case. Like I said - this doesn't need to be solved right now; it's just something to be aware of.
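For reference, the generic variant is essentially the same pattern the existing open-coded sb->s_inodes walks (e.g. drop_pagecache_sb()) already use, just put behind a callback API. A rough sketch of the shape of it - the function and callback names here are illustrative, not necessarily what the patches end up using:

/*
 * Rough sketch only: the shape of the reference-counted iterator.  It
 * mirrors the pattern the existing open-coded sb->s_inodes walks use
 * (e.g. drop_pagecache_sb()).  __iget() is fs-internal, so this lives
 * in fs/super.c.  Names and flag values are illustrative.
 */
#include <linux/fs.h>
#include <linux/sched.h>	/* cond_resched() */
#include "internal.h"		/* __iget() */

typedef int (*ino_iter_fn)(struct inode *inode, void *private);

#define INO_ITER_REFERENCE	(1U << 0)	/* skip unreferenced inodes */

int super_iter_inodes_generic(struct super_block *sb, ino_iter_fn fn,
			      void *private, unsigned int flags)
{
	struct inode *inode, *old_inode = NULL;
	int ret = 0;

	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		if ((flags & INO_ITER_REFERENCE) &&
		    !atomic_read(&inode->i_count)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(&sb->s_inode_list_lock);

		/* The callback can block and take locks here. */
		ret = fn(inode, private);

		/*
		 * Defer the iput() of the current inode until we have moved
		 * to the next one, so it stays on the list and the list
		 * cursor remains valid.
		 */
		iput(old_inode);
		old_inode = inode;
		if (ret)
			goto out;

		cond_resched();
		spin_lock(&sb->s_inode_list_lock);
	}
	spin_unlock(&sb->s_inode_list_lock);
out:
	iput(old_inode);
	return ret;
}

The deferred iput() is the usual trick: the current inode stays referenced (and hence on the list) until we have picked up the next one, so the list cursor remains valid across a blocking callback.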
The second iterator is the "unsafe" iterator variant that only provides the callback with an existence guarantee. It does this by holding the rcu_read_lock() to guarantee that the inode is not freed from under the callback. No validity checks are performed on the inode - it is entirely up to the callback to validate that the inode can be operated on safely. Hence the "unsafe" variant is for very specific internal uses only. Nobody should be adding new uses of this function, as there are very few good reasons for external access to inodes without holding a valid reference. I have not decided whether the unsafe callbacks should require a lockdep_assert_in_rcu_read_lock() check in them to clearly document the context under which they are running.

The patchset converts all the open-coded iterators to use these new iterator functions, which means the only use of sb->s_inodes is now confined to fs/super.c (iterator API) and fs/inode.c (add/remove API). A new superblock operation is then added to call out from the iterators into the filesystem, allowing it to run the iteration instead of the VFS walking the sb->s_inodes list (sketched below).

XFS is then converted to use this new superblock operation. I didn't use the existing XFS inode cache iterator for this right now, as it is currently based on radix tree tag lookups. It also uses a batched 'lookup and lock' mechanism that complicated matters as I developed this code. Hence I open coded a new, simpler cache walk for testing purposes. Now that I have things working and I think I have the callback API semantics settled, batched radix tree lookups should still work to minimise the iteration overhead.

Also, we might want to tag VFS inodes in the radix tree so that we can filter them efficiently for traversals. This would allow us to use the existing generic inode cache walker rather than the separate variant this patchset implements. That can be done as future work, though.
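To give an idea of the shape of that callout - the method name below is made up, but the intent matches the description above: if the filesystem provides its own inode cache iterator, the VFS hands the walk off to it, otherwise it falls back to the generic sb->s_inodes walk sketched earlier:

/*
 * Illustrative only: the method name is made up, but the idea is that
 * a filesystem with its own inode cache (XFS, bcachefs) can provide
 * the iteration itself instead of the VFS walking sb->s_inodes.
 *
 * Added to struct super_operations:
 *
 *	int (*iter_vfs_inodes)(struct super_block *sb, ino_iter_fn fn,
 *			       void *private, unsigned int flags);
 */
int super_iter_inodes(struct super_block *sb, ino_iter_fn fn,
		      void *private, unsigned int flags)
{
	/* Filesystem-provided iteration, e.g. XFS walking its own icache. */
	if (sb->s_op->iter_vfs_inodes)
		return sb->s_op->iter_vfs_inodes(sb, fn, private, flags);

	/* Otherwise fall back to the generic sb->s_inodes walk above. */
	return super_iter_inodes_generic(sb, fn, private, flags);
}

That hook is what would let XFS elide the sb->s_inodes list management completely, as described at the start of this mail.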
In terms of scalability improvements, a quick 'will it scale' test demonstrates where the sb->s_inodes list hurts. Running a sharded, share-nothing cold cache workload with 100,000 files per thread in per-thread directories gives the following results on a 4-node 64p machine with 128GB RAM.

The workloads "walk", "chmod" and "unlink" are all directory traversal workloads that stream cold cache inodes into the cache. There is enough memory on this test machine that these inodes are not being reclaimed during the workload; they are freed between steps via drop_caches (which iterates the inode cache and so explicitly tests the new iteration APIs!). Hence the sb->s_inodes scalability issues aren't as bad in these tests as when memory is tight and inodes are being reclaimed (i.e. the issues are worse in real workloads).

The "bulkstat" workload uses the XFS bulkstat ioctl to iterate inodes by walking the internal inode btrees. It uses d_mark_dontcache() so it is actually tearing down each inode as soon as it has been sampled by the bulkstat code. Hence it is doing two sb->s_inodes list manipulations per inode and so shows scalability issues much earlier than the other workloads.

Before (times in seconds):

Filesystem      Files  Threads  Create    Walk   Chmod  Unlink  Bulkstat
xfs            400000        4   4.269   3.225   4.557   7.316     1.306
xfs            800000        8   4.844   3.227   4.702   7.905     1.908
xfs           1600000       16   6.286   3.296   5.592   8.838     4.392
xfs           3200000       32   8.912   5.681   8.505  11.724     7.085
xfs           6400000       64  15.344  11.144  14.162  18.604    15.494

Bulkstat starts to show issues at 8 threads, walk and chmod between 16 and 32 threads, and unlink is limited by internal XFS stuff. Bulkstat is bottlenecked at about 400-450 thousand inodes/s by the sb->s_inodes list management.

After (times in seconds):

Filesystem      Files  Threads  Create    Walk   Chmod  Unlink  Bulkstat
xfs            400000        4   4.140   3.502   4.154   7.242     1.164
xfs            800000        8   4.637   2.836   4.444   7.896     1.093
xfs           1600000       16   5.549   3.054   5.213   8.696     1.107
xfs           3200000       32   8.387   3.218   6.867  10.668     1.125
xfs           6400000       64  14.112   3.953  10.365  18.620     1.270

When patched, walk shows little in the way of scalability degradation out to 64 threads, chmod is significantly improved at 32-64 threads, and bulkstat now shows perfect scalability out to 64 threads.

I did a couple of other longer running, higher inode count tests with bulkstat to get an idea of inode cache streaming rates - 32 million inodes scanned in 4.4 seconds at 64 threads. That's about 7.2 million inodes/s being streamed through the inode cache, with IO rates peaking well above 5.5GB/s (near IO bound).

Hence raw VFS inode cache throughput sees a ~17x scalability improvement on XFS at 64 threads (and probably a -lot- more on higher CPU count machines). That's far better performance than I ever got from the dlist conversion of the sb->s_inodes list in previous patchsets, so this seems like a much better direction to be heading for optimising the way we cache inodes.

I haven't done a lot of testing on this patchset yet - it boots and appears to work OK for block devices, ext4 and XFS, but checking that stuff like quota on/off still works properly on ext4 hasn't been done yet.

What do people think of moving towards per-sb inode caching and traversal mechanisms like this?

-Dave.