On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> The management of the sb->s_inodes list is a scalability limitation; it is protected by a single lock and every inode that is instantiated has to be added to the list. Those inodes then need to be removed from the list when evicted from cache. Hence every inode that moves through the VFS inode cache must take this global scope lock twice.
>
> This proves to be a significant limiting factor for concurrent file access workloads that repeatedly miss the dentry cache on lookup. Directory search and traversal workloads are particularly prone to these issues, though on XFS we have enough concurrency capability in file creation and unlink for the sb->s_inodes list to be a limitation there as well.
>
> Previous efforts to solve this problem have largely centered around reworking the sb->s_inodes list into something more scalable, such as this longstanding patchset does:
>
> https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@xxxxxxxxxxxxx/
>
> However, a recent discussion about inode cache behaviour that arose from the bcachefs 6.12-rc1 pull request opened a new direction for us to explore. With both XFS and bcachefs now providing their own per-superblock inode cache implementations, we should try to make use of these inode caches as first class citizens.
>
> With that new direction in mind, it became obvious that XFS could elide the sb->s_inodes list completely - "the best part is no part" - if iteration was not reliant on open-coded sb->s_inodes list walks.
>
> We already use the internal inode cache for iteration, and we have filters for selecting specific inodes to operate on with specific callback operations. If we had an abstraction for iterating all VFS inodes, we could easily implement that directly on the XFS inode cache.
>
> This is what this patchset aims to implement.
>
> There are two superblock iterator functions provided. The first is a generic iterator that provides safe, reference counted inodes for the callback to operate on. This is generally what most sb->s_inodes iterators use, and it allows the iterator to drop locks and perform blocking operations on the inode before moving to the next inode in the sb->s_inodes list.
>
> There is one quirk to this interface - INO_ITER_REFERENCE - because fsnotify iterates the inode cache -after- evict_inodes() has been called during superblock shutdown to evict all non-referenced inodes. Hence it should only find referenced inodes, and it has a check to skip unreferenced inodes. This flag does the same.
>
> However, I suspect this is now somewhat sub-optimal because LSMs can hold references to inodes beyond evict_inodes(), and they don't get torn down until after fsnotify evicts the referenced inodes it holds. However, the landlock LSM doesn't have checks for unreferenced inodes (i.e. doesn't use INO_ITER_REFERENCE), so this guard is not consistently applied.
>
> I'm undecided on how best to handle this, but it does not need to be solved for this patchset to work. fsnotify and landlock don't need to run -after- evict_inodes(), but moving them to before evict_inodes() means we would then do three full inode cache iterations to evict all the inodes from the cache. That doesn't seem like a good idea when there might be hundreds of millions of cached inodes at unmount.
>
> Similarly, needing the iterator to be aware that there should be no unreferenced inodes left when it runs doesn't seem like a good idea, either. So perhaps the answer is that the iterator checks for SB_ACTIVE (or some other similar flag) that indicates the superblock is being torn down, and so the iterator will skip zero-referenced inodes automatically in that case. Like I said - this doesn't need to be solved right now, it's just something to be aware of.
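> Purely as an illustrative sketch - not part of this series - such a check inside the safe iterator's walk loop might look something like:
>
>	/*
>	 * Hypothetical: if the superblock is no longer active we are
>	 * in teardown, so any unreferenced inode is about to be
>	 * evicted and is not useful to hand to the callback. Skip it
>	 * here rather than requiring every caller running at
>	 * shutdown time to pass INO_ITER_REFERENCE.
>	 */
>	if (!(sb->s_flags & SB_ACTIVE) &&
>	    !atomic_read(&inode->i_count))
>		continue;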
> The second iterator is the "unsafe" iterator variant that only provides the callback with an existence guarantee. It does this by holding the rcu_read_lock() to guarantee that the inode is not freed from under the callback. There are no validity checks performed on the inode - it is entirely up to the callback to validate that the inode can be operated on safely.
>
> Hence the "unsafe" variant is for very specific internal uses only. Nobody should be adding new uses of this function, as there are very few good reasons for external access to inodes without holding a valid reference. I have not decided whether the unsafe callbacks should require a lockdep_assert_in_rcu_read_lock() check in them to clearly document the context under which they are running.
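> To illustrate the discipline required of callers (the names here are illustrative, not the final API of this series), an unsafe-variant callback runs entirely under RCU and must validate inode state itself before touching anything:
>
>	static int foo_unsafe_cb(struct inode *inode, void *data)
>	{
>		/* Document the required context, as discussed above. */
>		lockdep_assert_in_rcu_read_lock();
>
>		/*
>		 * Only an existence guarantee is provided - the inode
>		 * may be unreferenced, unhashed or part way through
>		 * eviction, so check its state before using it.
>		 */
>		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE))
>			return 0;
>
>		/* ... inspect the inode under RCU only, no blocking ... */
>		return 0;
>	}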
> The patchset converts all the open coded iterators to use these new iterator functions, which means the only use of sb->s_inodes is now confined to fs/super.c (iterator API) and fs/inode.c (add/remove API). A new superblock operation is then added to call out from the iterators into the filesystem to allow them to run the iteration instead of walking the sb->s_inodes list.
>
> XFS is then converted to use this new superblock operation. I didn't use the existing iterator function for this functionality right now as it is currently based on radix tree tag lookups. It also uses a batched 'lookup and lock' mechanism that complicated matters as I developed this code. Hence I open coded a new, simpler cache walk for testing purposes.
>
> Now that I have stuff working and I think I have the callback API semantics settled, batched radix tree lookups should still work to minimise the iteration overhead. Also, we might want to tag VFS inodes in the radix tree so that we can filter them efficiently for traversals. This would allow us to use the existing generic inode cache walker rather than the separate variant this patch set implements. This can be done as future work, though.
>
> In terms of scalability improvements, a quick 'will it scale' test demonstrates where the sb->s_inodes list hurts. Running a sharded, share-nothing cold cache workload with 100,000 files per thread in per-thread directories gives the following results on a 4-node 64p machine with 128GB RAM.
>
> The workloads "walk", "chmod" and "unlink" are all directory traversal workloads that stream cold cache inodes into the cache. There is enough memory on this test machine that these inodes are not being reclaimed during the workload, and are being freed between steps via drop_caches (which iterates the inode cache and so explicitly tests the new iteration APIs!). Hence the sb->s_inodes scalability issues aren't as bad in these tests as when memory is tight and inodes are being reclaimed (i.e. the issues are worse in real workloads).
>
> The "bulkstat" workload uses the XFS bulkstat ioctl to iterate inodes by walking the internal inode btrees. It uses d_mark_dontcache() so it is actually tearing down each inode as soon as it has been sampled by the bulkstat code. Hence it is doing two sb->s_inodes list manipulations per inode and so shows scalability issues much earlier than the other workloads.
>
> Before:
>
> Filesystem    Files  Threads  Create    Walk   Chmod  Unlink  Bulkstat
> xfs          400000        4   4.269   3.225   4.557   7.316     1.306
> xfs          800000        8   4.844   3.227   4.702   7.905     1.908
> xfs         1600000       16   6.286   3.296   5.592   8.838     4.392
> xfs         3200000       32   8.912   5.681   8.505  11.724     7.085
> xfs         6400000       64  15.344  11.144  14.162  18.604    15.494
>
> Bulkstat starts to show issues at 8 threads, walk and chmod between 16 and 32 threads, and unlink is limited by internal XFS stuff. Bulkstat is bottlenecked at about 400-450 thousand inodes/s by the sb->s_inodes list management.
>
> After:
>
> Filesystem    Files  Threads  Create    Walk   Chmod  Unlink  Bulkstat
> xfs          400000        4   4.140   3.502   4.154   7.242     1.164
> xfs          800000        8   4.637   2.836   4.444   7.896     1.093
> xfs         1600000       16   5.549   3.054   5.213   8.696     1.107
> xfs         3200000       32   8.387   3.218   6.867  10.668     1.125
> xfs         6400000       64  14.112   3.953  10.365  18.620     1.270
>
> When patched, walk shows little in the way of scalability degradation out to 64 threads, chmod is significantly improved at 32-64 threads, and bulkstat shows perfect scalability out to 64 threads now.
>
> I did a couple of other longer running, higher inode count tests with bulkstat to get an idea of inode cache streaming rates - 32 million inodes scanned in 4.4 seconds at 64 threads. That's about 7.2 million inodes/s being streamed through the inode cache, with IO rates peaking well above 5.5GB/s (near IO bound).
>
> Hence raw VFS inode cache throughput sees a ~17x scalability improvement on XFS at 64 threads (and probably a -lot- more on higher CPU count machines). That's far better performance than I ever got from the dlist conversion of the sb->s_inodes list in previous patchsets, so this seems like a much better direction to be heading for optimising the way we cache inodes.
>
> I haven't done a lot of testing on this patchset yet - it boots and appears to work OK for block devices, ext4 and XFS, but checking that stuff like quota on/off still works properly on ext4 hasn't been done yet.
>
> What do people think of moving towards per-sb inode caching and traversal mechanisms like this?

Patches 1-4 are great cleanups that I would like us to merge even independently of the rest.

I don't have big conceptual issues with the series otherwise. The only thing that makes me a bit uneasy is that we are now providing an api that may encourage filesystems to do their own inode caching even if they don't really have a need for it, just because it's there. So really, a way that would have solved this issue generically would have been my preference.

But the reality is that xfs has been doing that private inode cache for a long time, and reading through 5/7 and 6/7 it clearly provides value for xfs. So I find it hard to object to adding ->iter_vfs_inodes() (though I would like to s/iter_vfs_inodes/iter_inodes/g).
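To make the rename concrete, this is roughly the shape I have in mind - purely a sketch; the exact callback type and flags (INO_ITER_REFERENCE etc.) are whatever the series settles on:

	/* in struct super_operations */

	/*
	 * Iterate all VFS inodes cached by this superblock, calling
	 * @cb on each one. Filesystems with private inode caches
	 * (xfs, bcachefs) supply this; everyone else falls back to
	 * the generic sb->s_inodes walk.
	 */
	int (*iter_inodes)(struct super_block *sb,
			   int (*cb)(struct inode *inode, void *data),
			   void *data, int flags);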