Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Oct 02, 2024 at 11:33:17AM GMT, Dave Chinner wrote:
> The management of the sb->s_inodes list is a scalability limitation;
> it is protected by a single lock and every inode that is
> instantiated has to be added to the list. Those inodes then need to
> be removed from the list when evicted from cache. Hence every inode
> that moves through the VFS inode cache must take this global scope
> lock twice.
> 
> This proves to be a significant limiting factor for concurrent file
> access workloads that repeatedly miss the dentry cache on lookup.
> Directory search and traversal workloads are particularly prone to
> these issues, though on XFS we have enough concurrency capability
> in file creation and unlink for the sb->s_inodes list to be a
> limitation there as well.
> 
> Previous efforts to solve this problem have
> largely centered around reworking the sb->s_inodes list into
> something more scalable such as this longstanding patchset does:
> 
> https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@xxxxxxxxxxxxx/
> 
> However, a recent discussion about inode cache behaviour that arose
> from the bcachefs 6.12-rc1 pull request opened a new direction for
> us to explore. With both XFS and bcachefs now providing their own
> per-superblock inode cache implementations, we should try to make
> use of these inode caches as first class citizens.
> 
> With that new direction in mind, it became obvious that XFS could
> elide the sb->s_inodes list completely - "the best part is no part"
> - if iteration was not reliant on open-coded sb->s_inodes list
> walks.
> 
> We already use the internal inode cache for iteration, and we have
> filters for selecting specific inodes to operate on with specific
> callback operations. If we had an abstraction for iterating
> all VFS inodes, we can easily implement that directly on the XFS
> inode cache.
> 
> This is what this patchset aims to implement.
> 
> There are two superblock iterator functions provided. The first is a
> generic iterator that provides safe, reference counted inodes for
> the callback to operate on. This is generally what most sb->s_inodes
> iterators use, and it allows the iterator to drop locks and perform
> blocking operations on the inode before moving to the next inode in
> the sb->s_inodes list.
> 
> There is one quirk to this interface - INO_ITER_REFERENCE - because
> fsnotify iterates the inode cache -after- evict_inodes() has been
> called during superblock shutdown to evict all non-referenced
> inodes. Hence it should only find referenced inodes, and it has
> a check to skip unreferenced inodes. This flag does the same.
> 
> However, I suspect this is now somewhat sub-optimal because LSMs can
> hold references to inodes beyond evict_inodes(), and they don't get
> torn down until after fsnotify evicts the referenced inodes it
> holds. However, the landlock LSM doesn't have checks for
> unreferenced inodes (i.e. doesn't use INO_ITER_REFERENCE), so this
> guard is not consistently applied.
> 
> I'm undecided on how best to handle this, but it does not need to be
> solved for this patchset to work. fsnotify and
> landlock don't need to run -after- evict_inodes(), but moving them
> to before evict_inodes() mean we now do three full inode cache
> iterations to evict all the inodes from the cache. That doesn't seem
> like a good idea when there might be hundreds of millions of cached
> inodes at unmount.
> 
> Similarly, needing the iterator to be aware that there should be no
> unreferenced inodes left when they run doesn't seem like a good
> idea, either. So perhaps the answer is that the iterator checks for
> SB_ACTIVE (or some other similar flag) that indicates the superblock
> is being torn down and so will skip zero-referenced inodes
> automatically in this case. Like I said - this doesn't need to be
> solved right now, it's just something to be aware of.
> 
> The second iterator is the "unsafe" iterator variant that only
> provides the callback with an existence guarantee. It does this by
> holding the rcu_read_lock() to guarantee that the inode is not freed
> from under the callback. There are no validity checks performed on
> the inode - it is entirely up to the callback to validate the inode
> can be operated on safely.
> 
> Hence the "unsafe" variant is only for very specific internal uses
> only. Nobody should be adding new uses of this function without
> as there are very few good reasons for external access to inodes
> without holding a valid reference. I have not decided whether the
> unsafe callbacks should require a lockdep_assert_in_rcu_read_lock()
> check in them to clearly document the context under which they are
> running.
> 
> The patchset converts all the open coded iterators to use these
> new iterator functions, which means the only use of sb->s_inodes
> is now confined to fs/super.c (iterator API) and fs/inode.c
> (add/remove API). A new superblock operation is then added to
> call out from the iterators into the filesystem to allow them to run
> the iteration instead of walking the sb->s_inodes list.
> 
> XFS is then converted to use this new superblock operation. I didn't
> use the existing iterator function for this functionality right now
> as it is currently based on radix tree tag lookups. It also uses a
> batched 'lookup and lock' mechanism that complicated matters as I
> developed this code. Hence I open coded a new, simpler cache walk
> for testing purposes.
> 
> Now that I have stuff working and I think I have the callback API
> semantics settled, batched radix tree lookups should still work to
> minimise the iteration overhead. Also, we might want to tag VFS
> inodes in the radix tree so that we can filter them efficiently for
> traversals. This would allow us to use the existing generic inode
> cache walker rather than a separate variant as this patch set
> implements. This can be done as future work, though.
> 
> In terms of scalability improvements, a quick 'will it scale' test
> demonstrates where the sb->s_inodes list hurts. Running a sharded,
> share-nothing cold cache workload with 100,000 files per thread in
> per-thread directories gives the following results on a 4-node 64p
> machine with 128GB RAM.
> 
> The workloads "walk", "chmod" and "unlink" are all directory
> traversal workloads that stream cold cache inodes into the cache.
> There is enough memory on this test machine that these indoes are
> not being reclaimed during the workload, and are being freed between
> steps via drop_caches (which iterates the inode cache and so
> explicitly tests the new iteration APIs!). Hence the sb->s_inodes
> scalability issues aren't as bad in these tests as when memory is
> tight and inodes are being reclaimed (i.e. the issues are worse in
> real workloads).
> 
> The "bulkstat" workload uses the XFS bulkstat ioctl to iterate
> inodes via walking the internal inode btrees. It uses
> d_mark_dontcache() so it is actually tearing down each inode as soon
> as it has been sampled by the bulkstat code. Hence it is doing two
> sb->s_inodes list manipulations per inode and so shows scalability
> issues much earlier than the other workloads.
> 
> Before:
> 
> Filesystem      Files  Threads  Create       Walk      Chmod      Unlink     Bulkstat
>        xfs     400000     4      4.269      3.225      4.557      7.316      1.306
>        xfs     800000     8      4.844      3.227      4.702      7.905      1.908
>        xfs    1600000    16      6.286      3.296      5.592      8.838      4.392
>        xfs    3200000    32      8.912      5.681      8.505     11.724      7.085
>        xfs    6400000    64     15.344     11.144     14.162     18.604     15.494
> 
> Bulkstat starts to show issues at 8 threads, walk and chmod between
> 16 and 32 threads, and unlink is limited by internal XFS stuff.
> Bulkstat is bottlenecked at about 400-450 thousand inodes/s by the
> sb->s_inodes list management.
> 
> After:
> 
> Filesystem      Files  Threads  Create       Walk      Chmod      Unlink     Bulkstat
>        xfs     400000     4      4.140      3.502      4.154      7.242      1.164
>        xfs     800000     8      4.637      2.836      4.444      7.896      1.093
>        xfs    1600000    16      5.549      3.054      5.213      8.696      1.107
>        xfs    3200000    32      8.387      3.218      6.867     10.668      1.125
>        xfs    6400000    64     14.112      3.953     10.365     18.620      1.270
> 
> When patched, walk shows little in way of scalability degradation
> out to 64 threads, chmod is significantly improved at 32-64 threads,
> and bulkstat shows perfect scalability out to 64 threads now.
> 
> I did a couple of other longer running, higher inode count tests
> with bulkstat to get an idea of inode cache streaming rates - 32
> million inodes scanned in 4.4 seconds at 64 threads. That's about
> 7.2 million inodes/s being streamed through the inode cache with the
> IO rates are peaking well above 5.5GB/s (near IO bound).
> 
> Hence raw VFS inode cache throughput sees a ~17x scalability
> improvement on XFS at 64 threads (and probably a -lot- more on
> higher CPU count machines).  That's far better performance than I
> ever got from the dlist conversion of the sb->s_inodes list in
> previous patchsets, so this seems like a much better direction to be
> heading for optimising the way we cache inodes.
> 
> I haven't done a lot of testing on this patchset yet - it boots and
> appears to work OK for block devices, ext4 and XFS, but checking
> stuff like quota on/off is still working properly on ext4 hasn't
> been done yet.
> 
> What do people think of moving towards per-sb inode caching and
> traversal mechanisms like this?

Patches 1-4 are great cleanups that I would like us to merge even
independent of the rest.

I don't have big conceptual issues with the series otherwise. The only
thing that makes me a bit uneasy is that we are now providing an api
that may encourage filesystems to do their own inode caching even if
they don't really have a need for it just because it's there. So really
a way that would've solved this issue generically would have been my
preference.

But the reality is that xfs has been doing that private inode cache for
a long time and reading through 5/7 and 6/7 it clearly provides value
for xfs. So I find it hard to object to adding ->iter_vfs_inodes()
(Though I would like to s/iter_vfs_inodes/iter_inodes/g).




[Index of Archives]     [XFS Filesystem Development (older mail)]     [Linux Filesystem Development]     [Linux Audio Users]     [Yosemite Trails]     [Linux Kernel]     [Linux RAID]     [Linux SCSI]


  Powered by Linux