Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability

On Wed, Oct 02, 2024 at 09:22:38PM -0400, Kent Overstreet wrote:
> On Thu, Oct 03, 2024 at 09:17:08AM GMT, Dave Chinner wrote:
> > On Wed, Oct 02, 2024 at 04:28:35PM -0400, Kent Overstreet wrote:
> > > On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote:
> > > > On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> > > > >
> > > > > > I don't have big conceptual issues with the series otherwise. The only
> > > > > > thing that makes me a bit uneasy is that we are now providing an api
> > > > > > that may encourage filesystems to do their own inode caching even if
> > > > > > they don't really have a need for it just because it's there.  So really
> > > > > > a way that would've solved this issue generically would have been my
> > > > > > preference.
> > > > >
> > > > > Well, that's the problem, isn't it? :/
> > > > >
> > > > > There really isn't a good generic solution for global list access
> > > > > and management.  The dlist stuff kinda works, but it still has
> > > > > significant overhead and doesn't get rid of spinlock contention
> > > > > completely because of the lack of locality between list add and
> > > > > remove operations.
> > > > 
> > > > I much prefer the approach taken in your patch series, to let the
> > > > filesystem own the inode list and keeping the old model as the
> > > > "default list".
> > > > 
> > > > In many ways, that is how *most* of the VFS layer works - it exposes
> > > > helper functions that the filesystems can use (and most do), but
> > > > doesn't force them.
> > > > 
> > > > Yes, the VFS layer does force some things - you can't avoid using
> > > > dentries, for example, because that's literally how the VFS layer
> > > > deals with filenames (and things like mounting etc). And honestly, the
> > > > VFS layer does a better job of filename caching than any filesystem
> > > > really can do, and with the whole UNIX mount model, filenames
> > > > fundamentally cross filesystem boundaries anyway.
> > > > 
> > > > But clearly the VFS layer inode list handling isn't the best it can
> > > > be, and unless we can fix that in some fundamental way (and I don't
> > > > love the "let's use crazy lists instead of a simple one" models) I do
> > > > think that just letting filesystems do their own thing if they have
> > > > something better is a good model.
> > > 
> > > Well, I don't love adding more indirection and callbacks.
> > 
> > It's way better than open coding inode cache traversals everywhere.
> 
> Eh? You had a nice iterator for dlock-list :)

Which was painful to work with because it maintained the existing
spinlock-based traversal pattern. That was necessary because the
iterator held a spinlock internally. I really didn't like that
aspect of it, because it perpetuated the need to open code the
iget/iput game so the spinlock could be dropped across whatever
operation needed to be performed on each inode.

i.e. adding a dlist iterator didn't clean up any of the other mess
that sb->s_inodes iteration required...
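
For reference, this is roughly the boilerplate that every
sb->s_inodes walker ends up open coding today (a stripped down
sketch along the lines of fs/drop_caches.c, not a verbatim copy):

	struct inode *inode, *toput_inode = NULL;

	spin_lock(&sb->s_inode_list_lock);
	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
		spin_lock(&inode->i_lock);
		if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(&sb->s_inode_list_lock);

		/* do the actual per-inode work without the list lock held */

		/*
		 * Drop the previous inode's reference outside the lock,
		 * and keep one on this inode so it stays on the list
		 * while we pick the iteration back up.
		 */
		iput(toput_inode);
		toput_inode = inode;

		spin_lock(&sb->s_inode_list_lock);
	}
	spin_unlock(&sb->s_inode_list_lock);
	iput(toput_inode);

The dlist iterator only replaces the list walk and the list lock;
all the i_state checks, the __iget()/iput() dance and the lock
juggling stay with the caller.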

> > We simply cannot do things like that without a new iteration model.
> > Abstraction is necessary to facilitate a new iteration model, and a
> > model that provides independent object callbacks allows scope for
> > concurrent processing of individual objects.
> 
> Parallelized iteration is a slick possibility.
> 
> My concern is that we've been trying to get away from callbacks for
> iteration - post spectre they're quite a bit more expensive than
> external iterators, and we've generally been successful with that. 

So everyone keeps saying, but the old adage applies here: Penny
wise, pound foolish.

Optimising away the callbacks might bring us a few percent
performance improvement per operation in a traversal (e.g. via the
dlist iterator mechanisms), but that iteration is still only single
threaded. Hence the maximum processing rate is determined by the
performance of a single CPU core.

However, if we change the API to allow for parallelism at the cost
of a few percent per object operation, then a single CPU core will
not process quite as many objects as before. But the moment we allow
multiple CPU cores to process objects in parallel, we achieve
processing rate improvements measured in integer multiples.

Modern CPUs have concurrency to burn.  Optimising APIs for minimum
per-operation overhead rather than for concurrent processing
implementations is the wrong direction to be taking....
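
To make the API shape concrete - the helper and callback names below
are made up for illustration, not the ones in the patch series - the
key difference is who owns the loop:

	/*
	 * Illustrative sketch only. With a per-object callback the
	 * iterator implementation owns the traversal, so it is free
	 * to shard the cache and run the callback on multiple CPUs
	 * concurrently; the callback just has to be safe to run that
	 * way.
	 */
	static int count_dirty(struct inode *inode, void *priv)
	{
		atomic64_t *ndirty = priv;

		if (inode->i_state & I_DIRTY)
			atomic64_inc(ndirty);
		return 0;
	}

	static u64 sb_count_dirty(struct super_block *sb)
	{
		atomic64_t ndirty = ATOMIC64_INIT(0);

		/*
		 * Hypothetical helper: runs count_dirty() once per
		 * cached inode, potentially on multiple CPUs at once.
		 */
		sb_for_each_cached_inode(sb, count_dirty, &ndirty);
		return atomic64_read(&ndirty);
	}

Whether the traversal actually runs concurrently is then an
implementation detail hidden behind the iterator.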

> > > Converting the standard inode hash table to an rhashtable (or more
> > > likely, creating a new standard implementation and converting
> > > filesystems one at a time) still needs to happen, and then the "use the
> > > hash table for iteration" approach could use that without every
> > > filesystem having to specialize.
> > 
> > Yes, but this still doesn't help filesystems like XFS where the
> > structure of the inode cache is highly optimised for the specific
> > on-disk and in-memory locality of inodes. We aren't going to be
> > converting XFS to a rhashtable based inode cache anytime soon
> > because it simply doesn't provide the functionality we require.
> > e.g. efficient lockless sequential inode number ordered traversal in
> > -every- inode cluster writeback operation.
> 
> I was going to ask what your requirements are - I may take on the
> general purpose inode rhashtable code, although since I'm still pretty
> buried we'll see.
> 
> Coincidentally, just today I'm working on an issue in bcachefs where
> we'd also prefer an ordered data structure to a hash table for the inode
> cache - in online fsck, we need to be able to check if an inode is still
> open, but checking for an inode in an interior snapshot node means we
> have to do a scan and check if any of the open inodes are in a
> descendent subvolume.
> 
> Radix tree doesn't work for us, since our keys are { inum, subvol } - 96
> bits -

Sure it does - you just need two layers of radix trees. i.e. have a
radix tree per subvol to index inodes by inum, and a per-sb radix
tree to index the subvols. With some code to propagate radix tree
tag bits from the inode radix trees up to the subvol radix tree, the
two layers then largely work in conjunction for filtered searches.

This is -exactly- the internal inode cache structure that XFS has.
We have a per-sb radix tree indexing the allocation groups, and a
radix tree per allocation group indexing inodes by inode number.
Hence an inode lookup involves splitting the inum into an agno/agino
pair, doing a perag lookup with the agno, and then doing a perag
inode cache lookup with the agino. All of these radix tree lookups
are lockless...
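
Stripped of the error handling and revalidation games that
xfs_iget() has to play, the cache lookup fast path boils down to
roughly this:

	struct xfs_perag	*pag;
	struct xfs_inode	*ip;
	xfs_agnumber_t		agno = XFS_INO_TO_AGNO(mp, ino);
	xfs_agino_t		agino = XFS_INO_TO_AGINO(mp, ino);

	pag = xfs_perag_get(mp, agno);		/* per-sb radix tree */
	rcu_read_lock();
	ip = radix_tree_lookup(&pag->pag_ici_root, agino); /* per-AG tree */
	/* ... revalidate ip under ip->i_flags_lock before use ... */
	rcu_read_unlock();
	xfs_perag_put(pag);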

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx