Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Wed, 2 Oct 2024 16:28:35 -0400

On Wed, Oct 02, 2024 at 12:49:13PM GMT, Linus Torvalds wrote:
> On Wed, 2 Oct 2024 at 05:35, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > On Wed, Oct 02, 2024 at 12:00:01PM +0200, Christian Brauner wrote:
> >
> > > I don't have big conceptual issues with the series otherwise. The only
> > > thing that makes me a bit uneasy is that we are now providing an api
> > > that may encourage filesystems to do their own inode caching even if
> > > they don't really have a need for it just because it's there.  So really
> > > a way that would've solved this issue generically would have been my
> > > preference.
> >
> > Well, that's the problem, isn't it? :/
> >
> > There really isn't a good generic solution for global list access
> > and management.  The dlist stuff kinda works, but it still has
> > significant overhead and doesn't get rid of spinlock contention
> > completely because of the lack of locality between list add and
> > remove operations.
> 
> I much prefer the approach taken in your patch series, to let the
> filesystem own the inode list and keeping the old model as the
> "default list".
> 
> In many ways, that is how *most* of the VFS layer works - it exposes
> helper functions that the filesystems can use (and most do), but
> doesn't force them.
> 
> Yes, the VFS layer does force some things - you can't avoid using
> dentries, for example, because that's literally how the VFS layer
> deals with filenames (and things like mounting etc). And honestly, the
> VFS layer does a better job of filename caching than any filesystem
> really can do, and with the whole UNIX mount model, filenames
> fundamentally cross filesystem boundaries anyway.
> 
> But clearly the VFS layer inode list handling isn't the best it can
> be, and unless we can fix that in some fundamental way (and I don't
> love the "let's use crazy lists instead of a simple one" models) I do
> think that just letting filesystems do their own thing if they have
> something better is a good model.

Well, I don't love adding more indirection and callbacks.

The underlying approach in this patchset of "just use the inode hash
table if that's available" - that I _do_ like, but this seems like
the wrong way to go about it, we're significantly adding to the amount
of special purpose "things" filesystems have to do if they want to
perform well.

Converting the standard inode hash table to an rhashtable (or more
likely, creating a new standard implementation and converting
filesystems one at a time) still needs to happen, and then the "use the
hash table for iteration" approach could use that without every
filesystem having to specialize.

Failing that, or even regardless, I think we do need either dlock-list
or fast-list. "I need some sort of generic list, but fast" is something
I've seen come up way too many times.