Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability

Kent Overstreet <kent.overstreet@xxxxxxxxx> · Wed, 2 Oct 2024 22:24:36 -0400

On Thu, Oct 03, 2024 at 11:41:42AM GMT, Dave Chinner wrote:
> > A couple things that help - we've already determined that the inode LRU
> > can go away for most filesystems,
> 
> We haven't determined that yet. I *think* it is possible, but there
> is a really nasty inode LRU dependencies that has been driven deep
> down into the mm page cache writeback code.  We have to fix that
> awful layering violation before we can get rid of the inode LRU.
> 
> I *think* we can do it by requiring dirty inodes to hold an explicit
> inode reference, thereby keeping the inode pinned in memory whilst
> it is being tracked for writeback. That would also get rid of the
> nasty hacks needed in evict() to wait on writeback to complete on
> unreferenced inodes.
> 
> However, this isn't simple to do, and so getting rid of the inode
> LRU is not going to happen in the near term.

Ok.

> > and we can preallocate slots without
> > actually adding objects. Iteration will see NULLs that they skip over,
> > so we can't simply preallocate a slot for everything if nr_live_objects
> > / nr_lru_objects is too big. But, we can certainly preallocate slots on
> > a given code path and then release them back to the percpu buffer if
> > they're not used.
> 
> I'm not really that interested in spending time trying to optimise
> away list_lru contention at this point in time.

Fair, and I'm not trying to derail this one - whether dlock-list, or
fast-list, or super_iter_inodes(), we should do one of them. On current
mainline, I see lock contention bad enough to trigger bcachefs's 10
second "srcu lock held too long" warning, any of these solves the
biggest problem.

But...

> It's not a performance limiting factor because inode and
> dentry LRU scalability is controllable by NUMA configuration. i.e.
> if you have severe list_lru lock contention on inode and dentry
> caches, then either turn on Sub-NUMA Clustering in your bios,
> configure your VM with more discrete nodes, or use the fake-numa=N
> boot parameter to increase the number of nodes the kernel sets up.
> This will increase the number of list_lru instances for NUMA aware
> shrinkers and the contention will go away.

I don't buy this, asking users to change their bios (and even to know
that's a thing they should consider) is a big ask. Linked lists _suck_,
both w.r.t. locking and cache behaviour, and we need to be exploring
better options.

> > > - LRU lists are -ordered- (it's right there in the name!) and this
> > >   appears to be an unordered list construct.
> > 
> > Yes, it is. But in actual practice cache replacement policy tends not to
> > matter nearly as much as people think; there's many papers showing real
> > world hit ratio of common algorithms is only a fudge factor from random
> > replacement - the main thing you want is an accessed bit (or counter, if
> > you want the analagous version of n-lru for n > 2), and we'll still have
> > that.
> 
> Sure.  But I can cherry-pick many papers showing exactly the opposite.
> i.e. that LRU and LFU algorithms are far superior at maintaining a
> working set compared to random cache shootdown, especially when
> there is significant latency for cache replacement.

But as mentioned we won't be comparing against pure random, it'll be vs.
pure random with at least an accessed bit, preserving the multiple
generations which are the most important feature of LRU/LFU as we use
them.

> What matters is whether there are any behavioural regressions as a
> result of changing the current algorithm. We've used quasi-LRU
> working set management for so long that this is the behaviour that
> people have tuned their systems and applications to work well with.
> Fundamental changes to working set maintenance behaviour is not
> something I'm considering doing, nor something I *want* to do.

Yup, it's not going to be the easiest thing to tackle.