On Thu, Oct 03, 2024 at 12:20:53PM GMT, Dave Chinner wrote:
> On Wed, Oct 02, 2024 at 09:22:38PM -0400, Kent Overstreet wrote:
> > On Thu, Oct 03, 2024 at 09:17:08AM GMT, Dave Chinner wrote:
>
> Which was painful to work with because it maintains the existing
> spin lock based traversal pattern. This was necessary because the
> iterator held a spinlock internally. I really didn't like that
> aspect of it because it perpetuated the need to open code the
> iget/iput game to allow the spinlock to be dropped across the inode
> operation that needed to be performed.
>
> i.e. adding a dlist iterator didn't clean up any of the other mess
> that sb->s_inodes iteration required...

Yeah, true. That's actually something that does get cleaner with
fast-list: because we're iterating over a radix tree and our iterator
is a radix tree index, the crab-walk thing naturally goes away.

> > My concern is that we've been trying to get away from callbacks for
> > iteration - post spectre they're quite a bit more expensive than
> > external iterators, and we've generally been successful with that.
>
> So everyone keeps saying, but the old adage applies here: penny
> wise, pound foolish.
>
> Optimising away the callbacks might bring us a few percent
> performance improvement for each operation (e.g. via the dlist
> iterator mechanisms) in a traversal, but that iteration is still
> only single threaded. Hence the maximum processing rate is
> determined by the performance of a single CPU core.
>
> However, if we change the API to allow for parallelism at the cost
> of a few percent per object operation, then a single CPU core will
> not process quite as many objects as before. However, the moment we
> allow multiple CPU cores to process in parallel, we achieve
> processing rate improvements measured in integer multiples.
>
> Modern CPUs have concurrency to burn. Optimising APIs for minimum
> per-operation overhead rather than for concurrent processing
> implementations is the wrong direction to be taking....

OTOH, this is all academic, because none of the uses of s_inodes are
_remotely_ fastpaths. Aside from nr_blockdev_pages() it's more or less
all filesystem teardown, or paths of similar frequency.

> > Radix tree doesn't work for us, since our keys are { inum, subvol }
> > - 96 bits -
>
> Sure it does - you just need two layers of radix trees. i.e. have a
> radix tree per subvol to index inodes by inum, and a per-sb radix
> tree to index the subvols. With some code to propagate radix tree
> bits from the inode radix tree to the subvol radix tree, they then
> largely work in conjunction for filtered searches.

It'd have to be the reverse - index by inum, then subvol - and then
we'd need to do bit stuffing so that a radix tree with a single
element is just a pointer to the element.

But - yeah, if the current approach (not considering the subvol when
calculating the hash) becomes an issue, that might be the way to go.

> This is -exactly- the internal inode cache structure that XFS has.
> We have a per-sb radix tree indexing the allocation groups, and a
> radix tree per allocation group indexing inodes by inode number.
> Hence an inode lookup involves splitting the inum into agno/agino
> pairs, then doing a perag lookup with the agno, and then doing a
> perag inode cache lookup with the agino. All of these radix tree
> lookups are lockless...

Speaking of, I'd like to pick your brain on AGIs at some point. We've
been sketching out future scalability work in bcachefs, and I think
that's going to be one of the things we'll end up needing.
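
For anyone following along, the two-level lookup you describe works
out to roughly the following - a rough sketch, not the real XFS code;
the demo_* structures and the inum split are simplified stand-ins for
xfs_mount/xfs_perag:

	#include <linux/radix-tree.h>
	#include <linux/rcupdate.h>
	#include <linux/fs.h>

	/* simplified stand-ins for the real xfs_mount/xfs_perag */
	struct demo_mount {
		struct radix_tree_root	m_perag_tree;	/* agno -> perag */
		unsigned		m_agino_bits;	/* low bits of inum */
	};

	struct demo_perag {
		struct radix_tree_root	pag_ici_root;	/* agino -> inode */
	};

	static struct inode *demo_ilookup(struct demo_mount *mp, u64 inum)
	{
		/* split the inum into allocation group / inode-in-group */
		unsigned long agno  = inum >> mp->m_agino_bits;
		unsigned long agino = inum & ((1UL << mp->m_agino_bits) - 1);
		struct demo_perag *pag;
		struct inode *ip = NULL;

		rcu_read_lock();
		pag = radix_tree_lookup(&mp->m_perag_tree, agno);
		if (pag)
			ip = radix_tree_lookup(&pag->pag_ici_root, agino);
		rcu_read_unlock();

		/* caller still has to igrab() and revalidate, per RCU rules */
		return ip;
	}

Both lookups are plain RCU-protected radix tree walks, so nothing
here takes a lock.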
Right now the scalability limit is backpointers fsck, but that looks
fairly trivial to solve: there's no reason to run the backpointers ->
extents pass except for debug testing, since we can check and repair
those references at runtime; and we can sum up the backpointers in a
bucket, check them against the bucket sector counts, and skip the
extents -> backpointers pass if they match.

After that, the next scalability limitation should be the main
check_alloc_info pass, and we'll need something analogous to AGIs to
shard it and run it efficiently when the main allocation info doesn't
fit in memory - and it sounds like you have other optimizations that
leverage AGIs as well.
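
The backpointer summing shortcut would look something like this -
again just a sketch with made-up demo_* types; the real alloc info
and backpointer formats differ:

	#include <linux/types.h>

	/* made-up, simplified versions of the on-disk structures */
	struct demo_alloc_info {
		u32	dirty_sectors;
		u32	cached_sectors;
	};

	struct demo_backpointer {
		u64	bucket_offset;
		u32	bucket_len;	/* sectors this backpointer covers */
	};

	/*
	 * If the backpointers in a bucket sum to the sector counts the
	 * alloc info already records, we can skip the per-extent
	 * extents -> backpointers check for that bucket.
	 */
	static bool demo_bucket_backpointers_match(const struct demo_alloc_info *a,
						   const struct demo_backpointer *bps,
						   unsigned nr)
	{
		u64 sectors = 0;
		unsigned i;

		for (i = 0; i < nr; i++)
			sectors += bps[i].bucket_len;

		return sectors == (u64)a->dirty_sectors + a->cached_sectors;
	}

That keeps the common case to one pass over the backpointers per
bucket, with the expensive per-extent checking reserved for buckets
where the sums disagree.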