Re: [RFC PATCH 0/7] vfs: improving inode cache iteration scalability

On Thu, Oct 03, 2024 at 12:20:53PM GMT, Dave Chinner wrote:
> On Wed, Oct 02, 2024 at 09:22:38PM -0400, Kent Overstreet wrote:
> > On Thu, Oct 03, 2024 at 09:17:08AM GMT, Dave Chinner wrote:
> Which was painful to work with because
> it maintains the existing spin lock based traversal pattern. This
> was necessary because the iterator held a spinlock internally. I
> really didn't like that aspect of it because it perpetuated the need
> to open code the iget/iput game to allow the spinlock to be dropped
> across the inode operation that needed to be performed.
> 
> i.e. adding a dlist iterator didn't clean up any of the other mess
> that sb->s_inodes iteration required...

yeah, true.

that's actually something that does get cleaner with fast-list; because
we're iterating over a radix tree and our iterator is just a radix tree
index, the crab-walk - pinning the current inode so we can safely step
to the next one - naturally goes away.
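
i.e. something like this - just a sketch with made-up names
(s_inode_cache and its lock aren't the real fast-list interface), but it
shows why an index cursor helps:

        unsigned long idx = 0;
        struct inode *inode;

        spin_lock(&sb->s_inode_cache_lock);             /* made-up lock */
        xa_for_each(&sb->s_inode_cache, idx, inode) {   /* made-up xarray */
                if (!igrab(inode))
                        continue;
                spin_unlock(&sb->s_inode_cache_lock);

                do_something(inode);    /* placeholder; may block */

                iput(inode);
                spin_lock(&sb->s_inode_cache_lock);
                /*
                 * The walk resumes from idx; we never had to keep the
                 * previous inode pinned just to find the next one.
                 */
        }
        spin_unlock(&sb->s_inode_cache_lock);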

> > My concern is that we've been trying to get away from callbacks for
> > iteration - post spectre they're quite a bit more expensive than
> > external iterators, and we've generally been successful with that. 
> 
> So everyone keeps saying, but the old adage applies here: Penny
> wise, pound foolish.
> 
> Optimising away the callbacks might bring us a few percent
> performance improvement for each operation (e.g. via the dlist
> iterator mechanisms) in a traversal, but that iteration is still
> only single threaded. Hence the maximum processing rate is
> determined by the performance of a single CPU core.
> 
> However, if we change the API to allow for parallelism at the cost
> of a few percent per object operation, then a single CPU core will
> not process quite as many objects as before. However, the moment we
> allow multiple CPU cores to process in parallel, we achieve
> processing rate improvements measured in integer multiples.
> 
> Modern CPUs have concurrency to burn.  Optimising APIs for minimum
> per-operation overhead rather than for concurrent processing
> implementations is the wrong direction to be taking....
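
Sure - and to be concrete, the shape you're describing is something like
this (a sketch only, not a proposal; the names are made up):

        typedef int (*inode_iter_fn)(struct inode *inode, void *data);

        /*
         * A callback-based walk leaves the iteration strategy entirely
         * up to the implementation: a simple version walks the whole
         * cache on the calling CPU, a parallel version can queue a work
         * item per cache shard, call fn() from each worker, and wait
         * for them all to complete.
         */
        int super_iterate_inodes(struct super_block *sb, inode_iter_fn fn,
                                 void *data);

The caller just passes a per-inode callback and never sees how (or
whether) the walk is sharded across CPUs.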

OTOH - this is all academic, because none of the users of s_inodes are
_remotely_ fast paths. Aside from nr_blockdev_pages() it's more or less
all filesystem teardown, or things that run with similar frequency.

> > Radix tree doesn't work for us, since our keys are { inum, subvol } - 96
> > bits -
> 
> Sure it does - you just need two layers of radix trees. i.e. have a
> radix tree per subvol to index inodes by inum, and a per-sb radix
> tree to index the subvols. With some code to propagate radix tree
> bits from the inode radix tree to the subvol radix tree they then
> largely work in conjunction for filtered searches.

It'd have to be the reverse - index by inum, then subvol, and then we'd
need to do bit stuffing so that a radix tree with a single element is
just a pointer to the element. But - yeah, if the current approach (not
considering the subvol when calculating the hash) becomes an issue, that
might be the way to go.
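
Something like this, roughly (sketch only; sb_inode_cache() and the
pointer-tagging helpers are made up):

        static struct inode *inode_cache_lookup(struct super_block *sb,
                                                u64 inum, u32 subvol)
        {
                /* outer tree keyed by inum */
                void *entry = xa_load(sb_inode_cache(sb), inum);

                if (!entry)
                        return NULL;

                /*
                 * Common case: only one subvol has an inode at this
                 * inum, so the slot holds the inode pointer directly; a
                 * low pointer bit distinguishes it from an inner tree.
                 */
                if (!entry_is_subvol_tree(entry))
                        return entry;

                /* otherwise: inner tree keyed by subvol */
                return xa_load(entry_to_subvol_tree(entry), subvol);
        }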

> This is -exactly- the internal inode cache structure that XFS has.
> We have a per-sb radix tree indexing the allocation groups, and a
> radix tree per allocation group indexing inodes by inode number.
> Hence an inode lookup involves splitting the inum into agno/agino
> pairs, then doing a perag lookup with the agno, and doing a perag
> inode cache lookup with the agino. All of these radix tree
> lookups are lockless...
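
IIUC the lookup side of that is roughly the following (simplified and
from memory; error and retry handling elided):

        agno  = XFS_INO_TO_AGNO(mp, ino);       /* upper bits of the inum */
        agino = XFS_INO_TO_AGINO(mp, ino);      /* lower bits of the inum */

        pag = xfs_perag_get(mp, agno);          /* per-sb radix tree */

        rcu_read_lock();
        ip = radix_tree_lookup(&pag->pag_ici_root, agino);  /* per-AG tree */
        rcu_read_unlock();

        xfs_perag_put(pag);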

Speaking of, I'd like to pick your brain on AGIs at some point. We've
been sketching out future scalability work in bcachefs, and I think
that's going to be one of the things we'll end up needing.

Right now the scalability limit is backpointers fsck, but that looks
fairly trivial to solve: there's no reason to run the backpointers ->
extents pass except for debug testing, since we can check and repair
those references at runtime; and we can sum up the backpointers in each
bucket, check the sum against the bucket's sector counts, and skip the
extents -> backpointers pass for buckets where they match.
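
i.e. roughly (a sketch with made-up names, not actual bcachefs code):

        u64 bp_sectors = 0;

        for_each_bucket_backpointer(c, bucket, bp)      /* made-up iterator */
                bp_sectors += backpointer_sectors(bp);

        if (bp_sectors == bucket_sectors_used(c, bucket)) {
                /*
                 * Every extent pointing into this bucket is already
                 * accounted for by a backpointer, so the extents ->
                 * backpointers pass doesn't need to visit it.
                 */
                mark_bucket_backpointers_checked(c, bucket);
        }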

After that, the next scalability limitation should be the main
check_alloc_info pass, and we'll need something analogous to AGIs to
shard that and run it efficiently when the main allocation info doesn't
fit in memory - and it sounds like you have other optimizations that
leverage AGIs as well.



