On Thu, Oct 03, 2024 at 12:20:53PM GMT, Dave Chinner wrote:
> On Wed, Oct 02, 2024 at 09:22:38PM -0400, Kent Overstreet wrote:
> > On Thu, Oct 03, 2024 at 09:17:08AM GMT, Dave Chinner wrote:
>
> Which was painful to work with because it maintains the existing
> spin lock based traversal pattern. This was necessary because the
> iterator held a spinlock internally. I really didn't like that
> aspect of it because it perpetuated the need to open code the
> iget/iput game to allow the spinlock to be dropped across the inode
> operation that needed to be performed.
>
> i.e. adding a dlist iterator didn't clean up any of the other mess
> that sb->s_inodes iteration required...

Yeah, true. That's actually something that does get cleaner with
fast-list: because we're iterating over a radix tree and our iterator
is a radix tree index, the crab-walk thing naturally goes away.

> > My concern is that we've been trying to get away from callbacks for
> > iteration - post spectre they're quite a bit more expensive than
> > external iterators, and we've generally been successful with that.
>
> So everyone keeps saying, but the old adage applies here: penny
> wise, pound foolish.
>
> Optimising away the callbacks might bring us a few percent
> performance improvement for each operation (e.g. via the dlist
> iterator mechanisms) in a traversal, but that iteration is still
> only single threaded. Hence the maximum processing rate is
> determined by the performance of a single CPU core.
>
> However, if we change the API to allow for parallelism at the cost
> of a few percent per object operation, then a single CPU core will
> not process quite as many objects as before. However, the moment we
> allow multiple CPU cores to process in parallel, we achieve
> processing rate improvements measured in integer multiples.
>
> Modern CPUs have concurrency to burn. Optimising APIs for minimum
> per-operation overhead rather than for concurrent processing
> implementations is the wrong direction to be taking....

OTOH, this is all academic, because none of the uses of s_inodes are
_remotely_ fastpaths. Aside from nr_blockdev_pages() it's more or less
all filesystem teardown, or paths of similar frequency.

> > Radix tree doesn't work for us, since our keys are { inum, subvol }
> > - 96 bits -
>
> Sure it does - you just need two layers of radix trees. i.e. have a
> radix tree per subvol to index inodes by inum, and a per-sb radix
> tree to index the subvols. With some code to propagate radix tree
> bits from the inode radix tree to the subvol radix tree, they then
> largely work in conjunction for filtered searches.

It'd have to be the reverse - index by inum, then subvol - and then
we'd need to do bit stuffing so that a radix tree with a single
element is just a pointer to the element.

But - yeah, if the current approach (not considering the subvol when
calculating the hash) becomes an issue, that might be the way to go.

> This is -exactly- the internal inode cache structure that XFS has.
> We have a per-sb radix tree indexing the allocation groups, and a
> radix tree per allocation group indexing inodes by inode number.
> Hence an inode lookup involves splitting the inum into agno/agino
> pairs, then doing a perag lookup with the agno, and then doing a
> perag inode cache lookup with the agino. All of these radix tree
> lookups are lockless...

Speaking of, I'd like to pick your brain on AGIs at some point. We've
been sketching out future scalability work in bcachefs, and I think
that's going to be one of the things we'll end up needing.
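
For anyone following along, the two-level lookup you describe works
out to roughly the following - a rough sketch, not the real XFS code;
the demo_* structures and the inum split are simplified stand-ins for
xfs_mount/xfs_perag:

	#include <linux/radix-tree.h>
	#include <linux/rcupdate.h>
	#include <linux/fs.h>

	/* simplified stand-ins for the real xfs_mount/xfs_perag */
	struct demo_mount {
		struct radix_tree_root	m_perag_tree;	/* agno -> perag */
		unsigned		m_agino_bits;	/* low bits of inum */
	};

	struct demo_perag {
		struct radix_tree_root	pag_ici_root;	/* agino -> inode */
	};

	static struct inode *demo_ilookup(struct demo_mount *mp, u64 inum)
	{
		/* split the inum into allocation group / inode-in-group */
		unsigned long agno  = inum >> mp->m_agino_bits;
		unsigned long agino = inum & ((1UL << mp->m_agino_bits) - 1);
		struct demo_perag *pag;
		struct inode *ip = NULL;

		rcu_read_lock();
		pag = radix_tree_lookup(&mp->m_perag_tree, agno);
		if (pag)
			ip = radix_tree_lookup(&pag->pag_ici_root, agino);
		rcu_read_unlock();

		/* caller still has to igrab() and revalidate, per RCU rules */
		return ip;
	}

Both lookups are plain RCU-protected radix tree walks, so nothing
here takes a lock.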
Right now the scalability limit is backpointers fsck, but that looks
fairly trivial to solve: there's no reason to run the backpointers ->
extents pass except for debug testing, since we can check and repair
those references at runtime; and we can sum up the backpointers in a
bucket, check them against the bucket sector counts, and skip the
extents -> backpointers pass if they match.

After that, the next scalability limitation should be the main
check_alloc_info pass, and we'll need something analogous to AGIs to
shard it and run it efficiently when the main allocation info doesn't
fit in memory - and it sounds like you have other optimizations that
leverage AGIs as well.
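
The backpointer summing shortcut would look something like this -
again just a sketch with made-up demo_* types; the real alloc info
and backpointer formats differ:

	#include <linux/types.h>

	/* made-up, simplified versions of the on-disk structures */
	struct demo_alloc_info {
		u32	dirty_sectors;
		u32	cached_sectors;
	};

	struct demo_backpointer {
		u64	bucket_offset;
		u32	bucket_len;	/* sectors this backpointer covers */
	};

	/*
	 * If the backpointers in a bucket sum to the sector counts the
	 * alloc info already records, we can skip the per-extent
	 * extents -> backpointers check for that bucket.
	 */
	static bool demo_bucket_backpointers_match(const struct demo_alloc_info *a,
						   const struct demo_backpointer *bps,
						   unsigned nr)
	{
		u64 sectors = 0;
		unsigned i;

		for (i = 0; i < nr; i++)
			sectors += bps[i].bucket_len;

		return sectors == (u64)a->dirty_sectors + a->cached_sectors;
	}

That keeps the common case to one pass over the backpointers per
bucket, with the expensive per-extent checking reserved for buckets
where the sums disagree.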