On Mon, Sep 23, 2024 at 10:18:57AM GMT, Linus Torvalds wrote:
> On Sat, 21 Sept 2024 at 12:28, Kent Overstreet
> <kent.overstreet@xxxxxxxxx> wrote:
> >
> > We're now using an rhashtable instead of the system inode hash table;
> > this is another significant performance improvement on multithreaded
> > metadata workloads, eliminating more lock contention.
>
> So I've pulled this, but I reacted to this issue - what is the load
> where the inode hash table is actually causing issues?
>
> Because honestly, the reason we're using that "one single lock" for
> the inode hash table is that nobody has ever bothered.
>
> In particular, it *should* be reasonably straightforward (lots of
> small details, but not fundamentally hard) to just make the inode hash
> table be a "bl" hash - with the locking done per-bucket, not using one
> global lock.

Dave was working on that - he posted patches and it seemed like they
were mostly done; not sure what happened with those.

> But nobody has ever seemed to care before. Because the inode hash just
> isn't all that heavily used, since normal loads don't look up inodes
> directly (they look up dentries, and you find the inodes off those -
> and the *dentry* hash scales well thanks to both RCU and doing the
> bucket lock thing).
>
> So the fact that bcachefs cares makes me go "Why?"

I've seen benchmarks (not just on bcachefs, ext4 as well) where
contention on the inode hash was actually _really_ heavy. They were
synthetic, but I'd expect to see the same effect from workloads where
we're streaming a lot of inodes through the cache - think rsync.

Another data point: the btree key cache work I just did was for exactly
this type of situation - it addressed lock contention when streaming
items into and out of the key cache, and those items are mostly inodes
(it's also used for the alloc btree). I had reports from multiple users
that it was a drastic improvement, much more than expected, and that it
eliminated the main source of our "srcu lock held too long" warnings -
and if lock contention is bad enough to trip a 10 second warning, you
know it's bad, heh.

I try to jump on these sorts of lock contention issues because I've
seen many instances where they caused performance cliffs - in practice
they're often O(n^2) issues, because as your workload scales up it's
usually not just the number of times you take the lock that increases;
something else is usually scaling too, and even the increased cacheline
bouncing from the contention itself becomes a problem.
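
Aside, since the "bl" hash came up above: per-bucket locking just means
each hash chain carries its own lock, so only threads that hash to the
same bucket contend. Below is a minimal userspace sketch of the idea -
all names are made up for illustration; in the kernel this is roughly
what hlist_bl provides (the dentry hash uses it), with the per-bucket
lock kept as a bit spinlock in bit 0 of the bucket's head pointer.

#include <pthread.h>
#include <stddef.h>

#define NR_BUCKETS	1024

struct node {
	unsigned long	key;
	struct node	*next;
};

/* One spinlock per bucket instead of one lock for the whole table */
struct bucket {
	pthread_spinlock_t	lock;
	struct node		*head;
};

static struct bucket table[NR_BUCKETS];

static void table_init(void)
{
	for (size_t i = 0; i < NR_BUCKETS; i++)
		pthread_spin_init(&table[i].lock, PTHREAD_PROCESS_PRIVATE);
}

static struct bucket *bucket_for(unsigned long key)
{
	return &table[key % NR_BUCKETS];
}

static void table_insert(struct node *n)
{
	struct bucket *b = bucket_for(n->key);

	/* Only operations that hash to this bucket serialize here */
	pthread_spin_lock(&b->lock);
	n->next = b->head;
	b->head = n;
	pthread_spin_unlock(&b->lock);
}

static struct node *table_lookup(unsigned long key)
{
	struct bucket *b = bucket_for(key);
	struct node *n;

	pthread_spin_lock(&b->lock);
	for (n = b->head; n; n = n->next)
		if (n->key == key)
			break;
	pthread_spin_unlock(&b->lock);
	return n;
}

The hlist_bl trick of folding the lock bit into the head pointer means
the per-bucket locks don't grow the table at all, and the dentry hash
additionally does its lookups under RCU, so the common lookup path
doesn't take the bucket lock at all.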