On Mon, Sep 23, 2024 at 10:18:57AM GMT, Linus Torvalds wrote:
> On Sat, 21 Sept 2024 at 12:28, Kent Overstreet
> <kent.overstreet@xxxxxxxxx> wrote:
> >
> > We're now using an rhashtable instead of the system inode hash table;
> > this is another significant performance improvement on multithreaded
> > metadata workloads, eliminating more lock contention.
>
> So I've pulled this, but I reacted to this issue - what is the load
> where the inode hash table is actually causing issues?
>
> Because honestly, the reason we're using that "one single lock" for
> the inode hash table is that nobody has ever bothered.
>
> In particular, it *should* be reasonably straightforward (lots of
> small details, but not fundamentally hard) to just make the inode hash
> table be a "bl" hash - with the locking done per-bucket, not using one
> global lock.

Dave was working on that - he posted patches and it seemed like they
were mostly done; not sure what happened with those.

> But nobody has ever seemed to care before. Because the inode hash just
> isn't all that heavily used, since normal loads don't look up inodes
> directly (they look up dentries, and you find the inodes off those -
> and the *dentry* hash scales well thanks to both RCU and doing the
> bucket lock thing).
>
> So the fact that bcachefs cares makes me go "Why?"

I've seen benchmarks (not just on bcachefs, ext4 as well) where
contention on the inode hash was actually _really_ heavy. They were
synthetic, but I'd expect to see the same effect from workloads where
we're streaming a lot of inodes through the cache - think rsync.

Another data point: the btree key cache work I just did was for exactly
this type of situation - it addressed lock contention when streaming
items into and out of the key cache, and those items are mostly inodes
(it's also used for the alloc btree). I had reports from multiple users
that it was a drastic improvement, much more than expected, and that it
eliminated the main source of our "srcu lock held too long" warnings -
and if lock contention is bad enough to trip a 10 second warning, you
know it's bad, heh.

I try to jump on these sorts of lock contention issues because I've
seen many instances where they caused performance cliffs - in practice
they're often O(n^2) issues, because as your workload scales up it's
usually not just the number of times you take the lock that increases;
something else is usually scaling too, and even the increased cacheline
bouncing from the contention itself becomes a problem.
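
Aside, since the "bl" hash came up above: per-bucket locking just means
each hash chain carries its own lock, so only threads that hash to the
same bucket contend. Below is a minimal userspace sketch of the idea -
all names are made up for illustration; in the kernel this is roughly
what hlist_bl provides (the dentry hash uses it), with the per-bucket
lock kept as a bit spinlock in bit 0 of the bucket's head pointer.

#include <pthread.h>
#include <stddef.h>

#define NR_BUCKETS	1024

struct node {
	unsigned long	key;
	struct node	*next;
};

/* One spinlock per bucket instead of one lock for the whole table */
struct bucket {
	pthread_spinlock_t	lock;
	struct node		*head;
};

static struct bucket table[NR_BUCKETS];

static void table_init(void)
{
	for (size_t i = 0; i < NR_BUCKETS; i++)
		pthread_spin_init(&table[i].lock, PTHREAD_PROCESS_PRIVATE);
}

static struct bucket *bucket_for(unsigned long key)
{
	return &table[key % NR_BUCKETS];
}

static void table_insert(struct node *n)
{
	struct bucket *b = bucket_for(n->key);

	/* Only operations that hash to this bucket serialize here */
	pthread_spin_lock(&b->lock);
	n->next = b->head;
	b->head = n;
	pthread_spin_unlock(&b->lock);
}

static struct node *table_lookup(unsigned long key)
{
	struct bucket *b = bucket_for(key);
	struct node *n;

	pthread_spin_lock(&b->lock);
	for (n = b->head; n; n = n->next)
		if (n->key == key)
			break;
	pthread_spin_unlock(&b->lock);
	return n;
}

The hlist_bl trick of folding the lock bit into the head pointer means
the per-bucket locks don't grow the table at all, and the dentry hash
additionally does its lookups under RCU, so the common lookup path
doesn't take the bucket lock at all.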