On Mon, Sep 23, 2024 at 03:56:50PM -0400, Kent Overstreet wrote:
> On Mon, Sep 23, 2024 at 10:18:57AM GMT, Linus Torvalds wrote:
> > On Sat, 21 Sept 2024 at 12:28, Kent Overstreet
> > <kent.overstreet@xxxxxxxxx> wrote:
> > >
> > > We're now using an rhashtable instead of the system inode hash table;
> > > this is another significant performance improvement on multithreaded
> > > metadata workloads, eliminating more lock contention.
> >
> > So I've pulled this, but I reacted to this issue - what is the load
> > where the inode hash table is actually causing issues?
> >
> > Because honestly, the reason we're using that "one single lock" for
> > the inode hash table is that nobody has ever bothered.
> >
> > In particular, it *should* be reasonably straightforward (lots of
> > small details, but not fundamentally hard) to just make the inode hash
> > table be a "bl" hash - with the locking done per-bucket, not using one
> > global lock.
>
> Dave was working on that - he posted patches and it seemed like they
> were mostly done, not sure what happened with those.

I lost interest in that patchset a long while ago. The last posting I
did was largely as a community service, to get the completed, lockdep-
and RTPREEMPT-compatible version of the hashbl infrastructure it needed
out there, so that other people could easily push this forward if they
needed it. That was here:

https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@xxxxxxxxxxxxx/

and I posted it because Kent was asking about it, because his attempts
to convert the inode hash to use rhashtables weren't working.

I've suggested multiple times since then, when people have asked me
about that patch set, that they are free to pick it up and finish it
off themselves. Unfortunately, that usually results in silence, a
comment of "I don't have time for that", and/or they go off and hack
around the issue in some other way....

> > But nobody has ever seemed to care before. Because the inode hash just
> > isn't all that heavily used, since normal loads don't look up inodes
> > directly (they look up dentries, and you find the inodes off those -
> > and the *dentry* hash scales well thanks to both RCU and doing the
> > bucket lock thing).

Not entirely true. Yes, the dentry cache protects the inode hash from
contention when the workload repeatedly hits cached dentries. However,
the problematic workload is cold cache operations, where the dentry
cache repeatedly misses. This places all the operational concurrency
directly on the inode hash as new inodes are inserted into it. Add
memory reclaim, which removes inodes from the hash on eviction, and
that adds further contention. This happens in workloads that cycle
lots of inodes through memory.

Concurrent cold cache lookups, creates and unlinks hammer the global
filesystem inode hash lock at more than 4 active tasks, especially
when there is also concurrent memory reclaim running and evicting
inodes from the inode hash.

This inode cache miss contention issue was sort-of hacked around
recently by commit 7180f8d91fcb ("vfs: add rcu-based find_inode
variants for iget ops"). This allowed filesystems to use -partially-
RCU-based lookups rather than fully locked lookups. In the case of a
hit, it will be an RCU operation, but as you state, cache hits on the
inode hash are the rare case, not the common case.
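To make the shape of that concrete, here's a rough sketch of the
"partially RCU" lookup/insert pattern being discussed (illustrative
only - made-up function name, simplified key comparison, no refcounting
or I_NEW handling, not the actual fs/inode.c code):

#include <linux/fs.h>
#include <linux/rculist.h>
#include <linux/spinlock.h>

/*
 * Sketch: the lookup that misses is lockless, but every miss then
 * falls back to the single global hash lock to do the insert, so on
 * a cold-cache workload all the concurrency still funnels through
 * that one spinlock.
 */
static struct inode *sketch_lookup_insert(struct hlist_head *head,
					  spinlock_t *hash_lock,
					  struct inode *new,
					  unsigned long ino)
{
	struct inode *inode;

	/* Hit path: lockless walk of the hash chain under RCU. */
	rcu_read_lock();
	hlist_for_each_entry_rcu(inode, head, i_hash) {
		if (inode->i_ino == ino) {
			rcu_read_unlock();
			return inode;		/* no hash lock taken */
		}
	}
	rcu_read_unlock();

	/*
	 * Miss path (the common case for cold-cache workloads): take
	 * the global lock, re-check for a racing insert, then insert.
	 */
	spin_lock(hash_lock);
	hlist_for_each_entry(inode, head, i_hash) {
		if (inode->i_ino == ino) {
			spin_unlock(hash_lock);
			return inode;
		}
	}
	hlist_add_head_rcu(&new->i_hash, head);
	spin_unlock(hash_lock);
	return new;
}

The hash_bl approach Linus describes above replaces that single
hash_lock with a per-bucket bit spinlock (hlist_bl_head), so concurrent
inserts and removals only contend when they land in the same bucket.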
I say "sort-of hacked around" because the RCU lookup only avoids the
inode hash lock on the initial lookup that misses. The insert is still
done under the inode_hash_lock, so inode_hash_lock contention still
definitely exists in this path. All the above commit did was move the
catastrophic cacheline contention breakdown point higher than the test
workload that was being run could hit.

In review of that commit, I suggested that the inode hash_bl conversion
patches be finished off instead, but the response was "I don't have
time for that". And so here we are again....

-Dave.

--
Dave Chinner
david@xxxxxxxxxxxxx