On Mon, Sep 23, 2024 at 03:56:50PM -0400, Kent Overstreet wrote:
> On Mon, Sep 23, 2024 at 10:18:57AM GMT, Linus Torvalds wrote:
> > On Sat, 21 Sept 2024 at 12:28, Kent Overstreet
> > <kent.overstreet@xxxxxxxxx> wrote:
> > >
> > > We're now using an rhashtable instead of the system inode hash table;
> > > this is another significant performance improvement on multithreaded
> > > metadata workloads, eliminating more lock contention.
> >
> > So I've pulled this, but I reacted to this issue - what is the load
> > where the inode hash table is actually causing issues?
> >
> > Because honestly, the reason we're using that "one single lock" for
> > the inode hash table is that nobody has ever bothered.
> >
> > In particular, it *should* be reasonably straightforward (lots of
> > small details, but not fundamentally hard) to just make the inode hash
> > table be a "bl" hash - with the locking done per-bucket, not using one
> > global lock.
>
> Dave was working on that - he posted patches and it seemed like they
> were mostly done, not sure what happened with those.

I lost interest in that patchset a long while ago. The last posting I
did was largely as a community service, to get the completed, lockdep-
and RTPREEMPT-compatible version of the hashbl infrastructure it needed
out there, so that other people could easily push this forward if they
needed it. That was here:

https://lore.kernel.org/linux-fsdevel/20231206060629.2827226-1-david@xxxxxxxxxxxxx/

and I posted it because Kent was asking about it, because his attempts
to convert the inode hash to use rhashtables weren't working.

I've suggested multiple times since then, when people have asked me
about that patch set, that they are free to pick it up and finish it
off themselves. Unfortunately, that usually results in silence, a
comment of "I don't have time for that", and/or they go off and hack
around the issue in some other way....

> > But nobody has ever seemed to care before. Because the inode hash just
> > isn't all that heavily used, since normal loads don't look up inodes
> > directly (they look up dentries, and you find the inodes off those -
> > and the *dentry* hash scales well thanks to both RCU and doing the
> > bucket lock thing).

Not entirely true. Yes, the dentry cache protects the inode hash from
contention when the workload repeatedly hits cached dentries. However,
the problematic workload is cold cache operations, where the dentry
cache repeatedly misses. This places all the operational concurrency
directly on the inode hash as new inodes are inserted into it. Add
memory reclaim, which removes inodes from the hash on eviction, and
that adds further contention. This happens in workloads that cycle
lots of inodes through memory.

Concurrent cold cache lookups, creates and unlinks hammer the global
filesystem inode hash lock at more than 4 active tasks, especially
when there is also concurrent memory reclaim running and evicting
inodes from the inode hash.

This inode cache miss contention issue was sort-of hacked around
recently by commit 7180f8d91fcb ("vfs: add rcu-based find_inode
variants for iget ops"). This allowed filesystems to use -partially-
RCU-based lookups rather than fully locked lookups. In the case of a
hit, it will be an RCU operation, but as you state, cache hits on the
inode hash are the rare case, not the common case.
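To make the shape of that concrete, here's a rough sketch of the
"partially RCU" lookup/insert pattern being discussed (illustrative
only - made-up function name, simplified key comparison, no refcounting
or I_NEW handling, not the actual fs/inode.c code):

#include <linux/fs.h>
#include <linux/rculist.h>
#include <linux/spinlock.h>

/*
 * Sketch: the lookup that misses is lockless, but every miss then
 * falls back to the single global hash lock to do the insert, so on
 * a cold-cache workload all the concurrency still funnels through
 * that one spinlock.
 */
static struct inode *sketch_lookup_insert(struct hlist_head *head,
					  spinlock_t *hash_lock,
					  struct inode *new,
					  unsigned long ino)
{
	struct inode *inode;

	/* Hit path: lockless walk of the hash chain under RCU. */
	rcu_read_lock();
	hlist_for_each_entry_rcu(inode, head, i_hash) {
		if (inode->i_ino == ino) {
			rcu_read_unlock();
			return inode;		/* no hash lock taken */
		}
	}
	rcu_read_unlock();

	/*
	 * Miss path (the common case for cold-cache workloads): take
	 * the global lock, re-check for a racing insert, then insert.
	 */
	spin_lock(hash_lock);
	hlist_for_each_entry(inode, head, i_hash) {
		if (inode->i_ino == ino) {
			spin_unlock(hash_lock);
			return inode;
		}
	}
	hlist_add_head_rcu(&new->i_hash, head);
	spin_unlock(hash_lock);
	return new;
}

The hash_bl approach Linus describes above replaces that single
hash_lock with a per-bucket bit spinlock (hlist_bl_head), so concurrent
inserts and removals only contend when they land in the same bucket.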
I say "sort-of hacked around" because the RCU lookup only avoids the
inode hash lock on the initial lookup that misses. The insert is still
done under the inode_hash_lock, so inode_hash_lock contention still
definitely exists in this path. All the above commit did was move the
catastrophic cacheline contention breakdown point higher than the test
workload that was being run could hit.

In review of that commit, I suggested that the inode hash_bl conversion
patches be finished off instead, but the response was "I don't have
time for that". And so here we are again....

-Dave.

--
Dave Chinner
david@xxxxxxxxxxxxx