On Fri, Jun 07, 2024 at 09:51:51AM +0200, Mateusz Guzik wrote:
> On Fri, Jun 07, 2024 at 12:04:58PM +1000, Dave Chinner wrote:
> > On Thu, Jun 06, 2024 at 04:05:15PM +0200, Mateusz Guzik wrote:
> > > Instantiating a new inode normally takes the global inode hash
> > > lock twice:
> > > 1. once to check if it happens to already be present
> > > 2. once to add it to the hash
> > >
> > > The back-to-back lock/unlock pattern is known to degrade
> > > performance significantly, which is further exacerbated if the
> > > hash is heavily populated (long chains to walk, extending hold
> > > time). Arguably hash sizing and the hashing algo need to be
> > > revisited, but that's beyond the scope of this patch.
> > >
> > > A long-term fix would introduce fine-grained locking; this was
> > > attempted in [1], but that patchset was already posted several
> > > times and appears stalled.
> >
> > Why not just pick up those patches and drive them to completion?
>
> Time constraints on my end aside.
>
> From your own e-mail [1] last year, the problems are:
>
> > - A lack of recent validation against ext4, btrfs and other
> >   filesystems.
> > - the loss of lockdep coverage by moving to bit locks
> > - it breaks CONFIG_PREEMPT_RT=y because we nest other spinlocks
> >   inside the inode_hash_lock and we can't do that if we convert
> >   the inode hash to bit locks because RT makes spinlocks sleeping
> >   locks.
> > - There's been additions for lockless RCU inode hash lookups from
> >   AFS and ext4 in weird, uncommon corner cases and I have no idea
> >   how to validate they still work correctly with hash-bl. I
> >   suspect they should just go away with hash-bl, but....
>
> There's more, but these are the big ones.
>
> I did see the lockdep and preempt_rt problems were patched up in a
> later iteration ([2]).
>
> What we both agree on is that the patchset adds enough complexity
> that it needs solid justification. I assumed one was there on your
> end when you posted it.
>
> For that entire patchset I don't have one. I can however justify
> the comparatively trivial thing I posted in this thread.

I didn't suggest you pick up the entire patchset; I suggested you
pull the inode cache conversion to hash-bl from it and use that
instead of hacking RCU lookups into the inode cache.

> That aside, if I had to make the entire thing scale I would
> approach things differently, most notably in terms of locking
> granularity. Per your own statement things can be made to look
> great in microbenchmarks, but that does not necessarily mean they
> help. A lot of it is a tradeoff, and making everything per-cpu for
> this particular problem may be taking it too far.

The hash-bl conversion doesn't make anything per-cpu, so I don't
know what you are complaining about here.

> For the inode hash it may be that the old hack of having a lock
> array would do more than well enough -- say 32 locks (or some
> other number dependent on hash size), each covering a dedicated
> subset.

No, that's a mid-1990s-era hack that was done because we didn't
know any better ways to scale global algorithms back then. It's not
a scalable algorithm; it's a global algorithm that has been sharded
a few times to try to keep global-scope operations apart. The
moment you have subset collisions because of a specific workload
pattern, the scalability problem comes straight back.

We have been doing better than this for years, and it's not a
viable solution for things like core OS cache indexing.
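For concreteness, that style of sharding boils down to something
like the sketch below (hypothetical names, not code from any posted
patch):

#include <linux/spinlock.h>

/*
 * A small, fixed array of locks, each covering a dedicated subset
 * of hash chains. Sketch only -- a real implementation would size
 * the array against the hash table and initialise the locks at
 * boot.
 */
#define NR_HASH_LOCKS	32

static spinlock_t inode_hash_locks[NR_HASH_LOCKS];

static inline spinlock_t *inode_hash_lock_for(unsigned long hashval)
{
	/* Many chains share each lock, so collisions serialise here. */
	return &inode_hash_locks[hashval % NR_HASH_LOCKS];
}

Whether that helps at all is entirely at the mercy of how a given
workload's hash values spread across those 32 locks.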
We need a construct that is inherently scalable and has little cost
to that inherent scalability.

We have an inherently scalable solution already - hash-bl adds
fine-grained locking without changing locking behaviour or consuming
extra memory. And it scales lookup, insert and remove with a common
locking scheme across all operations.

Your patch, however, just converts *some* of the lookup API
operations to use RCU. It adds complexity for things like inserts,
which are going to need inode hash locking anyway if the RCU lookup
fails.

Hence your patch optimises the case where the inode is in cache but
the dentry isn't, but we'll still get massive contention on lookup
when the RCU lookup on the inode cache misses and inserts are always
going to be required. IOWs, even RCU lookups are not going to
prevent inode hash lock contention for parallel cold cache lookups.

Hence, with RCU, applications are going to see unpredictable
contention behaviour dependent on the memory footprint of the caches
at the time of the lookup. Users will have no way of predicting when
the behaviour will change, let alone have any way of mitigating it.
Unpredictable, variable behaviour is the thing we want to avoid the
most with core OS caches.

In comparison, hash-bl based inode cache lookups will have the same
behaviour for both cache hits and cache misses. There will be nearly
zero contention for both types of lookups, as the insert can be done
immediately under the same lock that is held for the lookup. Hence
performance will be predictable and largely deterministic under a
wide variety of workloads and changing memory pressure. This is how
we want the inode cache to behave.

IOWs, the hash-bl solution is superior to RCU lookups for many
reasons. RCU lookups might be fast, but they're not the best
solution to every problem that requires scalable behaviour.

> All that said, if someone(tm) wants to pick up your patchset and
> even commit it the way it is right now I'm not going to protest
> anything.
>
> I don't have time nor justification to do full work My Way(tm).

You *always* say this sort of thing when someone asks you to do
something different. If you don't agree with the reviewer's
comments, you make some statement about how you "don't care enough
to fix it properly", or that you don't have time to fix it properly,
or that you think the reviewer is asking you to "fix everything"
rather than just one line in your patch so you reject all the
reviewer's comments, or you say "if the maintainers want something"
as a way of saying "you don't have any authority, so I'm not going
to listen to you at all", or ....

> I did have time to hack up the initial rcu lookup, which imo does
> not require much to justify inclusion and does help the case I'm
> interested in.

I think it requires a lot more justification than you have given,
especially given that changing memory pressure, access patterns and
cache balance will have a great impact on whether the RCU path makes
any difference to inode hash lock contention at all.

The hash-bl conversion is a much better solution and the work is
mostly done. If you want your workload case to perform better, then
please pick up the inode hash lock conversion to hash-bl.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx