On Wed, Jan 15, 2025 at 06:38:00AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 14, 2025 at 07:55:30AM +1100, Dave Chinner wrote:
> > The idea behind the initial cacheline layout is that it should stay
> > read-only as much as possible so that cache lookups can walk the
> > buffer without causing shared/exclusive cacheline contention with
> > existing buffer users.
> >
> > This was really important back in the days when the cache used a
> > rb-tree (i.e. the rbnode pointers dominated lookup profiles), and
> > it's still likely important with the rhashtable on large caches.
> >
> > i.e. Putting a spinlock in that first cache line will result in
> > lookups and shrinker walks having cacheline contention as the
> > shrinker needs exclusive access for the spin lock, whilst the lookup
> > walk needs shared access for the b_rhash_head, b_rhash_key and
> > b_length fields in _xfs_buf_obj_cmp() for lockless lookup
> > concurrency.
>
> Hmm, this contradicts the comment on top of xfs_buf, which explicitly
> wants the lock and count in the semaphore to stay in the first cache
> line.

The semaphore, yes, because locking the buffer is something the fast
path lookup does, and buffer locks are rarely contended.

Shrinkers, OTOH, work on the b_lru_ref count and use the b_lock spin
lock right up until the point that the buffer is going to be
reclaimed. These are not shared with the cache lines accessed by
lookups.

Indeed, it looks to me like the historic placing of b_lru_ref on the
first cacheline is now incorrect, because it is no longer modified
during lookup - we moved that to the lookup callers a long time ago.

i.e. shrinker reclaim shouldn't touch the first cacheline until it is
going to reclaim the buffer. A racing lookup at this point is also
very rare, so the fact it modifies the first cacheline of the buffer
is fine - it's going to need that exclusive to remove it from the
cache, anyway.

IOWs, the current separation largely keeps the lookup fast path and
shrinker reclaim operating on different cachelines in the same buffer
object, and hence they don't interfere with each other.

However, the change to use the b_lock and a non-atomic hold count
means that every time a shrinker scans a buffer - even before looking
at the lru ref count - it will pull the first cache line exclusive
due to the unconditional spin lock attempt it now makes.

When we are under tight memory pressure, only the frequently
referenced buffers will stay in memory (hence lookup hits them), and
they will be scanned by reclaim just as frequently as they are
accessed by the filesystem to keep them referenced and on the LRUs...

> These, similar to the count that already is in the cacheline
> and the newly moved lock (which would still keep the semaphore partial
> layout) are modified for the uncontended lookup there. Note that
> since the comment was written b_sema actually moved entirely into
> the first cache line, and this patch keeps it there, nicely aligning
> b_lru_ref on my x86_64 no-debug config.

The comment was written back in the days of the rbtree-based index,
where all we could fit on the first cacheline was the rbnode, the
lookup-critical fields (daddr, length, flags), the buffer data offset
(long gone) and the part of the semaphore structure involved in
locking the semaphore...

While the code may not exactly match the comment anymore, the comment
is actually still valid and correct and we should be fixing the code
to match the comment, not making the situation worse...
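As an illustration of the separation being described - this is a
rough, hypothetical sketch only, not the actual struct xfs_buf
definition; the field names follow the discussion above and the exact
grouping and types are illustrative:

#include <linux/atomic.h>
#include <linux/cache.h>
#include <linux/list.h>
#include <linux/rhashtable-types.h>
#include <linux/semaphore.h>
#include <linux/spinlock.h>
#include <linux/types.h>

/*
 * Illustrative layout sketch, not the real struct xfs_buf.
 */
struct buf_layout_sketch {
	/*
	 * Lookup fast path: read-mostly, so concurrent cache walks can
	 * keep this cacheline in the shared state.
	 */
	struct rhash_head	b_rhash_head;	/* rhashtable linkage */
	u64			b_rhash_key;	/* block number, the lookup key */
	unsigned int		b_length;	/* compared on lookup */
	unsigned int		b_flags;
	struct semaphore	b_sema;		/* buffer lock, rarely contended */

	/*
	 * Reclaim/LRU state on its own cacheline: shrinker walks write
	 * these fields, so keep them off the lookup line.
	 */
	spinlock_t		b_lock ____cacheline_aligned;
	atomic_t		b_lru_ref;	/* LRU reference count */
	struct list_head	b_lru;		/* LRU list linkage */
};

The ____cacheline_aligned on b_lock is just one way of expressing the
split - the point is only that shrinker-side writes land on a
different cacheline to the lookup-critical fields.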
> Now I'm usually pretty bad about these cacheline micro-optimizations
> and I'm talking to the person who wrote that comment here, so that
> rationale might not make sense, but then the comment doesn't either.
>
> I'm kinda tempted to just stick to the rationale there for now and then
> let someone smarter than me optimize the layout for the new world order.

I'd just leave b_lock where it is for now - if it is now going to be
contended between lookup and reclaim, we want it isolated to a
cacheline that minimises contention with other lookup related data....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx