On Wed, Jan 15, 2025 at 06:38:00AM +0100, Christoph Hellwig wrote:
> On Tue, Jan 14, 2025 at 07:55:30AM +1100, Dave Chinner wrote:
> > The idea behind the initial cacheline layout is that it should stay
> > read-only as much as possible so that cache lookups can walk the
> > buffer without causing shared/exclusive cacheline contention with
> > existing buffer users.
> >
> > This was really important back in the days when the cache used a
> > rb-tree (i.e. the rbnode pointers dominated lookup profiles), and
> > it's still likely important with the rhashtable on large caches.
> >
> > i.e. Putting a spinlock in that first cache line will result in
> > lookups and shrinker walks having cacheline contention as the
> > shrinker needs exclusive access for the spin lock, whilst the lookup
> > walk needs shared access for the b_rhash_head, b_rhash_key and
> > b_length fields in _xfs_buf_obj_cmp() for lockless lookup
> > concurrency.
>
> Hmm, this contradicts the comment on top of xfs_buf, which explicitly
> wants the lock and count in the semaphore to stay in the first cache
> line.

The semaphore, yes, because locking the buffer is something the fast
path lookup does, and buffer locks are rarely contended.

Shrinkers, OTOH, work on the b_lru_ref count and use the b_lock spin
lock right up until the point that the buffer is going to be
reclaimed. These are not shared with the cache lines accessed by
lookups.

Indeed, it looks to me like the historic placing of b_lru_ref on the
first cacheline is now incorrect, because it is no longer modified
during lookup - we moved that to the lookup callers a long time ago.

i.e. shrinker reclaim shouldn't touch the first cacheline until it is
going to reclaim the buffer. A racing lookup at this point is also
very rare, so the fact it modifies the first cacheline of the buffer
is fine - it's going to need that exclusive to remove it from the
cache, anyway.

IOWs, the current separation largely keeps the lookup fast path and
shrinker reclaim operating on different cachelines in the same buffer
object, and hence they don't interfere with each other.

However, the change to use the b_lock and a non-atomic hold count
means that every time a shrinker scans a buffer - even before looking
at the lru ref count - it will pull the first cache line exclusive
due to the unconditional spin lock attempt it now makes.

When we are under tight memory pressure, only the frequently
referenced buffers will stay in memory (hence lookup hits them), and
they will be scanned by reclaim just as frequently as they are
accessed by the filesystem to keep them referenced and on the LRUs...

> These, similar to the count that already is in the cacheline
> and the newly moved lock (which would still keep the semaphore partial
> layout) are modified for the uncontended lookup there. Note that
> since the comment was written b_sema actually moved entirely into
> the first cache line, and this patch keeps it there, nicely aligning
> b_lru_ref on my x86_64 no-debug config.

The comment was written back in the days of the rbtree-based index,
where all we could fit on the first cacheline was the rbnode, the
lookup-critical fields (daddr, length, flags), the buffer data offset
(long gone) and the part of the semaphore structure involved in
locking the semaphore...

While the code may not exactly match the comment anymore, the comment
is actually still valid and correct and we should be fixing the code
to match the comment, not making the situation worse...
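As an illustration of the separation being described - this is a
rough, hypothetical sketch only, not the actual struct xfs_buf
definition; the field names follow the discussion above and the exact
grouping and types are illustrative:

#include <linux/atomic.h>
#include <linux/cache.h>
#include <linux/list.h>
#include <linux/rhashtable-types.h>
#include <linux/semaphore.h>
#include <linux/spinlock.h>
#include <linux/types.h>

/*
 * Illustrative layout sketch, not the real struct xfs_buf.
 */
struct buf_layout_sketch {
	/*
	 * Lookup fast path: read-mostly, so concurrent cache walks can
	 * keep this cacheline in the shared state.
	 */
	struct rhash_head	b_rhash_head;	/* rhashtable linkage */
	u64			b_rhash_key;	/* block number, the lookup key */
	unsigned int		b_length;	/* compared on lookup */
	unsigned int		b_flags;
	struct semaphore	b_sema;		/* buffer lock, rarely contended */

	/*
	 * Reclaim/LRU state on its own cacheline: shrinker walks write
	 * these fields, so keep them off the lookup line.
	 */
	spinlock_t		b_lock ____cacheline_aligned;
	atomic_t		b_lru_ref;	/* LRU reference count */
	struct list_head	b_lru;		/* LRU list linkage */
};

The ____cacheline_aligned on b_lock is just one way of expressing the
split - the point is only that shrinker-side writes land on a
different cacheline to the lookup-critical fields.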
> Now I'm usually pretty bad about these cacheline micro-optimizations
> and I'm talking to the person who wrote that comment here, so that
> rationale might not make sense, but then the comment doesn't either.
>
> I'm kinda tempted to just stick to the rationale there for now and then
> let someone smarter than me optimize the layout for the new world order.

I'd just leave b_lock where it is for now - if it is now going to be
contended between lookup and reclaim, we want it isolated to a
cacheline that minimises contention with other lookup related data....

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx