Re: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 5 Feb 2024 14:22:18 +1100

On Fri, Feb 02, 2024 at 07:32:36PM +0000, Matthew Wilcox wrote:
> On Fri, Feb 02, 2024 at 03:03:51PM +0000, Matthew Wilcox wrote:
> > On Fri, Feb 02, 2024 at 05:34:07PM +0800, JonasZhou-oc wrote:
> > > In the struct address_space, there is a 32-byte gap between i_mmap
> > > and i_mmap_rwsem. Due to the alignment of struct address_space
> > > variables to 8 bytes, in certain situations, i_mmap and
> > > i_mmap_rwsem may end up in the same CACHE line.
> > > 
> > > While running Unixbench/execl, we observe high false sharing issues
> > > when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
> > > after i_private_list, ensuring a 64-byte gap between i_mmap and
> > > i_mmap_rwsem.
> > 
> > I'm confused.  i_mmap_rwsem protects i_mmap.  Usually you want the lock
> > and the thing it's protecting in the same cacheline.

You are correct in the case that there is never any significant
contention on the lock, i.e. gaining the lock also pulls in the
cacheline for the object it protects and so avoids an extra memory
fetch.

However....

> > Why is that not
> > the case here?
> 
> We actually had this seven months ago:
> 
> https://lore.kernel.org/all/20230628105624.150352-1-lipeng.zhu@xxxxxxxxx/
> 
> Unfortunately, no argumentation was forthcoming about *why* this was
> the right approach.  All we got was a different patch and an assertion
> that it still improved performance.
> 
> We need to understand what's going on!  Please don't do the same thing
> as the other submitter and just assert that it does.

Intuition tells me that what the OP is seeing is the opposite case
to above: there is significant contention on the lock. In that case,
optimal "contention performance" comes from separating the lock and
the objects it protects into different cachelines.

The reason for this is that if the lock and the objects it protects
are on the same cacheline, lock contention affects both the lock and
the objects being manipulated inside the critical section. i.e.
attempts to grab the lock pull the cacheline away from the CPU that
holds the lock, and accesses to the objects protected by the lock
then have to pull the cacheline back.

In other words, the cost of one extra memory fetch from an
uncontended cacheline is less than the cost of repeatedly re-fetching
a contended cacheline inside the critical section.
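
To make the layout argument concrete, here is a rough userspace sketch
in C11. It is not kernel code: fake_lock, fake_tree, both layout
structs and the helper functions are hypothetical names made up for
illustration, and the 64-byte line size is an assumption. It contrasts
a lock that sits next to its data with one that is forced onto its own
line, and counts how many 8-byte aligned placements leave two fields a
given gap apart starting in the same line.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64 /* assumed line size; real hardware may differ */

/* Hypothetical stand-ins, not the kernel's rw_semaphore or rb_root. */
struct fake_lock { long count; long owner; };
struct fake_tree { void *root; void *leftmost; };

/* Lock next to its data: taking the lock also fetches the data. */
struct colocated_layout {
	struct fake_lock lock;
	struct fake_tree tree;
};

/* Lock and data each start on their own cache line. */
struct split_layout {
	alignas(CACHELINE) struct fake_lock lock;
	alignas(CACHELINE) struct fake_tree tree;
};

/* Compile-time checks on where the data lands relative to the lock. */
static_assert(offsetof(struct colocated_layout, tree) < CACHELINE,
	      "colocated: tree starts within the lock's line");
static_assert(offsetof(struct split_layout, tree) % CACHELINE == 0,
	      "split: tree starts on a line boundary");
static_assert(offsetof(struct split_layout, tree) >= CACHELINE,
	      "split: tree is not in the lock's line");

/* Index of the cache line that holds the byte at addr. */
uintptr_t line_of(uintptr_t addr)
{
	return addr / CACHELINE;
}

/* Do the fields at base + off_a and base + off_b start in the same line? */
int same_line(uintptr_t base, size_t off_a, size_t off_b)
{
	return line_of(base + off_a) == line_of(base + off_b);
}

/*
 * Of the eight possible 8-byte aligned base offsets within one line,
 * count those where two fields 'gap' bytes apart start in the same line.
 */
int count_shared(size_t gap)
{
	int n = 0;

	for (uintptr_t base = 0; base < CACHELINE; base += 8)
		n += same_line(base, 0, gap);
	return n;
}
```

Under those assumptions, a 32-byte gap leaves the two fields in the
same line for half of the possible 8-byte aligned placements, whereas a
64-byte gap never does. That only shows the fields can be separated; it
says nothing about whether the lock itself is still contended.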

I consider optimisation attempts like this the canary in the mine:
it won't be long before these or similar workloads report
catastrophic lock contention on the lock in question.  Moving items
in the structure is equivalent to re-arranging the deck chairs
whilst the ship sinks - we might keep our heads above water a
little longer, but the ship is still sinking and we're still going
to have to fix the leak sooner rather than later...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx