Re: [PATCH] fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.

On Mon, Feb 05, 2024 at 02:22:29PM +0800, JonasZhou wrote:
> > On Fri, Feb 02, 2024 at 03:03:51PM +0000, Matthew Wilcox wrote:
> > > On Fri, Feb 02, 2024 at 05:34:07PM +0800, JonasZhou-oc wrote:
> > > > In the struct address_space, there is a 32-byte gap between i_mmap
> > > > and i_mmap_rwsem. Due to the alignment of struct address_space
> > > > variables to 8 bytes, in certain situations, i_mmap and
> > > > i_mmap_rwsem may end up in the same CACHE line.
> > > > 
> > > > While running Unixbench/execl, we observe high false sharing issues
> > > > when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
> > > > after i_private_list, ensuring a 64-byte gap between i_mmap and
> > > > i_mmap_rwsem.
> > > 
> > > I'm confused.  i_mmap_rwsem protects i_mmap.  Usually you want the lock
> > > and the thing it's protecting in the same cacheline.  Why is that not
> > > the case here?
> >
> > We actually had this seven months ago:
> >
> > https://lore.kernel.org/all/20230628105624.150352-1-lipeng.zhu@xxxxxxxxx/
> >
> > Unfortunately, no argumentation was forthcoming about *why* this was
> > the right approach.  All we got was a different patch and an assertion
> > that it still improved performance.
> >
> > We need to understand what's going on!  Please don't do the same thing
> > as the other submitter and just assert that it does.
> 
> When running UnixBench/execl, each execl process repeatedly performs 
> i_mmap_lock_write -> vma_interval_tree_remove/insert -> 
> i_mmap_unlock_write. As indicated below, when i_mmap and i_mmap_rwsem 
> are on the same cache line, there are more HITM events.

As I expected, your test is exercising the contended case rather
than the single-threaded, uncontended case. As such, your patch is
simply optimising the structure layout for the contended case at the
expense of an extra cacheline miss in the uncontended case.

I'm not an mm expert, so I don't know which case we should optimise
for.

However, the existing code is not obviously wrong; your
micro-benchmark simply exercises the pathological worst case for the
layout choices made for this structure. The first decision that
needs to be made is whether the contended case is worth optimising
at all. Then people can decide whether hacking minor layout
optimisations into the code is the right approach, or whether
reworking the locking and/or algorithm to avoid the contention
altogether is a better direction...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
