On Fri, Feb 02, 2024 at 07:32:36PM +0000, Matthew Wilcox wrote:
> On Fri, Feb 02, 2024 at 03:03:51PM +0000, Matthew Wilcox wrote:
> > On Fri, Feb 02, 2024 at 05:34:07PM +0800, JonasZhou-oc wrote:
> > > In the struct address_space, there is a 32-byte gap between i_mmap
> > > and i_mmap_rwsem. Due to the alignment of struct address_space
> > > variables to 8 bytes, in certain situations, i_mmap and
> > > i_mmap_rwsem may end up in the same CACHE line.
> > >
> > > While running Unixbench/execl, we observe high false sharing issues
> > > when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
> > > after i_private_list, ensuring a 64-byte gap between i_mmap and
> > > i_mmap_rwsem.
> >
> > I'm confused. i_mmap_rwsem protects i_mmap. Usually you want the lock
> > and the thing it's protecting in the same cacheline.

You are correct in the case that there is never any significant
contention on the lock, i.e. gaining the lock will also pull in the
cacheline for the object it protects and so avoid an extra memory
fetch.

However....

> > Why is that not
> > the case here?
>
> We actually had this seven months ago:
>
> https://lore.kernel.org/all/20230628105624.150352-1-lipeng.zhu@xxxxxxxxx/
>
> Unfortunately, no argumentation was forthcoming about *why* this was
> the right approach. All we got was a different patch and an assertion
> that it still improved performance.
>
> We need to understand what's going on! Please don't do the same thing
> as the other submitter and just assert that it does.

Intuition tells me that what the OP is seeing is the opposite case
to above: there is significant contention on the lock. In that case,
optimal "contention performance" comes from separating the lock and
the objects it protects into different cachelines.

The reason for this is that if the lock and the objects it protects
are on the same cacheline, lock contention affects both the lock and
the objects being manipulated inside the critical section. i.e.
attempts to grab the lock pull the cacheline away from the CPU that
holds the lock, and accesses to the objects protected by the lock
then have to pull the cacheline back.

i.e. the cost of the extra memory fetch from an uncontended
cacheline is less than the cost of having to repeatedly fetch the
memory inside a critical section on a contended cacheline.

I consider optimisation attempts like this the canary in the mine:
it won't be long before these or similar workloads report
catastrophic lock contention on the lock in question. Moving items
in the structure is equivalent to re-arranging the deck chairs
whilst the ship sinks - we might keep our heads above water a little
longer, but the ship is still sinking and we're still going to have
to fix the leak sooner rather than later...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
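
For illustration only, the two layouts discussed above can be sketched in userspace C11. This is a hedged sketch, not the kernel's struct address_space: the 64-byte cacheline size is an assumption, and `fake_lock`, `fake_tree`, `packed_layout`, `split_layout` and the helper functions are hypothetical names invented for the example. It only shows where the fields land relative to cacheline boundaries; it does not measure contention.

```c
#include <assert.h>   /* assert, for callers that want to check the layout */
#include <stdalign.h> /* alignas */
#include <stddef.h>   /* offsetof, size_t */

#define CACHELINE_BYTES 64 /* assumption: 64-byte cachelines */

/* Hypothetical stand-ins for a lock and the object it protects. */
struct fake_lock { long count; };
struct fake_tree { void *root; };

/*
 * Uncontended case: the lock and the protected object share a
 * cacheline, so taking the lock also pulls in the object.
 */
struct packed_layout {
	alignas(CACHELINE_BYTES) struct fake_lock lock;
	struct fake_tree tree;
};

/*
 * Contended case: the protected object starts on its own cacheline,
 * so waiters hammering on the lock do not steal the line the lock
 * holder is working on.
 */
struct split_layout {
	alignas(CACHELINE_BYTES) struct fake_lock lock;
	alignas(CACHELINE_BYTES) struct fake_tree tree;
};

/* Cacheline index of a field offset, relative to the struct start. */
static size_t cacheline_of(size_t offset)
{
	return offset / CACHELINE_BYTES;
}

/* Returns 1 if lock and tree sit in the same cacheline, else 0. */
static int packed_shares_line(void)
{
	return cacheline_of(offsetof(struct packed_layout, lock)) ==
	       cacheline_of(offsetof(struct packed_layout, tree));
}

static int split_shares_line(void)
{
	return cacheline_of(offsetof(struct split_layout, lock)) ==
	       cacheline_of(offsetof(struct split_layout, tree));
}
```

The split layout costs an extra cacheline per structure and an extra memory fetch in the uncontended case, which is the trade-off described above; it does nothing to reduce the contention on the lock itself.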