On 11/13/24 15:28, Lorenzo Stoakes wrote:
> On Tue, Nov 12, 2024 at 11:46:32AM -0800, Suren Baghdasaryan wrote:
>> Back when per-vma locks were introduced, vm_lock was moved out of
>> vm_area_struct in [1] because of the performance regression caused by
>> false cacheline sharing. Recent investigation [2] revealed that the
>> regression is limited to a rather old Broadwell microarchitecture and
>> even there it can be mitigated by disabling adjacent cacheline
>> prefetching, see [3].
>
> I don't see a motivating reason as to why we want to do this? We increase
> memory usage here which is not good, but later lock optimisation mitigates

I'd say we don't normally split logically single structures into multiple
structures just because they might pack better in multiple slabs vs a
single slab. That means more complicated management, extra pointer
dereferences etc., and if the split-away part is a lock, it complicates
things even further. So the only motivation for doing the split was that
it appeared to perform better, but that was found to be misleading.

But sure, it can be described better, and the new SLAB_TYPESAFE_BY_RCU
conversion can be included as part of the motivation - that would likely
be impossible to do with a split-away lock.

> it, but why wouldn't we just do the lock optimisations and use less memory
> overall?

If the lock is made much smaller, the packing benefit of the split might
disappear, as is the case here.

>> This patchset moves vm_lock back into vm_area_struct, aligning it at the
>> cacheline boundary and changing the cache to be cache-aligned as well.
>> This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
>> (vm_lock) bytes to 256 bytes:
>>
>> slabinfo before:
>>  <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>>  vma_lock         ...        40          102              1 : ...
>>  vm_area_struct   ...       160           51              2 : ...
>
> Pedantry, but it might be worth mentioning how much this can vary by config.
>
> For instance, on my machine:
>
> vm_area_struct    125238 138820    184   44
>
>>
>> slabinfo after moving vm_lock:
>>  <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>>  vm_area_struct   ...       256           32              2 : ...
>>
>> Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
>> which is 5.5MB per 100000 VMAs. This memory consumption growth can be
>> addressed later by optimizing the vm_lock.
>
> Yes grabbing this back is of critical importance I'd say! :)

I doubt it's that critical. We'll have to weigh that against introducing
another non-standard locking primitive.

> Functionally it looks ok to me but would like to see a stronger
> justification in the commit msg! :)
>
>>
>> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@xxxxxxxxxx/
>> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
>> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@xxxxxxxxxxxxxx/
>>
>> Signed-off-by: Suren Baghdasaryan <surenb@xxxxxxxxxx>
>> ---
>>  include/linux/mm.h       | 28 +++++++++++++----------
>>  include/linux/mm_types.h |  6 +++--
>>  kernel/fork.c            | 49 ++++------------------------------------
>>  3 files changed, 25 insertions(+), 58 deletions(-)
>>