Back when per-vma locks were introduced, vm_lock was moved out of
vm_area_struct in [1] because of the performance regression caused by
false cacheline sharing. Recent investigation [2] revealed that the
regression is limited to a rather old Broadwell microarchitecture and
even there it can be mitigated by disabling adjacent cacheline
prefetching, see [3].

This patchset moves vm_lock back into vm_area_struct, aligning it at
the cacheline boundary and changing the vm_area_struct slab cache to be
cacheline-aligned as well. This causes VMA memory consumption to grow
from 160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:

    slabinfo before:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vma_lock         ...     40   102    1 : ...
     vm_area_struct   ...    160    51    2 : ...

    slabinfo after moving vm_lock:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vm_area_struct   ...    256    32    2 : ...

Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64
pages, which is 5.5MB per 100000 VMAs. To minimize memory overhead,
the vm_lock implementation is changed from using rw_semaphore (40
bytes) to an atomic (8 bytes) and several vm_area_struct members are
moved into the last cacheline, resulting in a less fragmented
structure:

    struct vm_area_struct {
            union {
                    struct {
                            long unsigned int vm_start;      /*     0     8 */
                            long unsigned int vm_end;        /*     8     8 */
                    };                                       /*     0    16 */
                    struct callback_head vm_rcu;             /*     0    16 */
            } __attribute__((__aligned__(8)));               /*     0    16 */
            struct mm_struct *         vm_mm;                /*    16     8 */
            pgprot_t                   vm_page_prot;         /*    24     8 */
            union {
                    const vm_flags_t   vm_flags;             /*    32     8 */
                    vm_flags_t         __vm_flags;           /*    32     8 */
            };                                               /*    32     8 */
            bool                       detached;             /*    40     1 */

            /* XXX 3 bytes hole, try to pack */

            unsigned int               vm_lock_seq;          /*    44     4 */
            struct list_head           anon_vma_chain;       /*    48    16 */
            /* --- cacheline 1 boundary (64 bytes) --- */
            struct anon_vma *          anon_vma;             /*    64     8 */
            const struct vm_operations_struct  * vm_ops;     /*    72     8 */
            long unsigned int          vm_pgoff;             /*    80     8 */
            struct file *              vm_file;              /*    88     8 */
            void *                     vm_private_data;      /*    96     8 */
            atomic_long_t              swap_readahead_info;  /*   104     8 */
            struct mempolicy *         vm_policy;            /*   112     8 */

            /* XXX 8 bytes hole, try to pack */

            /* --- cacheline 2 boundary (128 bytes) --- */
            struct vma_lock            vm_lock __attribute__((__aligned__(64))); /*   128     4 */

            /* XXX 4 bytes hole, try to pack */

            struct {
                    struct rb_node     rb __attribute__((__aligned__(8))); /*   136    24 */
                    long unsigned int  rb_subtree_last;      /*   160     8 */
            } __attribute__((__aligned__(8))) shared;        /*   136    32 */
            struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   168     0 */

            /* size: 192, cachelines: 3, members: 17 */
            /* sum members: 153, holes: 3, sum holes: 15 */
            /* padding: 24 */
            /* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
    } __attribute__((__aligned__(64)));

Memory consumption per 1000 VMAs becomes 48 pages, saving 2 pages
compared to the 50 pages in the baseline:

    slabinfo after vm_area_struct changes:
     <name>           ... <objsize> <objperslab> <pagesperslab> : ...
     vm_area_struct   ...    192    42    2 : ...

Performance measurements using the pft test on x86 do not show a
considerable difference; on a Pixel 6 running Android the changes
result in a 3-5% improvement in faults per second.
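To make the two changes above easier to picture, here is a rough C
sketch. It is illustrative only and not lifted from the patches: the
condensed struct layout, the ____cacheline_aligned_in_smp placement,
the exact kmem_cache_create() flags, the VMA_LOCK_WRITER constant and
the vma_lock_read_trylock()/vma_lock_read_unlock() helpers are all
assumptions made for this example.

    #include <linux/atomic.h>
    #include <linux/cache.h>
    #include <linux/slab.h>

    /* Condensed stand-in for the real definitions in <linux/mm_types.h>. */
    struct vma_lock {
            atomic_t count;         /* low bits: reader count, one high bit: writer */
    };

    struct vm_area_struct {
            unsigned long vm_start;         /* hot, mostly-read fields ... */
            unsigned long vm_end;
            struct mm_struct *vm_mm;
            /* ... */

            /*
             * Giving the lock its own cacheline keeps lock traffic from
             * dirtying the cacheline that holds the hot fields above.
             */
            struct vma_lock vm_lock ____cacheline_aligned_in_smp;

            /* ... lesser used members fill the rest of the last cacheline ... */
    };

    static struct kmem_cache *vm_area_cachep;

    /* A hardware-cacheline-aligned slab cache so the struct alignment is honoured. */
    static void vma_cache_init(void)
    {
            vm_area_cachep = kmem_cache_create("vm_area_struct",
                                    sizeof(struct vm_area_struct),
                                    __alignof__(struct vm_area_struct),
                                    SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT,
                                    NULL);
    }

    /*
     * One way a single atomic_t can stand in for the rw_semaphore: readers
     * are counted in the low bits and a writer advertises itself via a flag
     * bit. Only the reader fast path is shown; blocking, wakeups and the
     * writer side are elided, and the series may implement this differently.
     */
    #define VMA_LOCK_WRITER 0x40000000      /* hypothetical writer-held flag */

    static inline bool vma_lock_read_trylock(struct vma_lock *lock)
    {
            int cnt = atomic_read(&lock->count);

            /* Take a shared reference unless a writer holds the lock. */
            while (!(cnt & VMA_LOCK_WRITER)) {
                    /* On failure cnt is refreshed and the writer bit re-checked. */
                    if (atomic_try_cmpxchg_acquire(&lock->count, &cnt, cnt + 1))
                            return true;
            }
            return false;
    }

    static inline void vma_lock_read_unlock(struct vma_lock *lock)
    {
            /* Release ordering pairs with the acquire in the trylock above. */
            atomic_dec_return_release(&lock->count);
    }

A lock packed this way needs only a single word instead of a 40-byte
rw_semaphore, which is what leaves room for the 192-byte, 3-cacheline
layout shown above.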
[1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@xxxxxxxxxx/
[2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
[3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@xxxxxxxxxxxxxx/

Suren Baghdasaryan (4):
  mm: introduce vma_start_read_locked{_nested} helpers
  mm: move per-vma lock into vm_area_struct
  mm: replace rw_semaphore with atomic_t in vma_lock
  mm: move lesser used vma_area_struct members into the last cacheline

 include/linux/mm.h        | 163 +++++++++++++++++++++++++++++++++++---
 include/linux/mm_types.h  |  59 +++++++++-----
 include/linux/mmap_lock.h |   3 +
 kernel/fork.c             |  50 ++----------
 mm/init-mm.c              |   2 +
 mm/userfaultfd.c          |  14 ++--
 6 files changed, 205 insertions(+), 86 deletions(-)

base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f
-- 
2.47.0.277.g8800431eea-goog