Re: [PATCH 0/4] move per-vma lock into vm_area_struct

Suren Baghdasaryan <surenb@xxxxxxxxxx> · Mon, 11 Nov 2024 13:41:08 -0800

On Mon, Nov 11, 2024 at 12:55 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> Back when per-vma locks were introduces, vm_lock was moved out of
> vm_area_struct in [1] because of the performance regression caused by
> false cacheline sharing. Recent investigation [2] revealed that the
> regressions is limited to a rather old Broadwell microarchitecture and
> even there it can be mitigated by disabling adjacent cacheline
> prefetching, see [3].
> This patchset moves vm_lock back into vm_area_struct, aligning it at the
> cacheline boundary and changing the cache to be cache-aligned as well.
> This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> (vm_lock) bytes to 256 bytes:
>
>     slabinfo before:
>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>      vma_lock         ...     40  102    1 : ...
>      vm_area_struct   ...    160   51    2 : ...
>
>     slabinfo after moving vm_lock:
>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>      vm_area_struct   ...    256   32    2 : ...
>
> Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> which is 5.5MB per 100000 VMAs.
> To minimize memory overhead, vm_lock implementation is changed from
> using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> vm_area_struct members are moved into the last cacheline, resulting
> in a less fragmented structure:
>
> struct vm_area_struct {
>         union {
>                 struct {
>                         long unsigned int vm_start;      /*     0     8 */
>                         long unsigned int vm_end;        /*     8     8 */
>                 };                                       /*     0    16 */
>                 struct callback_head vm_rcu ;            /*     0    16 */
>         } __attribute__((__aligned__(8)));               /*     0    16 */
>         struct mm_struct *         vm_mm;                /*    16     8 */
>         pgprot_t                   vm_page_prot;         /*    24     8 */
>         union {
>                 const vm_flags_t   vm_flags;             /*    32     8 */
>                 vm_flags_t         __vm_flags;           /*    32     8 */
>         };                                               /*    32     8 */
>         bool                       detached;             /*    40     1 */
>
>         /* XXX 3 bytes hole, try to pack */
>
>         unsigned int               vm_lock_seq;          /*    44     4 */
>         struct list_head           anon_vma_chain;       /*    48    16 */
>         /* --- cacheline 1 boundary (64 bytes) --- */
>         struct anon_vma *          anon_vma;             /*    64     8 */
>         const struct vm_operations_struct  * vm_ops;     /*    72     8 */
>         long unsigned int          vm_pgoff;             /*    80     8 */
>         struct file *              vm_file;              /*    88     8 */
>         void *                     vm_private_data;      /*    96     8 */
>         atomic_long_t              swap_readahead_info;  /*   104     8 */
>         struct mempolicy *         vm_policy;            /*   112     8 */
>
>         /* XXX 8 bytes hole, try to pack */
>
>         /* --- cacheline 2 boundary (128 bytes) --- */
>         struct vma_lock       vm_lock (__aligned__(64)); /*   128     4 */
>
>         /* XXX 4 bytes hole, try to pack */
>
>         struct {
>                 struct rb_node     rb (__aligned__(8));  /*   136    24 */
>                 long unsigned int  rb_subtree_last;      /*   160     8 */
>         } __attribute__((__aligned__(8))) shared;        /*   136    32 */
>         struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   168     0 */
>
>         /* size: 192, cachelines: 3, members: 17 */
>         /* sum members: 153, holes: 3, sum holes: 15 */
>         /* padding: 24 */
>         /* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
> } __attribute__((__aligned__(64)));
>
> Memory consumption per 1000 VMAs becomes 48 pages, saving 2 pages compared
> to the 50 pages in the baseline:
>
>     slabinfo after vm_area_struct changes:
>      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
>      vm_area_struct   ...    192   42    2 : ...
>
> Performance measurements using pft test on x86 do not show considerable
> difference, on Pixel 6 running Android it results in 3-5% improvement in
> faults per second.
>
> [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@xxxxxxxxxx/
> [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@xxxxxxxxxxxxxx/

And of course I forgot to update Lorenzo's new locking documentation :/
Will add that in the next version.

>
> Suren Baghdasaryan (4):
>   mm: introduce vma_start_read_locked{_nested} helpers
>   mm: move per-vma lock into vm_area_struct
>   mm: replace rw_semaphore with atomic_t in vma_lock
>   mm: move lesser used vma_area_struct members into the last cacheline
>
>  include/linux/mm.h        | 163 +++++++++++++++++++++++++++++++++++---
>  include/linux/mm_types.h  |  59 +++++++++-----
>  include/linux/mmap_lock.h |   3 +
>  kernel/fork.c             |  50 ++----------
>  mm/init-mm.c              |   2 +
>  mm/userfaultfd.c          |  14 ++--
>  6 files changed, 205 insertions(+), 86 deletions(-)
>
>
> base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f
> --
> 2.47.0.277.g8800431eea-goog
>