On Mon, Nov 11, 2024 at 6:35 PM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
>
> * Suren Baghdasaryan <surenb@xxxxxxxxxx> [241111 15:55]:
> > Back when per-vma locks were introduced, vm_lock was moved out of
> > vm_area_struct in [1] because of the performance regression caused by
> > false cacheline sharing. Recent investigation [2] revealed that the
> > regression is limited to a rather old Broadwell microarchitecture and
> > even there it can be mitigated by disabling adjacent cacheline
> > prefetching, see [3].
> > This patchset moves vm_lock back into vm_area_struct, aligning it at
> > the cacheline boundary and changing the slab cache to be
> > cacheline-aligned as well. This causes VMA memory consumption to grow
> > from 160 (vm_area_struct) + 40 (vm_lock) bytes to 256 bytes:
> >
> >     slabinfo before:
> >      <name>            ... <objsize> <objperslab> <pagesperslab> : ...
> >      vma_lock          ...        40         102              1  : ...
> >      vm_area_struct    ...       160          51              2  : ...
> >
> >     slabinfo after moving vm_lock:
> >      <name>            ... <objsize> <objperslab> <pagesperslab> : ...
> >      vm_area_struct    ...       256          32              2  : ...
> >
> > Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64
> > pages, which is 5.5MB per 100000 VMAs.
> > To minimize memory overhead, the vm_lock implementation is changed
> > from using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> > vm_area_struct members are moved into the last cacheline, resulting
> > in a less fragmented structure:
> >
> > struct vm_area_struct {
> >         union {
> >                 struct {
> >                         long unsigned int vm_start;      /*     0     8 */
> >                         long unsigned int vm_end;        /*     8     8 */
> >                 };                                       /*     0    16 */
> >                 struct callback_head vm_rcu;             /*     0    16 */
> >         } __attribute__((__aligned__(8)));               /*     0    16 */
> >         struct mm_struct *         vm_mm;                /*    16     8 */
> >         pgprot_t                   vm_page_prot;         /*    24     8 */
> >         union {
> >                 const vm_flags_t   vm_flags;             /*    32     8 */
> >                 vm_flags_t         __vm_flags;           /*    32     8 */
> >         };                                               /*    32     8 */
> >         bool                       detached;             /*    40     1 */
> >
> >         /* XXX 3 bytes hole, try to pack */
> >
> >         unsigned int               vm_lock_seq;          /*    44     4 */
> >         struct list_head           anon_vma_chain;       /*    48    16 */  40 + 16
> >         /* --- cacheline 1 boundary (64 bytes) --- */
> >         struct anon_vma *          anon_vma;             /*    64     8 */  56 + 8
> >         const struct vm_operations_struct * vm_ops;      /*    72     8 */
> >         long unsigned int          vm_pgoff;             /*    80     8 */
> >         struct file *              vm_file;              /*    88     8 */
> >         void *                     vm_private_data;      /*    96     8 */
> >         atomic_long_t              swap_readahead_info;  /*   104     8 */
> >         struct mempolicy *         vm_policy;            /*   112     8 */
> >
> >         /* XXX 8 bytes hole, try to pack */
> >
> >         /* --- cacheline 2 boundary (128 bytes) --- */
> >         struct vma_lock            vm_lock __attribute__((__aligned__(64))); /*   128     4 */
> >
> >         /* XXX 4 bytes hole, try to pack */
> >
> >         struct {
> >                 struct rb_node     rb __attribute__((__aligned__(8))); /*   136    24 */
> >                 long unsigned int  rb_subtree_last;      /*   160     8 */
> >         } __attribute__((__aligned__(8))) shared;        /*   136    32 */
> >         struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   168     0 */
>
> This is 8 bytes on my compile, I guess you have userfaultfd disabled?

Yeah, the results shown here are with defconfig. After I move things
around, at the end we will have space to keep everything under 3
cachelines.

>
> Considering this will end up being 256 anyway, these changes may not
> matter, but we can pack this better.
> 1. move vm_lock to after anon_vma (ends up at 64B in the end)
> 2. move vm_lock_seq to after vm_lock

Nope, for performance reasons it's important to keep vm_lock_seq in the
first cacheline. It's used very extensively when read-locking the VMA.
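To put it concretely, this is roughly what the current (pre-series)
fast path looks like, simplified and with its comments trimmed; every
locking attempt starts with the vm_lock_seq load:

static inline bool vma_start_read(struct vm_area_struct *vma)
{
	/* Bail out if the vma is (or may be) write-locked under mmap_lock. */
	if (READ_ONCE(vma->vm_lock_seq) == READ_ONCE(vma->vm_mm->mm_lock_seq))
		return false;

	if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0))
		return false;

	/* Recheck under the lock; a writer may have raced with us. */
	if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) {
		up_read(&vma->vm_lock->lock);
		return false;
	}
	return true;
}

If vm_lock_seq leaves the first cacheline, every page fault handled
under the per-vma lock pays for an extra cacheline fetch before it can
even decide whether to take the lock.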
> 4. move struct to the new 112 offset (which is 8B aligned at 112)
> 3. move detached to the end of the structure

detached also should stay in the first cacheline, otherwise we will get
a performance regression. I spent a week experimenting with what we can
and can't move :)

>
> This should limit the holes and maintain the alignments.
>
> The downside is that detached is now in the last used cacheline and
> away from what would probably be used with it, but maybe it doesn't
> matter considering prefetch.

It very much does matter :)

>
> But maybe you have other reasons for the placement?
>
> >
> >         /* size: 192, cachelines: 3, members: 17 */
> >         /* sum members: 153, holes: 3, sum holes: 15 */
> >         /* padding: 24 */
> >         /* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
> > } __attribute__((__aligned__(64)));
> >
> > Memory consumption per 1000 VMAs becomes 48 pages, saving 2 pages
> > compared to the 50 pages in the baseline:
> >
> >     slabinfo after vm_area_struct changes:
> >      <name>            ... <objsize> <objperslab> <pagesperslab> : ...
> >      vm_area_struct    ...       192          42              2  : ...
> >
> > Performance measurements using the pft test on x86 do not show a
> > considerable difference; on a Pixel 6 running Android this results
> > in a 3-5% improvement in faults per second.
> >
> > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@xxxxxxxxxx/
> > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@xxxxxxxxxxxxxx/
> >
> > Suren Baghdasaryan (4):
> >   mm: introduce vma_start_read_locked{_nested} helpers
> >   mm: move per-vma lock into vm_area_struct
> >   mm: replace rw_semaphore with atomic_t in vma_lock
> >   mm: move lesser used vma_area_struct members into the last cacheline
> >
> >  include/linux/mm.h        | 163 +++++++++++++++++++++++++++++++++++---
> >  include/linux/mm_types.h  |  59 +++++++++-----
> >  include/linux/mmap_lock.h |   3 +
> >  kernel/fork.c             |  50 ++----
> >  mm/init-mm.c              |   2 +
> >  mm/userfaultfd.c          |  14 ++--
> >  6 files changed, 205 insertions(+), 86 deletions(-)
> >
> >
> > base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f
> > --
> > 2.47.0.277.g8800431eea-goog
> >
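And a closing note for anyone skimming the cover letter and wondering
how an 8-byte atomic can stand in for the 40-byte rw_semaphore: the
general idea is to pack the reader count and a writer flag into a
single word. The snippet below is only an illustration with made-up
names, and it leaves out the writer side (which has to wait for the
reader count to drain and handle wake-ups); patch 3 has the actual
implementation:

#define VMA_LOCK_WRITER	0x40000000	/* illustrative writer-owned flag */

static inline bool vma_read_trylock_sketch(atomic_t *cnt)
{
	int old = atomic_read(cnt);

	/* Readers bump the count as long as no writer holds the lock. */
	while (!(old & VMA_LOCK_WRITER)) {
		if (atomic_try_cmpxchg_acquire(cnt, &old, old + 1))
			return true;
	}
	return false;	/* writer active: caller falls back to mmap_lock */
}

static inline void vma_read_unlock_sketch(atomic_t *cnt)
{
	atomic_fetch_dec_release(cnt);
}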