Re: [PATCH 0/4] move per-vma lock into vm_area_struct

Suren Baghdasaryan <surenb@xxxxxxxxxx> · Mon, 11 Nov 2024 19:27:56 -0800



On Mon, Nov 11, 2024 at 6:48 PM 'Liam R. Howlett' via kernel-team
<kernel-team@xxxxxxxxxxx> wrote:
>
> * Suren Baghdasaryan <surenb@xxxxxxxxxx> [241111 16:41]:
> > On Mon, Nov 11, 2024 at 12:55 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
> > >
> > > Back when per-vma locks were introduced, vm_lock was moved out of
> > > vm_area_struct in [1] because of the performance regression caused by
> > > false cacheline sharing. Recent investigation [2] revealed that the
> > > regression is limited to a rather old Broadwell microarchitecture and
> > > even there it can be mitigated by disabling adjacent cacheline
> > > prefetching, see [3].
> > > This patchset moves vm_lock back into vm_area_struct, aligning it at the
> > > cacheline boundary and changing the cache to be cache-aligned as well.
> > > This causes VMA memory consumption to grow from 160 (vm_area_struct) + 40
> > > (vm_lock) bytes to 256 bytes:
> > >
> > >     slabinfo before:
> > >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> > >      vma_lock         ...     40  102    1 : ...
> > >      vm_area_struct   ...    160   51    2 : ...
> > >
> > >     slabinfo after moving vm_lock:
> > >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> > >      vm_area_struct   ...    256   32    2 : ...
> > >
> > > Aggregate VMA memory consumption per 1000 VMAs grows from 50 to 64 pages,
> > > which is 5.5MB per 100000 VMAs.
> > > To minimize memory overhead, vm_lock implementation is changed from
> > > using rw_semaphore (40 bytes) to an atomic (8 bytes) and several
> > > vm_area_struct members are moved into the last cacheline, resulting
> > > in a less fragmented structure:
>
> Wait a second, this is taking 40B down to 8B, but the alignment of the
> vma will surely absorb that 32B difference?  The struct sum is 153B
> according to what you have below so we won't go over 192B.  What am I
> missing?

Take a look at the last patch in the series, "[PATCH 4/4] mm:
move lesser used vma_area_struct members into the last cacheline". It
moves some struct members from the earlier cachelines into cacheline #4,
where vm_lock resides.
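
To illustrate the general effect, here is a toy userspace sketch. The
member sizes and all mock_* names are made up for illustration and are
not the real vm_area_struct layout; it only assumes a 4-byte lock that
is aligned to a 64-byte cacheline (GCC/Clang attribute syntax, C11):

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof, size_t */
#include <stdint.h>

#define MOCK_CACHELINE 64

/* Hypothetical stand-in for a 4-byte atomic lock; not the kernel's type. */
struct mock_lock {
	int32_t cnt;
};

/*
 * Layout A: lock placed last. 152 bytes of members precede it, so the
 * 64-byte alignment pushes it to offset 192 and the struct grows to 256.
 */
struct mock_vma_lock_last {
	uint8_t hot[120];	/* stand-in for frequently used members */
	uint8_t cold[32];	/* stand-in for lesser used members */
	struct mock_lock lock __attribute__((__aligned__(MOCK_CACHELINE)));
} __attribute__((__aligned__(MOCK_CACHELINE)));

/*
 * Layout B: the lesser used members follow the lock and share its
 * cacheline, so nothing spills into a fourth cacheline.
 */
struct mock_vma_packed {
	uint8_t hot[120];
	struct mock_lock lock __attribute__((__aligned__(MOCK_CACHELINE)));
	uint8_t cold[32];
} __attribute__((__aligned__(MOCK_CACHELINE)));

static_assert(offsetof(struct mock_vma_lock_last, lock) == 192,
	      "layout A: lock starts the fourth cacheline");
static_assert(sizeof(struct mock_vma_lock_last) == 256,
	      "layout A: four cachelines");
static_assert(offsetof(struct mock_vma_packed, lock) == 128,
	      "layout B: lock starts the third cacheline");
static_assert(sizeof(struct mock_vma_packed) == 192,
	      "layout B: three cachelines");

/* Bytes saved per object by letting other members share the lock's cacheline. */
static inline size_t mock_bytes_saved(void)
{
	return sizeof(struct mock_vma_lock_last) -
	       sizeof(struct mock_vma_packed);
}
```

With these made-up sizes, shrinking the lock alone does not change the
rounded-up object size; what changes it is which members sit in the same
cacheline as the lock.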
>
> > >
> > > struct vm_area_struct {
> > >         union {
> > >                 struct {
> > >                         long unsigned int vm_start;      /*     0     8 */
> > >                         long unsigned int vm_end;        /*     8     8 */
> > >                 };                                       /*     0    16 */
> > >                 struct callback_head vm_rcu ;            /*     0    16 */
> > >         } __attribute__((__aligned__(8)));               /*     0    16 */
> > >         struct mm_struct *         vm_mm;                /*    16     8 */
> > >         pgprot_t                   vm_page_prot;         /*    24     8 */
> > >         union {
> > >                 const vm_flags_t   vm_flags;             /*    32     8 */
> > >                 vm_flags_t         __vm_flags;           /*    32     8 */
> > >         };                                               /*    32     8 */
> > >         bool                       detached;             /*    40     1 */
> > >
> > >         /* XXX 3 bytes hole, try to pack */
> > >
> > >         unsigned int               vm_lock_seq;          /*    44     4 */
> > >         struct list_head           anon_vma_chain;       /*    48    16 */
> > >         /* --- cacheline 1 boundary (64 bytes) --- */
> > >         struct anon_vma *          anon_vma;             /*    64     8 */
> > >         const struct vm_operations_struct  * vm_ops;     /*    72     8 */
> > >         long unsigned int          vm_pgoff;             /*    80     8 */
> > >         struct file *              vm_file;              /*    88     8 */
> > >         void *                     vm_private_data;      /*    96     8 */
> > >         atomic_long_t              swap_readahead_info;  /*   104     8 */
> > >         struct mempolicy *         vm_policy;            /*   112     8 */
> > >
> > >         /* XXX 8 bytes hole, try to pack */
> > >
> > >         /* --- cacheline 2 boundary (128 bytes) --- */
> > >         struct vma_lock       vm_lock (__aligned__(64)); /*   128     4 */
> > >
> > >         /* XXX 4 bytes hole, try to pack */
> > >
> > >         struct {
> > >                 struct rb_node     rb (__aligned__(8));  /*   136    24 */
> > >                 long unsigned int  rb_subtree_last;      /*   160     8 */
> > >         } __attribute__((__aligned__(8))) shared;        /*   136    32 */
> > >         struct vm_userfaultfd_ctx  vm_userfaultfd_ctx;   /*   168     0 */
> > >
> > >         /* size: 192, cachelines: 3, members: 17 */
> > >         /* sum members: 153, holes: 3, sum holes: 15 */
> > >         /* padding: 24 */
> > >         /* forced alignments: 3, forced holes: 2, sum forced holes: 12 */
> > > } __attribute__((__aligned__(64)));
> > >
> > > Memory consumption per 1000 VMAs becomes 48 pages, saving 2 pages compared
> > > to the 50 pages in the baseline:
> > >
> > >     slabinfo after vm_area_struct changes:
> > >      <name>           ... <objsize> <objperslab> <pagesperslab> : ...
> > >      vm_area_struct   ...    192   42    2 : ...
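
For anyone who wants to double-check the page counts, a small userspace
sketch of the arithmetic. It assumes slabs are packed as densely as
possible and that a partially filled slab still costs all of its pages;
slab_pages() and check_cover_letter_numbers() are hypothetical helpers,
not kernel functions:

```c
#include <assert.h>

/* Pages needed for nr_objs objects, counting a partial slab as a whole one. */
static unsigned long slab_pages(unsigned long nr_objs,
				unsigned long objs_per_slab,
				unsigned long pages_per_slab)
{
	unsigned long nr_slabs = (nr_objs + objs_per_slab - 1) / objs_per_slab;

	return nr_slabs * pages_per_slab;
}

/* Re-derive the per-1000-VMA page counts from the quoted slabinfo lines. */
void check_cover_letter_numbers(void)
{
	/* Baseline: vm_area_struct (51 objs, 2 pages) + vma_lock (102 objs, 1 page). */
	assert(slab_pages(1000, 51, 2) + slab_pages(1000, 102, 1) == 50);

	/* vm_lock embedded, 256-byte objects (32 objs, 2 pages). */
	assert(slab_pages(1000, 32, 2) == 64);

	/* After the vm_area_struct changes, 192-byte objects (42 objs, 2 pages). */
	assert(slab_pages(1000, 42, 2) == 48);
}
```
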
> > >
> > > Performance measurements using the pft test on x86 show no considerable
> > > difference; on a Pixel 6 running Android the change yields a 3-5%
> > > improvement in faults per second.
> > >
> > > [1] https://lore.kernel.org/all/20230227173632.3292573-34-surenb@xxxxxxxxxx/
> > > [2] https://lore.kernel.org/all/ZsQyI%2F087V34JoIt@xsang-OptiPlex-9020/
> > > [3] https://lore.kernel.org/all/CAJuCfpEisU8Lfe96AYJDZ+OM4NoPmnw9bP53cT_kbfP_pR+-2g@xxxxxxxxxxxxxx/
> >
> > And of course I forgot to update Lorenzo's new locking documentation :/
> > Will add that in the next version.
> >
> > >
> > > Suren Baghdasaryan (4):
> > >   mm: introduce vma_start_read_locked{_nested} helpers
> > >   mm: move per-vma lock into vm_area_struct
> > >   mm: replace rw_semaphore with atomic_t in vma_lock
> > >   mm: move lesser used vma_area_struct members into the last cacheline
> > >
> > >  include/linux/mm.h        | 163 +++++++++++++++++++++++++++++++++++---
> > >  include/linux/mm_types.h  |  59 +++++++++-----
> > >  include/linux/mmap_lock.h |   3 +
> > >  kernel/fork.c             |  50 ++----------
> > >  mm/init-mm.c              |   2 +
> > >  mm/userfaultfd.c          |  14 ++--
> > >  6 files changed, 205 insertions(+), 86 deletions(-)
> > >
> > >
> > > base-commit: 931086f2a88086319afb57cd3925607e8cda0a9f
> > > --
> > > 2.47.0.277.g8800431eea-goog
> > >