Re: [linus:master] [mm] 0ba09b1733: will-it-scale.per_thread_ops -21.1% regression in mmap1 benchmark

Liam Howlett <liam.howlett@xxxxxxxxxx> · Fri, 23 Dec 2022 02:45:17 +0000

* Yin, Fengwei <fengwei.yin@xxxxxxxxx> [221221 20:19]:
> 
> 
> On 12/22/2022 12:45 AM, Yang Shi wrote:
> >> We caught two mmap1 regressions on mainline; please see the data below:
> >>
> >> 830b3c68c1fb1 Linux 6.1                                                              2085 2355 2088
> >> 76dcd734eca23 Linux 6.1-rc8                                                          2093 2082 2094 2073 2304 2088
> >> 0ba09b1733878 Revert "mm: align larger anonymous mappings on THP boundaries"         2124 2286 2086 2114 2065 2081
> >> 23393c6461422 char: tpm: Protect tpm_pm_suspend with locks                           2756 2711 2689 2696 2660 2665
> >> b7b275e60bcd5 Linux 6.1-rc7                                                          2670 2656 2720 2691 2667
> >> ...
> >> 9abf2313adc1c Linux 6.1-rc1                                                          2725 2717 2690 2691 2710
> >> 3b0e81a1cdc9a mmap: change zeroing of maple tree in __vma_adjust()                   2736 2781 2748
> >> 524e00b36e8c5 mm: remove rb tree.                                                    2747 2744 2747
> >> 0c563f1480435 proc: remove VMA rbtree use from nommu
> >> d0cf3dd47f0d5 damon: convert __damon_va_three_regions to use the VMA iterator
> >> 3499a13168da6 mm/mmap: use maple tree for unmapped_area{_topdown}
> >> 7fdbd37da5c6f mm/mmap: use the maple tree for find_vma_prev() instead of the rbtree
> >> f39af05949a42 mm: add VMA iterator
> >> d4af56c5c7c67 mm: start tracking VMAs with maple tree
> >> e15e06a839232 lib/test_maple_tree: add testing for maple tree                        4638 4628 4502
> >> 9832fb87834e2 mm/demotion: expose memory tier details via sysfs                      4625 4509 4548
> >> 4fe89d07dcc28 Linux 6.0                                                              4385 4205 4348 4228 4504
> >>
> >>
> >> The first regression was between v6.0 and v6.1-rc1. The score dropped
> >> from 4600 to 2700, and bisected to the patches switching from rb tree to
> >> maple tree. This was reported at
> >> https://lore.kernel.org/oe-lkp/202212191714.524e00b3-yujie.liu@xxxxxxxxx/
> >> Thanks for the explanation that it is an expected regression as a trade
> >> off to benefit read performance.
> >>
> >> The second regression was between v6.1-rc7 and v6.1-rc8. The score
> >> dropped from 2700 to 2100, and bisected to this "Revert "mm: align larger
> >> anonymous mappings on THP boundaries"" commit.
> > So it means "mm: align larger anonymous mappings on THP boundaries"
> > actually improved the mmap1 benchmark? But it caused regression for
> > other usecase, for example, building kernel with clang, which is a
> > regression for a real life usecase.
> Yes. The patch "mm: align larger anonymous mappings on THP boundaries"
> can improve the mmap1 benchmark.
> 

If the aligned VMAs cannot be merged, then they do not need to be split
on freeing.  This means we are just allocating a new VMA, writing it to
the tree, removing it from the tree, and freeing the VMA.  We can do
this 4600 times a second, apparently.

If the VMAs do get merged, we go through __vma_adjust() to expand a
boundary, write it to the tree, allocate a new VMA, __vma_adjust() the
VMA boundary back, insert the new VMA that covers the boundary area,
remove the new VMA from the tree, and free the VMA.  We can only do this
2700 times a second.  Note that this writes to the tree 3 times per loop
iteration, versus 2 in the other case.
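
To make the write count concrete, here is a rough tally of the two paths
described above as a small Python model.  The step names, and which step
each tree write is attributed to, are my reading of the description
rather than something lifted from the kernel source; the rates are just
the figures quoted in this thread.

```python
# Rough model of the two map/unmap paths described above.
# Each step is (description, tree_writes).  The attribution of each
# write to a particular step is an assumption made for illustration.

UNMERGED_PATH = [
    ("allocate a new VMA", 0),
    ("write the VMA to the tree", 1),
    ("remove the VMA from the tree", 1),
    ("free the VMA", 0),
]

MERGED_PATH = [
    ("__vma_adjust() expands a boundary, written to the tree", 1),
    ("allocate a new VMA", 0),
    ("__vma_adjust() the boundary back and insert the new VMA", 1),
    ("remove the new VMA from the tree", 1),
    ("free the VMA", 0),
]


def tree_writes(path):
    """Sum the tree writes over all steps of one loop iteration."""
    return sum(writes for _, writes in path)


unmerged_writes = tree_writes(UNMERGED_PATH)
merged_writes = tree_writes(MERGED_PATH)

# Figures quoted in this thread (operations per second).
UNMERGED_RATE = 4600
MERGED_RATE = 2700

# Slowdown predicted if tree writes were the only cost, versus the
# slowdown implied by the quoted rates.
write_ratio = unmerged_writes / merged_writes
rate_ratio = MERGED_RATE / UNMERGED_RATE

print(f"tree writes per iteration: {unmerged_writes} vs {merged_writes}")
print(f"ratio predicted from writes alone: {write_ratio:.2f}")
print(f"ratio from the quoted rates:       {rate_ratio:.2f}")
```

The write count alone predicts a ratio of about 0.67, while the quoted
rates give about 0.59, so the extra allocation and boundary adjustment
presumably account for some of the remainder.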

So yes, merging/splitting is more work and always has been.  We are
doing this to avoid having too many VMAs though.  There really isn't a
good reason an application would do this for any meaningful number of
iterations.
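
For reference, the access pattern in question is a tight map/unmap loop
over an anonymous private mapping.  Below is a minimal sketch of that
pattern in Python, assuming a 128 MiB mapping (which is what I
understand mmap1 to use, but check the will-it-scale source).  The
function name and defaults are hypothetical, and interpreter overhead
means the rate it returns is not comparable to the will-it-scale
numbers above.

```python
import mmap
import time

# Assumed mapping size; chosen to be well above the THP size so that
# the alignment behaviour under discussion would apply.
MEMSIZE = 128 * 1024 * 1024


def map_unmap_rate(iterations=10000, size=MEMSIZE):
    """Map and immediately unmap an anonymous private region in a loop.

    Returns the achieved iterations per second.  The memory is never
    touched, so only the VMA setup and teardown paths are exercised.
    """
    start = time.perf_counter()
    for _ in range(iterations):
        m = mmap.mmap(
            -1,
            size,
            flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
        )
        m.close()  # unmaps the region
    elapsed = time.perf_counter() - start
    return iterations / elapsed
```

Calling map_unmap_rate() on kernels with and without the alignment patch
would give a crude relative comparison at best.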

> As for the kernel build regression, it looks like it's not directly
> related to the patch "mm: align larger anonymous mappings on THP
> boundaries". It's an existing behavior made more visible by the patch.
> https://lore.kernel.org/all/a4bcddad-e56f-cedc-891a-916e86d9a02c@xxxxxxxxx/
> 

Having a snapshot of the VMA layout would help here, since the THP
boundary alignment may be changing whether the VMAs can be merged.  I
suspect they cannot be merged and the VMA space is being fragmented,
which would speed up this benchmark at the expense of having more VMAs.
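
One cheap way to get such a snapshot is to summarize /proc/<pid>/maps
with and without the patch.  A minimal sketch follows; the helper names
are hypothetical, and the 2 MiB THP size is an assumption (typical for
x86-64, but it depends on the architecture and configuration).

```python
import os

# Assumed PMD-sized THP; adjust for the architecture under test.
THP_SIZE = 2 * 1024 * 1024


def parse_maps(text):
    """Parse maps-format text into (start, end, perms, name) tuples."""
    vmas = []
    for line in text.splitlines():
        # Fields: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start_s, end_s = fields[0].split("-")
        name = fields[5].strip() if len(fields) > 5 else ""
        vmas.append((int(start_s, 16), int(end_s, 16), fields[1], name))
    return vmas


def summarize(vmas):
    """Count VMAs, unnamed (anonymous) VMAs, and THP-aligned anonymous VMAs."""
    anon = [v for v in vmas if v[3] == ""]
    aligned = [v for v in anon if v[0] % THP_SIZE == 0]
    return {
        "total": len(vmas),
        "anonymous": len(anon),
        "anon_thp_aligned": len(aligned),
    }


def snapshot(pid="self"):
    """Summarize /proc/<pid>/maps, or return None if it is unavailable."""
    path = f"/proc/{pid}/maps"
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return summarize(parse_maps(f.read()))
```

Comparing the "total" and "anon_thp_aligned" counts for the same
workload on both kernels would show whether the aligned mappings are
failing to merge.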

Thanks,
Liam