On Tue, Apr 16, 2024 at 4:58 PM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:
>
>
> On 2024/4/16 12:50, Kefeng Wang wrote:
> >
> > On 2024/4/16 8:21, Barry Song wrote:
> >> On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang
> >> <wangkefeng.wang@xxxxxxxxxx> wrote:
> >>>
> >>> On 2024/4/15 18:52, David Hildenbrand wrote:
> >>>> On 15.04.24 10:59, Kefeng Wang wrote:
> >>>>>
> >>>>> On 2024/4/15 16:18, Barry Song wrote:
> >>>>>> On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
> >>>>>> <wangkefeng.wang@xxxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>> Both file pages and anonymous pages support large folios, so
> >>>>>>> high-order pages other than PMD_ORDER will also be allocated
> >>>>>>> frequently, which could increase zone lock contention. Allowing
> >>>>>>> high-order pages on the PCP lists could reduce that contention,
> >>>>>>> but as commit 44042b449872 ("mm/page_alloc: allow high-order
> >>>>>>> pages to be stored on the per-cpu lists") pointed out, it may
> >>>>>>> not win in all the scenes. Add a new sysfs control to enable or
> >>>>>>> disable storing specified high-order pages on the PCP lists;
> >>>>>>> the orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be
> >>>>>>> stored on the PCP lists by default.
> >>>>>>
> >>>>>> This is precisely something Baolin and I have discussed and intended
> >>>>>> to implement[1], but unfortunately, we haven't had the time to do so.
> >>>>>
> >>>>> Indeed, the same thing. Recently we have been working on
> >>>>> unixbench/lmbench optimization. I tested multi-size THP for
> >>>>> anonymous memory by hard-coding PAGE_ALLOC_COSTLY_ORDER from 3 to
> >>>>> 4[1]; it shows some improvement, but not for all cases and not very
> >>>>> stably, so I re-implemented it according to the user requirement,
> >>>>> making it possible to enable it dynamically.
> >>>>
> >>>> I'm wondering, though, if this is really a suitable candidate for a
> >>>> sysctl toggle. Can anybody really come up with an educated guess for
> >>>> these values?
> >>>
> >>> Not sure a sysctl is suitable, but mTHP anon is already toggled via
> >>> sysfs; we could trace __alloc_pages() and gather per-order statistics
> >>> to decide which high orders to enable on the PCP lists.
> >>>
> >>>> Especially reading "Benchmarks Score shows a little improvement
> >>>> (0.28%)" and "it may not win in all the scenes", to me it mostly
> >>>> sounds like "minimal impact" -- so who cares?
> >>>
> >>> Even though the lock contention is eliminated, the performance
> >>> improvement is very limited (it may even fluctuate); it is not a good
> >>> testcase to show the improvement, it only shows the zone-lock issue.
> >>> We need to find a better testcase, maybe some test on Android (heavy
> >>> use of 64K, no PMD THP), or maybe LKP could give some help?
> >>>
> >>> I will try to find another testcase to show the benefit.
> >>
> >> Hi Kefeng,
> >>
> >> I wonder if you will see some major improvements on mTHP 64KiB using
> >> the below microbench I wrote just now, for example perf and the time
> >> to finish the program.
> >>
> >> #include <stdio.h>
> >> #include <string.h>
> >> #include <unistd.h>
> >> #include <sys/mman.h>
> >>
> >> #define DATA_SIZE (2UL * 1024 * 1024)
> >>
> >> int main(int argc, char **argv)
> >> {
> >>         /* make 32 concurrent alloc and free of mTHP */
> >>         fork(); fork(); fork(); fork(); fork();
> >>
> >>         for (int i = 0; i < 100000; i++) {
> >>                 void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
> >>                                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> >>                 if (addr == MAP_FAILED) {
> >>                         perror("fail to malloc");
> >>                         return -1;
> >>                 }
> >>                 memset(addr, 0x11, DATA_SIZE);
> >>                 munmap(addr, DATA_SIZE);
> >>         }
> >>
> >>         return 0;
> >> }
> >>
> >
> > Rebased on next-20240415,
> >
> > echo never > /sys/kernel/mm/transparent_hugepage/enabled
> > echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> >
> > Compare with
> > echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
> > echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
> >
> > 1) PCP disabled
> >            1       2       3       4       5   average
> > real  200.41  202.18  203.16  201.54  200.91   201.64
> > user    6.49    6.21    6.25    6.31    6.35     6.322
> > sys   193.30  195.39  196.30  194.65  194.01   194.73
> >
> > 2) PCP enabled
> > real  198.25  199.26  195.51  199.28  189.12   196.284  (-2.66%)
> > user    6.21    6.02    6.02    6.28    6.21     6.148   (-2.75%)
> > sys   191.46  192.64  188.96  192.47  182.39   189.584  (-2.64%)
> >
> > For the above test, time is reduced by 2.x%.

This is an improvement from 0.28%, but it's still below my expectations.
I suspect it's due to mTHP reducing the frequency of allocations and
frees. Running the same test on order-0 might yield much better results.

I suppose that as the order increases, PCP exhibits fewer improvements
since both allocation and release activities decrease. Conversely, we
also employ PCP for THP (2MB). Do we have any prior data demonstrating
that such large-size allocations benefit from PCP?

> > And re-test page_fault1 (anon) from will-it-scale
> >
> > 1) PCP enabled
> > tasks  processes  processes_idle  threads  threads_idle     linear
> >     0          0          100           0        100             0
> >     1    1416915           98.95  1418128         98.95    1418128
> >    20    5327312           79.22  3821312         94.36   28362560
> >    40    9437184           58.58  4463657         94.55   56725120
> >    60    8120003           38.16  4736716         94.61   85087680
> >    80    7356508           18.29  4847824         94.46  113450240
> >   100    7256185            1.48  4870096         94.61  141812800
> >
> > 2) PCP disabled
> > tasks  processes  processes_idle  threads  threads_idle     linear
> >     0          0          100           0        100             0
> >     1    1365398           98.95  1354502         98.95    1365398
> >    20    5174918           79.22  3722368         94.65   27307960
> >    40    9094265           58.58  4427267         94.82   54615920
> >    60    8021606           38.18  4572896         94.93   81923880
> >    80    7497318           18.20  4637062         94.76  109231840
> >   100    6819897            1.47  4654521         94.63  136539800
> >
> > ------------------------------------
> > 1) vs 2): PCP enabled improves by 3.86%
> >
> > 3) PCP re-enabled
> > tasks  processes  processes_idle  threads  threads_idle     linear
> >     0          0          100           0        100             0
> >     1    1419036           98.96  1428403         98.95    1428403
> >    20    5356092           79.23  3851849         94.41   28568060
> >    40    9437184           58.58  4512918         94.63   57136120
> >    60    8252342           38.16  4659552         94.68   85704180
> >    80    7414899           18.26  4790576         94.77  114272240
> >   100    7062902            1.46  4759030         94.64  142840300
> >
> > 4) PCP re-disabled
> > tasks  processes  processes_idle  threads  threads_idle     linear
> >     0          0          100           0        100             0
> >     1    1352649           98.95  1354806         98.95    1354806
> >    20    5172924           79.22  3719292         94.64   27096120
> >    40    9174505           58.59  4310649         94.93   54192240
> >    60    8021606           38.17  4552960         94.81   81288360
> >    80    7497318           18.18  4671638         94.81  108384480
> >   100    6823926            1.47  4725955         94.64  135480600
> >
> > ------------------------------------
> > 3) vs 4): PCP enabled improves by 5.43%
> >
> > Average: 4.645%
> >
> >>>
> >>>> How much is the cost vs. benefit of just having one sane system
> >>>> configuration?
> >>>
> >>> For arm64 with 4K pages, that is five more high orders (4~8) and so
> >>> five more pcplists; and for high-order pages we assume most of them
> >>> are movable, but that may not hold, so enabling it by default may
> >>> cause more fragmentation, see 5d0a661d808f ("mm/page_alloc: use only
> >>> one PCP list for THP-sized allocations").

Thanks
Barry