Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists

On 2024/4/16 15:03, David Hildenbrand wrote:
On 16.04.24 07:26, Barry Song wrote:
On Tue, Apr 16, 2024 at 4:58 PM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:



On 2024/4/16 12:50, Kefeng Wang wrote:


On 2024/4/16 8:21, Barry Song wrote:
On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang
<wangkefeng.wang@xxxxxxxxxx> wrote:



On 2024/4/15 18:52, David Hildenbrand wrote:
On 15.04.24 10:59, Kefeng Wang wrote:


On 2024/4/15 16:18, Barry Song wrote:
On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
<wangkefeng.wang@xxxxxxxxxx> wrote:

Both file pages and anonymous pages support large folios, so high-order
pages other than PMD_ORDER will also be allocated frequently, which can
increase zone lock contention. Allowing high-order pages on the PCP
lists can reduce that zone lock contention, but as commit 44042b449872
("mm/page_alloc: allow high-order pages to be stored on the per-cpu
lists") pointed out, it may not be a win in all scenarios. Add a new
sysfs control to enable or disable storing specific high-order pages on
the PCP lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not
stored on the PCP lists by default.
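
Not part of this series, but as a rough, self-contained userspace sketch
of the described default policy; the pcp_enabled_orders mask and the
opt-in of order 4 below are hypothetical illustrations:

#include <stdbool.h>
#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER 3
#define PMD_ORDER               9       /* 2M with 4K base pages */

/* hypothetical per-order opt-in mask, filled from the new pcp_enabled knobs */
static unsigned long pcp_enabled_orders;

/* default policy as described: costly orders below PMD_ORDER need opt-in */
static bool pcp_allowed(unsigned int order)
{
        if (order <= PAGE_ALLOC_COSTLY_ORDER || order == PMD_ORDER)
                return true;
        return pcp_enabled_orders & (1UL << order);
}

int main(void)
{
        pcp_enabled_orders |= 1UL << 4;         /* e.g. opt in 64K mTHP (order 4) */

        for (unsigned int order = 0; order <= PMD_ORDER; order++)
                printf("order %u -> %s\n", order,
                       pcp_allowed(order) ? "PCP list" : "zone freelist");
        return 0;
}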

This is precisely something Baolin and I have discussed and intended
to implement[1],
but unfortunately, we haven't had the time to do so.

Indeed, the same thing. Recently we have been working on
unixbench/lmbench optimization. I tested multi-size THP for anonymous
memory by hard-coding PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows
some improvement, but not for all cases and not very stably, so I
re-implemented it according to the user requirement so that it can be
enabled dynamically.
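
For reference, that hard-coded experiment amounts to a one-line change of
the macro defined in include/linux/mmzone.h, roughly:

/* include/linux/mmzone.h: hard-coded from the default 3 for the experiment */
#define PAGE_ALLOC_COSTLY_ORDER 4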

I'm wondering, though, if this is really a suitable candidate for a
sysctl toggle. Can anybody really come up with an educated guess for
these values?

Not sure a sysctl is suitable here, but anonymous mTHP is already
enabled per order through sysfs; we could trace __alloc_pages() and
gather per-order allocation statistics to decide which high orders to
enable on the PCP lists.
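
A rough userspace sketch of collecting such order statistics via the
existing kmem:mm_page_alloc tracepoint (run as root; the tracefs path and
the fixed 100000-event sample below are assumptions for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ORDER_STAT 11

int main(void)
{
        unsigned long counts[MAX_ORDER_STAT + 1] = { 0 };
        char line[4096];
        FILE *pipe;

        /* enable the page allocation tracepoint */
        if (system("echo 1 > /sys/kernel/tracing/events/kmem/mm_page_alloc/enable"))
                return 1;

        pipe = fopen("/sys/kernel/tracing/trace_pipe", "r");
        if (!pipe) {
                perror("trace_pipe");
                return 1;
        }

        /* sample a fixed number of allocation events and tally their orders */
        for (unsigned long seen = 0; seen < 100000 && fgets(line, sizeof(line), pipe); ) {
                char *p = strstr(line, "order=");
                if (!p)
                        continue;
                int order = atoi(p + strlen("order="));
                if (order >= 0 && order <= MAX_ORDER_STAT)
                        counts[order]++;
                seen++;
        }

        for (int order = 0; order <= MAX_ORDER_STAT; order++)
                printf("order %2d: %lu\n", order, counts[order]);

        system("echo 0 > /sys/kernel/tracing/events/kmem/mm_page_alloc/enable");
        fclose(pipe);
        return 0;
}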


Especially reading "Benchmarks score shows a little improvement (0.28%)"
and "it may not be a win in all scenarios", to me it mostly sounds like
"minimal impact" -- so who cares?

Even though the lock contention is eliminated, the performance
improvement is very limited (maybe even within fluctuation). It is not a
good testcase to show the improvement, it only shows the zone-lock
issue; we need to find a better testcase, maybe some test on Android
(heavy use of 64K, no PMD THP), or perhaps LKP could give some help?

I will try to find another testcase to show the benefit.

Hi Kefeng,

I wonder if you will see some major improvement on 64KiB mTHP using the
microbench below, which I wrote just now -- for example, in perf results
and the time to finish the program.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define DATA_SIZE (2UL * 1024 * 1024)

int main(int argc, char **argv)
{
        /* fork 5 times: 32 processes doing concurrent alloc and free of mTHP */
        fork(); fork(); fork(); fork(); fork();

        for (int i = 0; i < 100000; i++) {
                void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
                                  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
                if (addr == MAP_FAILED) {
                        perror("fail to mmap");
                        return -1;
                }
                /* touch the whole mapping so mTHP is actually allocated */
                memset(addr, 0x11, DATA_SIZE);
                munmap(addr, DATA_SIZE);
        }

        return 0;
}


Rebased on next-20240415,

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled

Compare with
    echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
    echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled


1) PCP disabled
        1       2       3       4       5       average
real    200.41  202.18  203.16  201.54  200.91  201.64
user    6.49    6.21    6.25    6.31    6.35    6.322
sys     193.3   195.39  196.3   194.65  194.01  194.73

2) PCP enabled
        1       2       3       4       5       average
real    198.25  199.26  195.51  199.28  189.12  196.284  -2.66%
user    6.21    6.02    6.02    6.28    6.21    6.148    -2.75%
sys     191.46  192.64  188.96  192.47  182.39  189.584  -2.64%

For the above test, real/user/sys time is reduced by roughly 2.6-2.8%.

That is an improvement over the 0.28%, but it's still below my expectations.

Yes, it's noise. Maybe we need a system with more cores/sockets? But it
does feel a bit like we're coming up with the problem after we already
have a solution; I'd have thought some existing benchmark could
highlight whether this is worth it.


On a 96-core machine, with 129 threads, a quick test using pcp_enabled
to control hugepages-2048kB shows no big improvement for 2M:

PCP enabled
        1       2       3       average
real    221.8   225.6   221.5   222.9666667
user    14.91   14.91   17.05   15.62333333
sys     141.91  159.25  156.23  152.4633333

PCP disabled
        1       2       3       average
real    230.76  231.39  228.39  230.18
user    15.47   15.88   17.5    16.28333333
sys     159.07  162.32  159.09  160.16


Looking back at commit 44042b449872 ("mm/page_alloc: allow high-order
pages to be stored on the per-cpu lists"), the improvement there also
seems limited:

 netperf-udp
                                  5.13.0-rc2             5.13.0-rc2
                            mm-pcpburst-v3r4   mm-pcphighorder-v1r7
 Hmean     send-64         261.46 (   0.00%)      266.30 *   1.85%*
 Hmean     send-128        516.35 (   0.00%)      536.78 *   3.96%*
 Hmean     send-256       1014.13 (   0.00%)     1034.63 *   2.02%*
 Hmean     send-1024      3907.65 (   0.00%)     4046.11 *   3.54%*
 Hmean     send-2048      7492.93 (   0.00%)     7754.85 *   3.50%*
 Hmean     send-3312     11410.04 (   0.00%)    11772.32 *   3.18%*
 Hmean     send-4096     13521.95 (   0.00%)    13912.34 *   2.89%*
 Hmean     send-8192     21660.50 (   0.00%)    22730.72 *   4.94%*
 Hmean     send-16384    31902.32 (   0.00%)    32637.50 *   2.30%*






