Re: [PATCH rfc 0/3] mm: allow more high-order pages stored on PCP lists

On 16.04.24 07:26, Barry Song wrote:
On Tue, Apr 16, 2024 at 4:58 PM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:



On 2024/4/16 12:50, Kefeng Wang wrote:


On 2024/4/16 8:21, Barry Song wrote:
On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:



On 2024/4/15 18:52, David Hildenbrand wrote:
On 15.04.24 10:59, Kefeng Wang wrote:


On 2024/4/15 16:18, Barry Song wrote:
On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:

Both file pages and anonymous pages support large folios, so high-order
pages other than PMD_ORDER will also be allocated frequently, which can
increase zone lock contention. Allowing more high-order pages on the PCP
lists could reduce contention on the big zone lock, but as commit
44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the
per-cpu lists") pointed out, it may not win in all scenarios. Add a new
sysfs control to enable or disable storing specified high-order pages on
the PCP lists; orders in (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not
stored on the PCP lists by default.
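
A minimal user-space sketch of the gating idea described above (not the
actual patch; the PAGE_ALLOC_COSTLY_ORDER/PMD_ORDER values and the
pcp_high_order_enabled bitmask are assumptions for illustration, while the
real knob in this series is the per-size pcp_enabled sysfs file):

#include <stdbool.h>
#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER 3
#define PMD_ORDER               9	/* 2 MiB with 4 KiB base pages */

/* bit n set => order-n pages may be stored on the PCP lists */
static unsigned long pcp_high_order_enabled;

static bool pcp_allowed_order(unsigned int order)
{
	/* costly orders and PMD_ORDER are already PCP-backed since 44042b449872 */
	if (order <= PAGE_ALLOC_COSTLY_ORDER || order == PMD_ORDER)
		return true;
	return pcp_high_order_enabled & (1UL << order);
}

int main(void)
{
	/* e.g. enabling 64 KiB mTHP on PCP would set bit 4 */
	pcp_high_order_enabled |= 1UL << 4;

	for (unsigned int order = 0; order <= PMD_ORDER; order++)
		printf("order %u -> %s\n", order,
		       pcp_allowed_order(order) ? "PCP" : "zone freelist");
	return 0;
}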

This is precisely something Baolin and I have discussed and intended
to implement[1],
but unfortunately, we haven't had the time to do so.

Indeed, the same thing. Recently we have been working on unixbench/lmbench
optimization. I tested multi-size THP for anonymous memory by hard-coding
PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but not
for all cases and not very stably, so I re-implemented it according to the
user requirement so it can be enabled dynamically.

I'm wondering, though, if this is really a suitable candidate for a
sysctl toggle. Can anybody really come up with an educated guess for
these values?

Not sure this is suitable as a sysctl, but mTHP anon is enabled via sysfs;
we could trace __alloc_pages() and gather per-order statistics to decide
which high orders to enable on the PCP lists.
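
For example, the per-order allocation profile could be collected from the
existing kmem:mm_page_alloc tracepoint with a hist trigger; a rough
user-space sketch (assuming tracefs mounted at /sys/kernel/tracing,
CONFIG_HIST_TRIGGERS, and root privileges) might be:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define EVT "/sys/kernel/tracing/events/kmem/mm_page_alloc/"

static void write_str(const char *path, const char *s)
{
	FILE *f = fopen(path, "w");
	if (!f) { perror(path); exit(1); }
	fputs(s, f);
	fclose(f);
}

int main(void)
{
	/* bucket page allocations by order for ten seconds */
	write_str(EVT "trigger", "hist:keys=order\n");
	sleep(10);

	/* dump the per-order counts */
	FILE *f = fopen(EVT "hist", "r");
	if (!f) { perror(EVT "hist"); return 1; }
	char line[256];
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);

	/* remove the trigger again */
	write_str(EVT "trigger", "!hist:keys=order\n");
	return 0;
}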


Especially reading "Benchmarks Score shows a little improvement (0.28%)"
and "it may not win in all the scenes", to me it mostly sounds like
"minimal impact" -- so who cares?

Even though the lock contention is eliminated, there is very limited
performance improvement (maybe even just fluctuation). It is not a good
testcase to show the improvement; it just shows the zone-lock issue. We
need to find a better testcase, maybe some test on Android (heavy use of
64K, no PMD THP), or maybe LKP could give some help?

I will try to find another testcase to show the benefit.

Hi Kefeng,

I wonder if you will see some major improvement on 64KiB mTHP using the
microbenchmark below, which I wrote just now, for example in perf and in
the time it takes to finish the program:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define DATA_SIZE (2UL * 1024 * 1024)

int main(int argc, char **argv)
{
	/* make 32 concurrent alloc and free of mTHP */
	fork(); fork(); fork(); fork(); fork();

	for (int i = 0; i < 100000; i++) {
		void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
				  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
		if (addr == MAP_FAILED) {
			perror("fail to mmap");
			return -1;
		}
		memset(addr, 0x11, DATA_SIZE);
		munmap(addr, DATA_SIZE);
	}

	return 0;
}


Rebased on next-20240415,

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled

Compare with
    echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
    echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled


1) PCP disabled
        1         2         3         4         5         average
real    200.41    202.18    203.16    201.54    200.91    201.64
user    6.49      6.21      6.25      6.31      6.35      6.322
sys     193.3     195.39    196.3     194.65    194.01    194.73

2) PCP enabled
real    198.25    199.26    195.51    199.28    189.12    196.284    -2.66%
user    6.21      6.02      6.02      6.28      6.21      6.148      -2.75%
sys     191.46    192.64    188.96    192.47    182.39    189.584    -2.64%

For the above test, the time is reduced by about 2.x%.

This is an improvement over the earlier 0.28%, but it's still below my expectations.

Yes, it's noise. Maybe we need a system with more cores/sockets? But it does feel a bit like we're trying to come up with the problem after we have a solution; I'd have thought some existing benchmark could highlight whether it is worth it.

--
Cheers,

David / dhildenb




