On 2024/4/16 15:03, David Hildenbrand wrote:
On 16.04.24 07:26, Barry Song wrote:
On Tue, Apr 16, 2024 at 4:58 PM Kefeng Wang
<wangkefeng.wang@xxxxxxxxxx> wrote:
On 2024/4/16 12:50, Kefeng Wang wrote:
On 2024/4/16 8:21, Barry Song wrote:
On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang
<wangkefeng.wang@xxxxxxxxxx> wrote:
On 2024/4/15 18:52, David Hildenbrand wrote:
On 15.04.24 10:59, Kefeng Wang wrote:
On 2024/4/15 16:18, Barry Song wrote:
On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
<wangkefeng.wang@xxxxxxxxxx> wrote:
Both file pages and anonymous pages support large folios, so high-order
pages (other than PMD_ORDER ones) will also be allocated frequently, which
can increase zone lock contention. Allowing high-order pages on PCP lists
can reduce that contention, but as commit 44042b449872 ("mm/page_alloc:
allow high-order pages to be stored on the per-cpu lists") pointed out, it
may not win in all scenarios. Add a new sysfs control to enable or disable
storing specified high-order pages on the PCP lists; orders in
(PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) are not stored on the PCP lists by
default.
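
To illustrate the idea, here is a minimal sketch of how such a per-order
gate could hook into the existing pcp_allowed_order() check in
mm/page_alloc.c; the pcp_order_enabled bitmask and its sysfs wiring are
hypothetical, not necessarily what this patch does:

static unsigned long pcp_order_enabled;	/* hypothetical, set via sysfs */

static inline bool pcp_allowed_order(unsigned int order)
{
	/* orders up to PAGE_ALLOC_COSTLY_ORDER always use the PCP lists */
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return true;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	/* higher orders only when enabled through the new sysfs knob */
	if (order <= PMD_ORDER && test_bit(order, &pcp_order_enabled))
		return true;
#endif
	return false;
}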
This is precisely something Baolin and I have discussed and intended to
implement[1], but unfortunately we haven't had the time to do so.
Indeed, the same thing. Recently we have been working on unixbench/lmbench
optimization. I tested multi-size THP for anonymous memory by hard-coding
PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but not
in all cases and not very stably, so I re-implemented it to follow the
user's requirement and enable it dynamically.
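
For reference, that hard-coded experiment amounts to a one-line change to
the definition in include/linux/mmzone.h (a sketch of the idea only, not
the change posted at [1]):

/* include/linux/mmzone.h: experiment only, upstream default is 3 */
#define PAGE_ALLOC_COSTLY_ORDER 4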
I'm wondering, though, if this is really a suitable candidate for a
sysctl toggle. Can anybody really come up with an educated guess for
these values?
Not sure this is suitable as a sysctl, but mTHP anon is already toggled
through sysfs; we could trace __alloc_pages() and gather per-order
statistics to decide which high orders to enable on the PCP lists (see
the example below).
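
For instance, one quick way to build that per-order histogram, assuming
bpftrace and the kmem:mm_page_alloc tracepoint are available on the
system:

bpftrace -e 'tracepoint:kmem:mm_page_alloc { @orders[args->order] = count(); }'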
Especially reading "Benchmarks Score shows a little
improvoment(0.28%)"
and "it may not win in all the scenes", to me it mostly sounds like
"minimal impact" -- so who cares?
Even though the lock contention is eliminated, the performance improvement
is very limited (maybe even fluctuation). It is not a good testcase to
show the improvement, it only shows the zone-lock issue; we need to find a
better testcase, maybe some test on Android (heavy use of 64K, no PMD
THP), or maybe LKP could give some help? I will try to find another
testcase to show the benefit.
Hi Kefeng,

I wonder if you will see some major improvement for 64KiB mTHP with the
microbench below, which I wrote just now -- for example in perf results
and the time to finish the program.
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#define DATA_SIZE (2UL * 1024 * 1024)

int main(int argc, char **argv)
{
	/* fork 5 times: 32 processes concurrently allocate and free mTHP */
	fork(); fork(); fork(); fork(); fork();

	for (int i = 0; i < 100000; i++) {
		void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
				  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
		if (addr == MAP_FAILED) {
			perror("fail to mmap");
			return -1;
		}
		/* touch every page so 64KiB mTHPs actually get allocated */
		memset(addr, 0x11, DATA_SIZE);
		munmap(addr, DATA_SIZE);
	}
	return 0;
}
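
To reproduce, assuming the snippet above is saved as mthp_bench.c (name is
only illustrative):

gcc -O2 mthp_bench.c -o mthp_bench
time ./mthp_bench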
Rebased on next-20240415:

echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled

Compare

echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled

with

echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
1) PCP disabled

            1       2       3       4       5   average
real   200.41  202.18  203.16  201.54  200.91   201.640
user     6.49    6.21    6.25    6.31    6.35     6.322
sys    193.30  195.39  196.30  194.65  194.01   194.730

2) PCP enabled

            1       2       3       4       5   average
real   198.25  199.26  195.51  199.28  189.12   196.284  (-2.66%)
user     6.21    6.02    6.02    6.28    6.21     6.148  (-2.75%)
sys    191.46  192.64  188.96  192.47  182.39   189.584  (-2.64%)
For the above test, the time is reduced by about 2.x%.
This is an improvement over the earlier 0.28%, but it's still below my
expectations.
Yes, it's noise. Maybe we need a system with more cores/sockets? But it
does feel a bit like we're coming up with the problem after we already
have a solution; I'd have thought some existing benchmark could highlight
whether this is worth it.
On 96 cores with 129 threads, a quick test using pcp_enabled to control
hugepages-2048kB shows no big improvement for 2M:
PCP enabled

            1       2       3   average
real   221.80  225.60  221.50    222.97
user    14.91   14.91   17.05     15.62
sys    141.91  159.25  156.23    152.46

PCP disabled

            1       2       3   average
real   230.76  231.39  228.39    230.18
user    15.47   15.88   17.50     16.28
sys    159.07  162.32  159.09    160.16
Judging from commit 44042b449872 ("mm/page_alloc: allow high-order pages
to be stored on the per-cpu lists") itself, the improvement there also
seems limited:
netperf-udp
                            5.13.0-rc2            5.13.0-rc2
                      mm-pcpburst-v3r4  mm-pcphighorder-v1r7
Hmean  send-64        261.46 (  0.00%)      266.30 *  1.85%*
Hmean  send-128       516.35 (  0.00%)      536.78 *  3.96%*
Hmean  send-256      1014.13 (  0.00%)     1034.63 *  2.02%*
Hmean  send-1024     3907.65 (  0.00%)     4046.11 *  3.54%*
Hmean  send-2048     7492.93 (  0.00%)     7754.85 *  3.50%*
Hmean  send-3312    11410.04 (  0.00%)    11772.32 *  3.18%*
Hmean  send-4096    13521.95 (  0.00%)    13912.34 *  2.89%*
Hmean  send-8192    21660.50 (  0.00%)    22730.72 *  4.94%*
Hmean  send-16384   31902.32 (  0.00%)    32637.50 *  2.30%*