On 2024/4/16 12:50, Kefeng Wang wrote:
On 2024/4/16 8:21, Barry Song wrote:
On Tue, Apr 16, 2024 at 12:18 AM Kefeng Wang
<wangkefeng.wang@xxxxxxxxxx> wrote:
On 2024/4/15 18:52, David Hildenbrand wrote:
On 15.04.24 10:59, Kefeng Wang wrote:
On 2024/4/15 16:18, Barry Song wrote:
On Mon, Apr 15, 2024 at 8:12 PM Kefeng Wang
<wangkefeng.wang@xxxxxxxxxx> wrote:
Both file pages and anonymous pages now support large folios, so high-order
pages other than PMD_ORDER will also be allocated frequently, which could
increase zone lock contention. Allowing high-order pages on the PCP lists
could reduce that contention, but as commit 44042b449872 ("mm/page_alloc:
allow high-order pages to be stored on the per-cpu lists") pointed out, it
may not win in all scenarios. Add a new sysfs control to enable or disable
storing pages of a specified high order on the PCP lists; orders in the
range (PAGE_ALLOC_COSTLY_ORDER, PMD_ORDER) won't be stored on the PCP lists
by default.
This is precisely something Baolin and I have discussed and intended
to implement[1],
but unfortunately, we haven't had the time to do so.
Indeed, same thing. Recently we have been working on unixbench/lmbench
optimization. I tested multi-size THP for anonymous memory by hard-coding
PAGE_ALLOC_COSTLY_ORDER from 3 to 4[1]; it shows some improvement, but
not for all cases and not very stably, so I re-implemented it so that it
can be enabled dynamically according to the user's requirements.
I'm wondering, though, if this is really a suitable candidate for a
sysctl toggle. Can anybody really come up with an educated guess for
these values?
Not sure a sysctl is suitable for this, but mTHP anon is enabled in sysctl;
we could trace __alloc_pages() and collect per-order statistics to decide
which high orders to enable on PCP.
Especially reading "Benchmarks Score shows a little improvement (0.28%)"
and "it may not win in all the scenes", to me it mostly sounds like
"minimal impact" -- so who cares?
Even though the lock contention is eliminated, the performance improvement
is very limited (maybe even within fluctuation). It is not a good test case
for showing an improvement; it just shows the zone-lock issue. We need to
find a better test case, maybe some test on Android (heavy use of 64K, no
PMD THP), or maybe LKP can give some help?
I will try to find another test case to show the benefit.
Hi Kefeng,
I wonder if you will see some major improvements on mTHP 64KiB using
the microbenchmark below, which I wrote just now, for example in the perf
profile and in the time to finish the program:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define DATA_SIZE (2UL * 1024 * 1024)

int main(int argc, char **argv)
{
        /* five fork() calls give 2^5 = 32 processes doing concurrent alloc and free of mTHP */
        fork(); fork(); fork(); fork(); fork();

        for (int i = 0; i < 100000; i++) {
                void *addr = mmap(NULL, DATA_SIZE, PROT_READ | PROT_WRITE,
                                  MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
                if (addr == MAP_FAILED) {
                        perror("fail to mmap");
                        return -1;
                }
                /* touch every page so the anonymous memory is actually faulted in */
                memset(addr, 0x11, DATA_SIZE);
                munmap(addr, DATA_SIZE);
        }

        return 0;
}
Rebased on next-20240415,
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
Compare with
echo 0 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
echo 1 > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/pcp_enabled
1) PCP disabled
        1       2       3       4       5       average
real    200.41  202.18  203.16  201.54  200.91  201.64
user    6.49    6.21    6.25    6.31    6.35    6.322
sys     193.3   195.39  196.3   194.65  194.01  194.73

2) PCP enabled
        1       2       3       4       5       average  vs. disabled
real    198.25  199.26  195.51  199.28  189.12  196.284  -2.66%
user    6.21    6.02    6.02    6.28    6.21    6.148    -2.75%
sys     191.46  192.64  188.96  192.47  182.39  189.584  -2.64%

For the above test, the time is reduced by about 2.x%.
And re-test page_fault1(anon) from will-it-scale
1) PCP enabled
tasks  processes  processes_idle  threads  threads_idle  linear
0      0          100             0        100           0
1      1416915    98.95           1418128  98.95         1418128
20     5327312    79.22           3821312  94.36         28362560
40     9437184    58.58           4463657  94.55         56725120
60     8120003    38.16           4736716  94.61         85087680
80     7356508    18.29           4847824  94.46         113450240
100    7256185    1.48            4870096  94.61         141812800
2) PCP disabled
tasks  processes  processes_idle  threads  threads_idle  linear
0      0          100             0        100           0
1      1365398    98.95           1354502  98.95         1365398
20     5174918    79.22           3722368  94.65         27307960
40     9094265    58.58           4427267  94.82         54615920
60     8021606    38.18           4572896  94.93         81923880
80     7497318    18.2            4637062  94.76         109231840
100    6819897    1.47            4654521  94.63         136539800
------------------------------------
1) vs 2): PCP enabled improves by 3.86%
3) PCP re-enabled
tasks  processes  processes_idle  threads  threads_idle  linear
0      0          100             0        100           0
1      1419036    98.96           1428403  98.95         1428403
20     5356092    79.23           3851849  94.41         28568060
40     9437184    58.58           4512918  94.63         57136120
60     8252342    38.16           4659552  94.68         85704180
80     7414899    18.26           4790576  94.77         114272240
100    7062902    1.46            4759030  94.64         142840300
4) PCP re-disabled
tasks  processes  processes_idle  threads  threads_idle  linear
0      0          100             0        100           0
1      1352649    98.95           1354806  98.95         1354806
20     5172924    79.22           3719292  94.64         27096120
40     9174505    58.59           4310649  94.93         54192240
60     8021606    38.17           4552960  94.81         81288360
80     7497318    18.18           4671638  94.81         108384480
100    6823926    1.47            4725955  94.64         135480600
------------------------------------
3) vs 4): PCP enabled improves by 5.43%
Average: 4.645%
How much is the cost vs. benefit of just having one sane system
configuration?
For arm64 with 4K pages, that is five more high orders (4~8) and five more
pcplists. For high orders we assume most of the pages are movable, but maybe
they are not, so enabling this by default may cause more fragmentation; see
5d0a661d808f ("mm/page_alloc: use only one PCP list for THP-sized
allocations").