On 2024/11/4 10:35, Huang, Ying wrote:
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:
On 2024/11/1 16:16, Huang, Ying wrote:
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:
On 2024/10/31 16:39, Huang, Ying wrote:
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:
[snip]
1) Will run some random-access tests to check the performance
difference, as David suggested.
2) Hope LKP can run more tests, since it is very useful (more test
sets and different machines).
I'm starting to use LKP to test.
Great.
Sorry for the late reply.
I have run some tests with LKP.
Firstly, there's almost no measurable difference between clearing
pages from start to end and from end to start on an Intel server CPU.
I guess that there's some similar optimization for both directions.
For the multi-process vm-scalability/anon-w-seq test case (as many
processes as logical CPUs), the benchmark score increases by about
22.4%.
So process_huge_page is better than clear_gigantic_page() on Intel?
For the vm-scalability/anon-w-seq test case, it is, because the
performance of forward and backward clearing is almost the same, and
the user-space access gets a cache-hot benefit.
Could you test the following case on x86?
echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /hugetlbfs/
mount none /hugetlbfs/ -t hugetlbfs
rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && fallocate -d -l 20G /hugetlbfs/test && time taskset -c 10 fallocate -l 20G /hugetlbfs/test
It's not trivial for me to do this test, because 0day wraps test
cases.
Do you know which existing test cases provide this? For example, in
vm-scalability?
I don't know of a public fallocate test. I will try to find an Intel
machine to test this case.
I don't expect it to change much, because we have observed that the
performance of forward and backward clearing is similar on Intel.
I found an Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz:
Caches (sum of all):
L1d: 1.1 MiB (36 instances)
L1i: 1.1 MiB (36 instances)
L2: 36 MiB (36 instances)
L3: 49.5 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-17,36-53
NUMA node1 CPU(s): 18-35,54-71
Before:
 Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          3,856.93 msec task-clock                           #    0.997 CPUs utilized
                 6      context-switches                     #    1.556 /sec
                 1      cpu-migrations                       #    0.259 /sec
               132      page-faults                          #   34.224 /sec
    11,520,934,848      cycles                               #    2.987 GHz                          (19.95%)
       213,731,011      instructions                         #    0.02  insn per cycle               (24.96%)
        58,164,361      branches                             #   15.080 M/sec                        (24.96%)
           262,547      branch-misses                        #    0.45% of all branches              (24.97%)
        96,029,321      CPU_CLK_UNHALTED.REF_XCLK            #   24.898 M/sec
                                                             #    0.3 % tma_frontend_bound
                                                             #    3.3 % tma_retiring
                                                             #   96.4 % tma_backend_bound
                                                             #    0.0 % tma_bad_speculation          (24.99%)
       149,735,020      IDQ_UOPS_NOT_DELIVERED.CORE          #   38.822 M/sec                        (25.01%)
         2,486,326      INT_MISC.RECOVERY_CYCLES_ANY         #  644.638 K/sec                        (20.01%)
        95,973,482      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE   #   24.883 M/sec                        (20.01%)
    11,526,783,305      CPU_CLK_UNHALTED.THREAD              #    2.989 G/sec                        (20.01%)
     1,519,072,911      UOPS_RETIRED.RETIRE_SLOTS            #  393.855 M/sec                        (20.01%)
     1,526,020,825      UOPS_ISSUED.ANY                      #  395.657 M/sec                        (20.01%)
        59,784,189      L1-dcache-loads                      #   15.500 M/sec                        (20.01%)
       337,479,254      L1-dcache-load-misses                #  564.50% of all L1-dcache accesses    (20.02%)
           175,954      LLC-loads                            #   45.620 K/sec                        (20.02%)
            51,955      LLC-load-misses                      #   29.53% of all L1-icache accesses    (20.02%)
   <not supported>      L1-icache-loads
         2,864,230      L1-icache-load-misses                                                        (20.02%)
        59,769,391      dTLB-loads                           #   15.497 M/sec                        (20.02%)
               819      dTLB-load-misses                     #    0.00% of all dTLB cache accesses   (20.02%)
             2,459      iTLB-loads                           #  637.553 /sec                         (20.01%)
               370      iTLB-load-misses                     #   15.05% of all iTLB cache accesses   (19.98%)

       3.870393637 seconds time elapsed

       0.000000000 seconds user
       3.833021000 seconds sys

After (using clear_gigantic_page()):
 Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          4,426.18 msec task-clock                           #    0.994 CPUs utilized
                 8      context-switches                     #    1.807 /sec
                 1      cpu-migrations                       #    0.226 /sec
               131      page-faults                          #   29.597 /sec
    13,221,263,588      cycles                               #    2.987 GHz                          (19.98%)
       215,924,995      instructions                         #    0.02  insn per cycle               (25.00%)
        58,430,182      branches                             #   13.201 M/sec                        (25.01%)
           279,381      branch-misses                        #    0.48% of all branches              (25.03%)
       110,199,114      CPU_CLK_UNHALTED.REF_XCLK            #   24.897 M/sec
                                                             #    0.3 % tma_frontend_bound
                                                             #    2.9 % tma_retiring
                                                             #   96.8 % tma_backend_bound
                                                             #    0.0 % tma_bad_speculation          (25.06%)
       160,650,548      IDQ_UOPS_NOT_DELIVERED.CORE          #   36.296 M/sec                        (25.07%)
         2,559,970      INT_MISC.RECOVERY_CYCLES_ANY         #  578.370 K/sec                        (20.05%)
       110,229,402      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE   #   24.904 M/sec                        (20.05%)
    13,227,924,727      CPU_CLK_UNHALTED.THREAD              #    2.989 G/sec                        (20.03%)
     1,525,019,287      UOPS_RETIRED.RETIRE_SLOTS            #  344.545 M/sec                        (20.01%)
     1,531,307,263      UOPS_ISSUED.ANY                      #  345.966 M/sec                        (19.98%)
        60,600,471      L1-dcache-loads                      #   13.691 M/sec                        (19.96%)
       337,576,917      L1-dcache-load-misses                #  557.05% of all L1-dcache accesses    (19.96%)
           177,157      LLC-loads                            #   40.025 K/sec                        (19.96%)
            48,056      LLC-load-misses                      #   27.13% of all L1-icache accesses    (19.97%)
   <not supported>      L1-icache-loads
         2,653,617      L1-icache-load-misses                                                        (19.97%)
        60,609,241      dTLB-loads                           #   13.693 M/sec                        (19.97%)
               530      dTLB-load-misses                     #    0.00% of all dTLB cache accesses   (19.97%)
             1,952      iTLB-loads                           #  441.013 /sec                         (19.97%)
             3,059      iTLB-load-misses                     #  156.71% of all iTLB cache accesses   (19.97%)

       4.450664421 seconds time elapsed

       0.000984000 seconds user
       4.397795000 seconds sys

This shows that backward clearing is better than forward clearing, at
least on this CPU.
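To put a number on the gap, here is a minimal sketch that computes the relative change between the two perf runs. The figures are copied from the output above; the dictionary keys and the pct_change() helper are only illustrative names, not part of any tool.

```python
# Figures copied from the two `perf stat` runs quoted above.
# "before" is the existing code path; "after" uses clear_gigantic_page().
before = {
    "elapsed_s": 3.870393637,
    "cycles": 11_520_934_848,
    "instructions": 213_731_011,
}
after = {
    "elapsed_s": 4.450664421,
    "cycles": 13_221_263_588,
    "instructions": 215_924_995,
}


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    for key in before:
        print(f"{key}: {pct_change(before[key], after[key]):+.1f}%")
```

Elapsed time and cycles both grow by roughly 15% while the instruction count moves by about 1%, so the extra time is spent stalled rather than executing more work, which matches the ~96% tma_backend_bound in both runs.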
For the multi-process vm-scalability/anon-w-rand test case, there is
no measurable difference in benchmark score.
So, the optimization mainly helps sequential workloads.
In summary, on x86, process_huge_page() will not introduce any
regression, and it helps some workloads.
However, on ARM64, it does introduce some regression because it
clears pages from end to start. That needs to be addressed. I guess
that the regression can be resolved by doing more of the clearing from
start to end (but not all of it). For example, can you take a look at
the patch below? It uses a similar framework as before, but clears
each small chunk (mpage) from start to end. You can adjust
MPAGE_NRPAGES to check at which value the regression goes away.
WARNING: the patch is only build tested.
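For readers without the patch at hand, the idea can be illustrated with a small user-space model. This is only a sketch under stated assumptions: it walks the chunks strictly from last to first, whereas the real process_huge_page() orders its work around the faulting address, and clear_order() is a hypothetical helper name, not a kernel function.

```python
def clear_order(nr_pages, mpage_nrpages):
    """Return subpage indices in the order this simplified model clears them.

    Assumption: chunks are visited from the last one to the first, and
    the subpages inside each chunk are cleared from start to end.
    """
    if mpage_nrpages <= 0 or nr_pages % mpage_nrpages != 0:
        raise ValueError("nr_pages must be a positive multiple of mpage_nrpages")

    order = []
    # Walk the chunks backward ...
    for chunk_start in range(nr_pages - mpage_nrpages, -1, -mpage_nrpages):
        # ... but clear each chunk forward.
        order.extend(range(chunk_start, chunk_start + mpage_nrpages))
    return order


if __name__ == "__main__":
    # A 2M huge page has 512 4K subpages; 16 models MPAGE_NRPAGES=16.
    order = clear_order(512, 16)
    print(order[:4], "...", order[-4:])
```

With mpage_nrpages equal to nr_pages the model degenerates to a plain forward clear, and with mpage_nrpages=1 it is a plain backward clear, so the parameter moves between the two behaviours being compared in this thread.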
Base: baseline
Change1: using clear_gigantic_page() for 2M PMD
Change2: your patch with MPAGE_NRPAGES=16
Change3: Case3 + fix[1]
What is case3?
Oh, it is Change2.
Got it.
Change4: your patch with MPAGE_NRPAGES=64 + fix[1]
1. For rand write,
case-anon-w-rand/case-anon-w-rand-hugetlb no measurable difference
2. For seq write,
1) case-anon-w-seq-mt:
Can you try case-anon-w-seq? That may be more stable.
base:
real 0m2.490s 0m2.254s 0m2.272s
user 1m59.980s 2m23.431s 2m18.739s
sys 1m3.675s 1m15.462s 1m15.030s
Change1:
real 0m2.234s 0m2.225s 0m2.159s
user 2m56.105s 2m57.117s 3m0.489s
sys 0m17.064s 0m17.564s 0m16.150s
Change2:
real 0m2.244s 0m2.384s 0m2.370s
user 2m39.413s 2m41.990s 2m42.229s
sys 0m19.826s 0m18.491s 0m18.053s
That looks strange. There isn't much cache-hot benefit even if we
clear pages from end to begin (with a larger chunk).
However, sys time improves a lot. This shows that clearing pages in
large chunks helps on ARM64.
Change3: // best performance
real 0m2.155s 0m2.204s 0m2.194s
user 3m2.640s 2m55.837s 3m0.902s
sys 0m17.346s 0m17.630s 0m18.197s
Change4:
real 0m2.287s 0m2.377s 0m2.284s
user 2m37.030s 2m52.868s 3m17.593s
sys 0m15.445s 0m34.430s 0m45.224s
Change4 is essentially the same as Change1. I don't know why they are
different. Is there large variation from run to run?
As shown above, I ran each test three times and the results are
relatively stable, at least for real time. I will try case-anon-w-seq.
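The run-to-run variation can be checked directly from the three sys times reported above for each configuration. The following is a rough sketch: the values are copied from the results in this thread, and to_seconds() and summarize() are hypothetical helper names.

```python
import re
from statistics import mean

# sys times of the three runs, copied from the results above.
sys_times = {
    "base": ["1m3.675s", "1m15.462s", "1m15.030s"],
    "Change1": ["0m17.064s", "0m17.564s", "0m16.150s"],
    "Change2": ["0m19.826s", "0m18.491s", "0m18.053s"],
    "Change3": ["0m17.346s", "0m17.630s", "0m18.197s"],
    "Change4": ["0m15.445s", "0m34.430s", "0m45.224s"],
}


def to_seconds(stamp):
    """Convert a `time`-style value such as '1m3.675s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unexpected time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def summarize(stamps):
    """Return (mean, max - min) of the runs, in seconds."""
    secs = [to_seconds(s) for s in stamps]
    return mean(secs), max(secs) - min(secs)


if __name__ == "__main__":
    for name, stamps in sys_times.items():
        avg, spread = summarize(stamps)
        print(f"{name:8s} mean={avg:7.3f}s spread={spread:7.3f}s")
```

On these numbers the sys time of Change1 to Change3 varies by under 2 seconds between runs, while Change4 varies by almost 30 seconds, so real time looks stable but sys time for Change4 does not.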
Can you also show the score of vm-scalability?
TBH, I cannot understand your results. For example, why is there a
measurable difference between Change3 and Change4? In both cases, the
kernel clears pages from start to end.
OK, will retest once I can access the machine again.
Can you further optimize the prototype patch below? I think that it
has the potential to fix your issue.
Yes, thanks for your help, but this will make process_huge_page() a
little more complicated :)
IMHO, we should try to find the root cause first, then find the proper
solution and optimize (simplify) it.
From the above fallocate test on Intel, it seems that different
Intel microarchitectures behave differently too.