Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()

On 2024/11/4 10:35, Huang, Ying wrote:
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:

On 2024/11/1 16:16, Huang, Ying wrote:
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:

On 2024/10/31 16:39, Huang, Ying wrote:
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:
[snip]

1) Will run some random-access tests to check the difference in
performance, as David suggested.
2) Hope LKP can run more tests since it is very useful (more test sets
and different machines).
I'm starting to use LKP to test.

Great.


Sorry for the late reply,

I have run some tests with LKP.
Firstly, there's almost no measurable difference between clearing pages
from start to end or from end to start on an Intel server CPU.  I guess
that there's some similar optimization for both directions.
For the multiple-process (process count equal to the number of logical
CPUs) vm-scalability/anon-w-seq test case, the benchmark score increases
by about 22.4%.

So process_huge_page is better than clear_gigantic_page() on Intel?
For the vm-scalability/anon-w-seq test case, it is.  Because the
performance of forward and backward clearing is almost the same, and the
user-space access gets the cache-hot benefit.
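
To make the two clearing orders being compared concrete, here is a
minimal userspace sketch.  This is not the actual mm/memory.c code;
clear_forward(), clear_toward_fault(), NR_SUBPAGES and SUBPAGE_SIZE are
illustrative names, and the real process_huge_page() walks from both
ends toward the faulting address rather than strictly backward:

#include <stdio.h>
#include <string.h>

#define NR_SUBPAGES  512	/* 4K subpages in one 2M PMD-sized page */
#define SUBPAGE_SIZE 4096

static char huge_page[NR_SUBPAGES][SUBPAGE_SIZE];

/* clear_gigantic_page() style: walk forward, start to end */
static void clear_forward(void)
{
	int i;

	for (i = 0; i < NR_SUBPAGES; i++)
		memset(huge_page[i], 0, SUBPAGE_SIZE);
}

/*
 * process_huge_page() style (simplified): clear the subpages far from
 * the faulting address first and the ones around it last, so the cache
 * lines the faulting thread touches next are still hot.
 */
static void clear_toward_fault(int fault_idx)
{
	int i;

	for (i = NR_SUBPAGES - 1; i > fault_idx; i--)
		memset(huge_page[i], 0, SUBPAGE_SIZE);
	for (i = 0; i <= fault_idx; i++)
		memset(huge_page[i], 0, SUBPAGE_SIZE);
}

int main(void)
{
	clear_forward();
	clear_toward_fault(0);	/* anon-w-seq faults at the first subpage */
	printf("cleared %d subpages with both orders\n", NR_SUBPAGES);
	return 0;
}

For a sequential writer that faults at the start of the huge page,
clear_toward_fault(0) leaves the first subpages in the cache right
before user space touches them, which is where the anon-w-seq benefit
comes from.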

Could you test the following case on x86?

echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /hugetlbfs/
mount none /hugetlbfs/ -t hugetlbfs
rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && \
    fallocate -d -l 20G /hugetlbfs/test && \
    time taskset -c 10 fallocate -l 20G /hugetlbfs/test
It's not trivial for me to run this test, because 0day wraps test cases.
Do you know which existing test case provides this?  For example, in
vm-scalability?

I don't know of a public fallocate test; I will try to find an Intel
machine to run this case.

I don't expect it to change much, because we have observed that the
performance of forward and backward clearing is similar on Intel.

I found an Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz:

Caches (sum of all):
  L1d:                    1.1 MiB (36 instances)
  L1i:                    1.1 MiB (36 instances)
  L2:                     36 MiB (36 instances)
  L3:                     49.5 MiB (2 instances)
NUMA:
  NUMA node(s):           2
  NUMA node0 CPU(s):      0-17,36-53
  NUMA node1 CPU(s):      18-35,54-71


Before:

Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          3,856.93 msec task-clock                  #    0.997 CPUs utilized
                 6      context-switches            #    1.556 /sec
                 1      cpu-migrations              #    0.259 /sec
               132      page-faults                 #   34.224 /sec
    11,520,934,848      cycles                      #    2.987 GHz                       (19.95%)
       213,731,011      instructions                #    0.02  insn per cycle            (24.96%)
        58,164,361      branches                    #   15.080 M/sec                     (24.96%)
           262,547      branch-misses               #    0.45% of all branches           (24.97%)
        96,029,321      CPU_CLK_UNHALTED.REF_XCLK   #   24.898 M/sec
                                                    #    0.3 % tma_frontend_bound
                                                    #    3.3 % tma_retiring
                                                    #   96.4 % tma_backend_bound
                                                    #    0.0 % tma_bad_speculation       (24.99%)
       149,735,020      IDQ_UOPS_NOT_DELIVERED.CORE #   38.822 M/sec                     (25.01%)
         2,486,326      INT_MISC.RECOVERY_CYCLES_ANY #  644.638 K/sec                    (20.01%)
        95,973,482      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE # 24.883 M/sec                (20.01%)
    11,526,783,305      CPU_CLK_UNHALTED.THREAD     #    2.989 G/sec                     (20.01%)
     1,519,072,911      UOPS_RETIRED.RETIRE_SLOTS   #  393.855 M/sec                     (20.01%)
     1,526,020,825      UOPS_ISSUED.ANY             #  395.657 M/sec                     (20.01%)
        59,784,189      L1-dcache-loads             #   15.500 M/sec                     (20.01%)
       337,479,254      L1-dcache-load-misses       #  564.50% of all L1-dcache accesses (20.02%)
           175,954      LLC-loads                   #   45.620 K/sec                     (20.02%)
            51,955      LLC-load-misses             #   29.53% of all L1-icache accesses (20.02%)
   <not supported>      L1-icache-loads
         2,864,230      L1-icache-load-misses                                            (20.02%)
        59,769,391      dTLB-loads                  #   15.497 M/sec                     (20.02%)
               819      dTLB-load-misses            #    0.00% of all dTLB cache accesses (20.02%)
             2,459      iTLB-loads                  #  637.553 /sec                      (20.01%)
               370      iTLB-load-misses            #   15.05% of all iTLB cache accesses (19.98%)

       3.870393637 seconds time elapsed

       0.000000000 seconds user
       3.833021000 seconds sys

After (using clear_gigantic_page()):

Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          4,426.18 msec task-clock                  #    0.994 CPUs utilized
                 8      context-switches            #    1.807 /sec
                 1      cpu-migrations              #    0.226 /sec
               131      page-faults                 #   29.597 /sec
    13,221,263,588      cycles                      #    2.987 GHz                       (19.98%)
       215,924,995      instructions                #    0.02  insn per cycle            (25.00%)
        58,430,182      branches                    #   13.201 M/sec                     (25.01%)
           279,381      branch-misses               #    0.48% of all branches           (25.03%)
       110,199,114      CPU_CLK_UNHALTED.REF_XCLK   #   24.897 M/sec
                                                    #    0.3 % tma_frontend_bound
                                                    #    2.9 % tma_retiring
                                                    #   96.8 % tma_backend_bound
                                                    #    0.0 % tma_bad_speculation       (25.06%)
       160,650,548      IDQ_UOPS_NOT_DELIVERED.CORE #   36.296 M/sec                     (25.07%)
         2,559,970      INT_MISC.RECOVERY_CYCLES_ANY #  578.370 K/sec                    (20.05%)
       110,229,402      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE # 24.904 M/sec                (20.05%)
    13,227,924,727      CPU_CLK_UNHALTED.THREAD     #    2.989 G/sec                     (20.03%)
     1,525,019,287      UOPS_RETIRED.RETIRE_SLOTS   #  344.545 M/sec                     (20.01%)
     1,531,307,263      UOPS_ISSUED.ANY             #  345.966 M/sec                     (19.98%)
        60,600,471      L1-dcache-loads             #   13.691 M/sec                     (19.96%)
       337,576,917      L1-dcache-load-misses       #  557.05% of all L1-dcache accesses (19.96%)
           177,157      LLC-loads                   #   40.025 K/sec                     (19.96%)
            48,056      LLC-load-misses             #   27.13% of all L1-icache accesses (19.97%)
   <not supported>      L1-icache-loads
         2,653,617      L1-icache-load-misses                                            (19.97%)
        60,609,241      dTLB-loads                  #   13.693 M/sec                     (19.97%)
               530      dTLB-load-misses            #    0.00% of all dTLB cache accesses (19.97%)
             1,952      iTLB-loads                  #  441.013 /sec                      (19.97%)
             3,059      iTLB-load-misses            #  156.71% of all iTLB cache accesses (19.97%)

       4.450664421 seconds time elapsed

       0.000984000 seconds user
       4.397795000 seconds sys


This shows that backward clearing is better than forward clearing, at
least on this CPU.




For the multiple-process vm-scalability/anon-w-rand test case, there is
no measurable difference in the benchmark score.
So, the optimization mainly helps sequential workloads.
In summary, on x86, process_huge_page() will not introduce any
regression, and it helps some workloads.
However, on ARM64 it does introduce some regression for clearing pages
from end to start.  That needs to be addressed.  I guess that the
regression can be resolved by doing more of the clearing from start to
end (but not all of it).  For example, can you take a look at the patch
below?  It uses a similar framework as before, but clears each small
chunk (mpage) from start to end.  You can adjust MPAGE_NRPAGES to check
when the regression goes away.
WARNING: the patch is only build tested.
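
To illustrate the direction the prototype takes, here is a minimal
userspace sketch of the chunked idea (illustrative only, not the actual
patch; the names and sizes below are assumptions): order the chunks from
far-from-fault to the faulting chunk to keep the cache-hot benefit, but
clear the pages inside each MPAGE_NRPAGES-sized chunk from low to high
address, which is the fast direction on this ARM64 machine:

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE     4096
#define NR_PAGES      512	/* one 2M PMD worth of 4K pages */
#define MPAGE_NRPAGES 16	/* tunable chunk size from the prototype */
#define NR_MPAGES     (NR_PAGES / MPAGE_NRPAGES)

static char pmd_area[NR_PAGES][PAGE_SIZE];

/* within a chunk: always clear from low to high address */
static void clear_chunk_forward(int first_page)
{
	int i;

	for (i = 0; i < MPAGE_NRPAGES; i++)
		memset(pmd_area[first_page + i], 0, PAGE_SIZE);
}

/* order the chunks far-from-fault first, faulting chunk last */
static void clear_huge_page_chunked(int fault_page)
{
	int fault_chunk = fault_page / MPAGE_NRPAGES;
	int c;

	for (c = NR_MPAGES - 1; c > fault_chunk; c--)
		clear_chunk_forward(c * MPAGE_NRPAGES);
	for (c = 0; c <= fault_chunk; c++)
		clear_chunk_forward(c * MPAGE_NRPAGES);
}

int main(void)
{
	clear_huge_page_chunked(0);	/* sequential writer faults at page 0 */
	printf("cleared %d pages in %d chunks of %d\n",
	       NR_PAGES, NR_MPAGES, MPAGE_NRPAGES);
	return 0;
}

Raising the chunk size trades some of the cache-hot benefit for longer
forward streaming runs, which is what comparing MPAGE_NRPAGES=16 and
MPAGE_NRPAGES=64 below probes.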


Base: baseline
Change1: using clear_gigantic_page() for 2M PMD
Change2: your patch with MPAGE_NRPAGES=16
Change3: Case3 + fix[1]
What is case3?

Oh, it is Change2.

Got it.


Change4: your patch with MPAGE_NRPAGES=64 + fix[1]

1. For rand write,
     case-anon-w-rand/case-anon-w-rand-hugetlb: no measurable difference

2. For seq write,

1) case-anon-w-seq-mt:
Can you try case-anon-w-seq?  That may be more stable.

base:
real    0m2.490s    0m2.254s    0m2.272s
user    1m59.980s   2m23.431s   2m18.739s
sys     1m3.675s    1m15.462s   1m15.030s

Change1:
real    0m2.234s    0m2.225s    0m2.159s
user    2m56.105s   2m57.117s   3m0.489s
sys     0m17.064s   0m17.564s   0m16.150s

Change2:
real    0m2.244s    0m2.384s    0m2.370s
user    2m39.413s   2m41.990s   2m42.229s
sys     0m19.826s   0m18.491s   0m18.053s
It appears strange.  There's not much cache-hot benefit even if we clear
pages from end to beginning (with a larger chunk).
However, sys time improves a lot.  This shows that clearing pages with a
larger chunk helps on ARM64.

Change3:  // best performance
real    0m2.155s    0m2.204s    0m2.194s
user    3m2.640s    2m55.837s   3m0.902s
sys     0m17.346s   0m17.630s   0m18.197s

Change4:
real    0m2.287s    0m2.377s    0m2.284s
user    2m37.030s   2m52.868s   3m17.593s
sys     0m15.445s   0m34.430s   0m45.224s
Change4 is essentially the same as Change1.  I don't know why they are
different.  Is there some large variation from run to run?

As shown above, I tested three times and the results are relatively
stable, at least for the real time.  I will try case-anon-w-seq.

Can you also show the score of vm-scalability?

TBH, I cannot understand your results.  For example, why is there a
measurable difference between Change3 and Change4?  In both cases, the
kernel clears pages from start to end.

OK, I will retest once I can access the machine again.


Can you further optimize the prototype patch below?  I think that it has
the potential to fix your issue.

Yes, thanks for your help, but this will make process_huge_page() a
little more complicated :)

IMHO, we should try to root-cause it first, then find the proper
solution and optimize (simplify) it.

From the above fallocate test, it seems that different microarchitectures
have different performance characteristics even on Intel.



