Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()

On 2024/11/4 10:35, Huang, Ying wrote:
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:

On 2024/11/1 16:16, Huang, Ying wrote:
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:

On 2024/10/31 16:39, Huang, Ying wrote:
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:
[snip]

1) Will run some random-access tests to check the difference in
performance, as David suggested.
2) Hope LKP can run more tests since it is very useful (more test sets
and different machines).
I'm starting to use LKP to test.

Great.


Sorry for the late reply,

I have run some tests with LKP.
Firstly, there's almost no measurable difference between clearing pages
from start to end or from end to start on an Intel server CPU.  I guess
that there's some similar optimization for both directions.
For the multiple-process (process count equal to the number of logical
CPUs) vm-scalability/anon-w-seq test case, the benchmark score increases
by about 22.4%.

So process_huge_page is better than clear_gigantic_page() on Intel?
For the vm-scalability/anon-w-seq test case, it is.  Because the
performance of forward and backward clearing is almost the same, and the
user-space access gets the cache-hot benefit.
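
To make the two clearing orders being compared concrete, here is a
minimal userspace sketch.  This is not the actual mm/memory.c code;
clear_forward(), clear_toward_fault(), NR_SUBPAGES and SUBPAGE_SIZE are
illustrative names, and the real process_huge_page() walks from both
ends toward the faulting address rather than strictly backward:

#include <stdio.h>
#include <string.h>

#define NR_SUBPAGES  512	/* 4K subpages in one 2M PMD-sized page */
#define SUBPAGE_SIZE 4096

static char huge_page[NR_SUBPAGES][SUBPAGE_SIZE];

/* clear_gigantic_page() style: walk forward, start to end */
static void clear_forward(void)
{
	int i;

	for (i = 0; i < NR_SUBPAGES; i++)
		memset(huge_page[i], 0, SUBPAGE_SIZE);
}

/*
 * process_huge_page() style (simplified): clear the subpages far from
 * the faulting address first and the ones around it last, so the cache
 * lines the faulting thread touches next are still hot.
 */
static void clear_toward_fault(int fault_idx)
{
	int i;

	for (i = NR_SUBPAGES - 1; i > fault_idx; i--)
		memset(huge_page[i], 0, SUBPAGE_SIZE);
	for (i = 0; i <= fault_idx; i++)
		memset(huge_page[i], 0, SUBPAGE_SIZE);
}

int main(void)
{
	clear_forward();
	clear_toward_fault(0);	/* anon-w-seq faults at the first subpage */
	printf("cleared %d subpages with both orders\n", NR_SUBPAGES);
	return 0;
}

For a sequential writer that faults at the start of the huge page,
clear_toward_fault(0) leaves the first subpages in the cache right
before user space touches them, which is where the anon-w-seq benefit
comes from.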

Could you test the following case on x86?

echo 10240 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
mkdir -p /hugetlbfs/
mount none /hugetlbfs/ -t hugetlbfs
rm -f /hugetlbfs/test && fallocate -l 20G /hugetlbfs/test && \
    fallocate -d -l 20G /hugetlbfs/test && \
    time taskset -c 10 fallocate -l 20G /hugetlbfs/test
It's not trivial for me to run this test, because 0day wraps test cases.
Do you know which existing test case provides this?  For example, in
vm-scalability?

I don't know of a public fallocate test; I will try to find an Intel
machine to run this case.

I don't expect it to change much, because we have observed that the
performance of forward and backward clearing is similar on Intel.

I found an Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz:

Caches (sum of all):
  L1d:                    1.1 MiB (36 instances)
  L1i:                    1.1 MiB (36 instances)
  L2:                     36 MiB (36 instances)
  L3:                     49.5 MiB (2 instances)
NUMA:
  NUMA node(s):           2
  NUMA node0 CPU(s):      0-17,36-53
  NUMA node1 CPU(s):      18-35,54-71


Before:

Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          3,856.93 msec task-clock                  #    0.997 CPUs utilized
                 6      context-switches            #    1.556 /sec
                 1      cpu-migrations              #    0.259 /sec
               132      page-faults                 #   34.224 /sec
    11,520,934,848      cycles                      #    2.987 GHz                       (19.95%)
       213,731,011      instructions                #    0.02  insn per cycle            (24.96%)
        58,164,361      branches                    #   15.080 M/sec                     (24.96%)
           262,547      branch-misses               #    0.45% of all branches           (24.97%)
        96,029,321      CPU_CLK_UNHALTED.REF_XCLK   #   24.898 M/sec
                                                    #    0.3 % tma_frontend_bound
                                                    #    3.3 % tma_retiring
                                                    #   96.4 % tma_backend_bound
                                                    #    0.0 % tma_bad_speculation       (24.99%)
       149,735,020      IDQ_UOPS_NOT_DELIVERED.CORE #   38.822 M/sec                     (25.01%)
         2,486,326      INT_MISC.RECOVERY_CYCLES_ANY #  644.638 K/sec                    (20.01%)
        95,973,482      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE # 24.883 M/sec                (20.01%)
    11,526,783,305      CPU_CLK_UNHALTED.THREAD     #    2.989 G/sec                     (20.01%)
     1,519,072,911      UOPS_RETIRED.RETIRE_SLOTS   #  393.855 M/sec                     (20.01%)
     1,526,020,825      UOPS_ISSUED.ANY             #  395.657 M/sec                     (20.01%)
        59,784,189      L1-dcache-loads             #   15.500 M/sec                     (20.01%)
       337,479,254      L1-dcache-load-misses       #  564.50% of all L1-dcache accesses (20.02%)
           175,954      LLC-loads                   #   45.620 K/sec                     (20.02%)
            51,955      LLC-load-misses             #   29.53% of all L1-icache accesses (20.02%)
   <not supported>      L1-icache-loads
         2,864,230      L1-icache-load-misses                                            (20.02%)
        59,769,391      dTLB-loads                  #   15.497 M/sec                     (20.02%)
               819      dTLB-load-misses            #    0.00% of all dTLB cache accesses (20.02%)
             2,459      iTLB-loads                  #  637.553 /sec                      (20.01%)
               370      iTLB-load-misses            #   15.05% of all iTLB cache accesses (19.98%)

       3.870393637 seconds time elapsed

       0.000000000 seconds user
       3.833021000 seconds sys

After (using clear_gigantic_page()):

Performance counter stats for 'taskset -c 10 fallocate -l 20G /mnt/hugetlbfs/test':

          4,426.18 msec task-clock                  #    0.994 CPUs utilized
                 8      context-switches            #    1.807 /sec
                 1      cpu-migrations              #    0.226 /sec
               131      page-faults                 #   29.597 /sec
    13,221,263,588      cycles                      #    2.987 GHz                       (19.98%)
       215,924,995      instructions                #    0.02  insn per cycle            (25.00%)
        58,430,182      branches                    #   13.201 M/sec                     (25.01%)
           279,381      branch-misses               #    0.48% of all branches           (25.03%)
       110,199,114      CPU_CLK_UNHALTED.REF_XCLK   #   24.897 M/sec
                                                    #    0.3 % tma_frontend_bound
                                                    #    2.9 % tma_retiring
                                                    #   96.8 % tma_backend_bound
                                                    #    0.0 % tma_bad_speculation       (25.06%)
       160,650,548      IDQ_UOPS_NOT_DELIVERED.CORE #   36.296 M/sec                     (25.07%)
         2,559,970      INT_MISC.RECOVERY_CYCLES_ANY #  578.370 K/sec                    (20.05%)
       110,229,402      CPU_CLK_UNHALTED.ONE_THREAD_ACTIVE # 24.904 M/sec                (20.05%)
    13,227,924,727      CPU_CLK_UNHALTED.THREAD     #    2.989 G/sec                     (20.03%)
     1,525,019,287      UOPS_RETIRED.RETIRE_SLOTS   #  344.545 M/sec                     (20.01%)
     1,531,307,263      UOPS_ISSUED.ANY             #  345.966 M/sec                     (19.98%)
        60,600,471      L1-dcache-loads             #   13.691 M/sec                     (19.96%)
       337,576,917      L1-dcache-load-misses       #  557.05% of all L1-dcache accesses (19.96%)
           177,157      LLC-loads                   #   40.025 K/sec                     (19.96%)
            48,056      LLC-load-misses             #   27.13% of all L1-icache accesses (19.97%)
   <not supported>      L1-icache-loads
         2,653,617      L1-icache-load-misses                                            (19.97%)
        60,609,241      dTLB-loads                  #   13.693 M/sec                     (19.97%)
               530      dTLB-load-misses            #    0.00% of all dTLB cache accesses (19.97%)
             1,952      iTLB-loads                  #  441.013 /sec                      (19.97%)
             3,059      iTLB-load-misses            #  156.71% of all iTLB cache accesses (19.97%)

       4.450664421 seconds time elapsed

       0.000984000 seconds user
       4.397795000 seconds sys


This shows that backward clearing is better than forward clearing, at
least on this CPU.




For the multiple-process vm-scalability/anon-w-rand test case, there is
no measurable difference in the benchmark score.
So, the optimization mainly helps sequential workloads.
In summary, on x86, process_huge_page() will not introduce any
regression, and it helps some workloads.
However, on ARM64 it does introduce some regression for clearing pages
from end to start.  That needs to be addressed.  I guess that the
regression can be resolved by doing more of the clearing from start to
end (but not all of it).  For example, can you take a look at the patch
below?  It uses a similar framework as before, but clears each small
chunk (mpage) from start to end.  You can adjust MPAGE_NRPAGES to check
when the regression goes away.
WARNING: the patch is only build tested.
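
To illustrate the direction the prototype takes, here is a minimal
userspace sketch of the chunked idea (illustrative only, not the actual
patch; the names and sizes below are assumptions): order the chunks from
far-from-fault to the faulting chunk to keep the cache-hot benefit, but
clear the pages inside each MPAGE_NRPAGES-sized chunk from low to high
address, which is the fast direction on this ARM64 machine:

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE     4096
#define NR_PAGES      512	/* one 2M PMD worth of 4K pages */
#define MPAGE_NRPAGES 16	/* tunable chunk size from the prototype */
#define NR_MPAGES     (NR_PAGES / MPAGE_NRPAGES)

static char pmd_area[NR_PAGES][PAGE_SIZE];

/* within a chunk: always clear from low to high address */
static void clear_chunk_forward(int first_page)
{
	int i;

	for (i = 0; i < MPAGE_NRPAGES; i++)
		memset(pmd_area[first_page + i], 0, PAGE_SIZE);
}

/* order the chunks far-from-fault first, faulting chunk last */
static void clear_huge_page_chunked(int fault_page)
{
	int fault_chunk = fault_page / MPAGE_NRPAGES;
	int c;

	for (c = NR_MPAGES - 1; c > fault_chunk; c--)
		clear_chunk_forward(c * MPAGE_NRPAGES);
	for (c = 0; c <= fault_chunk; c++)
		clear_chunk_forward(c * MPAGE_NRPAGES);
}

int main(void)
{
	clear_huge_page_chunked(0);	/* sequential writer faults at page 0 */
	printf("cleared %d pages in %d chunks of %d\n",
	       NR_PAGES, NR_MPAGES, MPAGE_NRPAGES);
	return 0;
}

Raising the chunk size trades some of the cache-hot benefit for longer
forward streaming runs, which is what comparing MPAGE_NRPAGES=16 and
MPAGE_NRPAGES=64 below probes.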


Base: baseline
Change1: using clear_gigantic_page() for 2M PMD
Change2: your patch with MPAGE_NRPAGES=16
Change3: Case3 + fix[1]
What is case3?

Oh, it is Change2.

Got it.


Change4: your patch with MPAGE_NRPAGES=64 + fix[1]

1. For rand write,
     case-anon-w-rand/case-anon-w-rand-hugetlb: no measurable difference

2. For seq write,

1) case-anon-w-seq-mt:
Can you try case-anon-w-seq?  That may be more stable.

base:
real    0m2.490s    0m2.254s    0m2.272s
user    1m59.980s   2m23.431s   2m18.739s
sys     1m3.675s    1m15.462s   1m15.030s

Change1:
real    0m2.234s    0m2.225s    0m2.159s
user    2m56.105s   2m57.117s   3m0.489s
sys     0m17.064s   0m17.564s   0m16.150s

Change2:
real    0m2.244s    0m2.384s    0m2.370s
user    2m39.413s   2m41.990s   2m42.229s
sys     0m19.826s   0m18.491s   0m18.053s
It appears strange.  There's not much cache-hot benefit even if we clear
pages from end to beginning (with a larger chunk).
However, sys time improves a lot.  This shows that clearing pages with a
larger chunk helps on ARM64.

Change3:  // best performance
real    0m2.155s    0m2.204s    0m2.194s
user    3m2.640s    2m55.837s   3m0.902s
sys     0m17.346s   0m17.630s   0m18.197s

Change4:
real    0m2.287s    0m2.377s    0m2.284s
user    2m37.030s   2m52.868s   3m17.593s
sys     0m15.445s   0m34.430s   0m45.224s
Change4 is essentially the same as Change1.  I don't know why they are
different.  Is there some large variation from run to run?

As shown above, I tested three times and the results are relatively
stable, at least for the real time.  I will try case-anon-w-seq.

Can you also show the score of vm-scalability?

TBH, I cannot understand your results.  For example, why is there a
measurable difference between Change3 and Change4?  In both cases, the
kernel clears pages from start to end.

OK, I will retest once I can access the machine again.


Can you further optimize the prototype patch below?  I think that it has
the potential to fix your issue.

Yes, thanks for your help, but this will make process_huge_page() a
little more complicated :)

IMHO, we should try to root-cause it first, then find the proper
solution and optimize (simplify) it.

From the above fallocate test, it seems that different microarchitectures
have different performance characteristics even on Intel.



