Re: [PATCH v2 1/2] mm: use aligned address in clear_gigantic_page()

Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> · Wed, 30 Oct 2024 13:05:33 +0800

On 2024/10/30 11:21, Huang, Ying wrote:
Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> writes:

On 2024/10/30 9:04, Huang, Ying wrote:
David Hildenbrand <david@xxxxxxxxxx> writes:

On 29.10.24 14:04, Kefeng Wang wrote:

That should all be cleaned up ... process_huge_page() likely
shouldn't

Yes, let's fix the bug firstly,

be even consuming "nr_pages".

No sure about this part, it uses nr_pages as the end and calculate
the
'base'.

It should be using folio_nr_pages().

But process_huge_page() without an explicit folio argument, I'd like to
move the aligned address calculate into the folio_zero_user and
copy_user_large_folio(will rename it to folio_copy_user()) in the
following cleanup patches, or do it in the fix patches?

First, why does folio_zero_user() call process_huge_page() for *a small
folio*? Because we like or code to be extra complicated to understand?
Or am I missing something important?

The folio_zero_user() used for PMD-sized THP and HugeTLB before, and
after anon mTHP supported, it is used for order-2~order-PMD-order THP
and HugeTLB, so it won't process a small folio if I understand correctly.

And unfortunately neither the documentation nor the function name
expresses that :(

I'm happy to review any patches that improve the situation here :)

Actually, could we drop the process_huge_page() totally, from my
testcase[1], process_huge_page() is not better than clear/copy page
from start to last, and sequential clearing/copying maybe more
beneficial to the hardware prefetching, and is there a way to let lkp
to test to check the performance, since the process_huge_page()
was submitted by Ying, what's your opinion?
I don't think that it's a good idea to revert the commit without
studying and root causing the issues.  I can work together with you on
that.  If we have solid and well explained data to prove
process_huge_page() isn't benefitial, we can revert the commit.

Take 'fallocate 20G' as an example， before

Performance counter stats for 'taskset -c 10 fallocate -l 20G
/mnt/hugetlbfs/test':

IIUC, fallocate will zero pages, but will not touch them at all, right?
If so, no cache benefit from clearing referenced page last.

Yes, for this case, only clear page.

           3,118.94 msec task-clock                #    0.999 CPUs
           utilized
                 30      context-switches          #    0.010 K/sec
                 1      cpu-migrations            #    0.000 K/sec
                 136      page-faults               #    0.044 K/sec
                 8,092,075,873      cycles                    #
                 2.594 GHz                (92.82%)
      1,624,587,663      instructions              #    0.20  insn per
      cycle           (92.83%)
        395,341,850      branches                  #  126.755 M/sec
        (92.82%)
          3,872,302      branch-misses             #    0.98% of all
          branches          (92.83%)
      1,398,066,701      L1-dcache-loads           #  448.251 M/sec
      (92.82%)
         58,124,626      L1-dcache-load-misses     #    4.16% of all
         L1-dcache accesses  (92.82%)
          1,032,527      LLC-loads                 #    0.331 M/sec
          (92.82%)
            498,684      LLC-load-misses           #   48.30% of all
            LL-cache accesses  (92.84%)
        473,689,004      L1-icache-loads           #  151.875 M/sec
        (92.82%)
            356,721      L1-icache-load-misses     #    0.08% of all
            L1-icache accesses  (92.85%)
      1,947,644,987      dTLB-loads                #  624.458 M/sec
      (92.95%)
             10,185      dTLB-load-misses          #    0.00% of all
             dTLB cache accesses  (92.96%)
        474,622,896      iTLB-loads                #  152.174 M/sec
        (92.95%)
                 94      iTLB-load-misses          #    0.00% of all
                 iTLB cache accesses  (85.69%)

        3.122844830 seconds time elapsed

        0.000000000 seconds user
        3.107259000 seconds sys

and after（clear from start to end）

Performance counter stats for 'taskset -c 10 fallocate -l 20G
/mnt/hugetlbfs/test':

           1,135.53 msec task-clock                #    0.999 CPUs
           utilized
                 10      context-switches          #    0.009 K/sec
                 1      cpu-migrations            #    0.001 K/sec
                 137      page-faults               #    0.121 K/sec
                 2,946,673,587      cycles                    #
                 2.595 GHz                (92.67%)
      1,620,704,205      instructions              #    0.55  insn per
      cycle           (92.61%)
        394,595,772      branches                  #  347.499 M/sec
        (92.60%)
            130,756      branch-misses             #    0.03% of all
            branches          (92.84%)
      1,396,726,689      L1-dcache-loads           # 1230.022 M/sec
      (92.96%)
            338,344      L1-dcache-load-misses     #    0.02% of all
            L1-dcache accesses  (92.95%)
            111,737      LLC-loads                 #    0.098 M/sec
            (92.96%)
             67,486      LLC-load-misses           #   60.40% of all
             LL-cache accesses  (92.96%)
        418,198,663      L1-icache-loads           #  368.285 M/sec
        (92.96%)
            173,764      L1-icache-load-misses     #    0.04% of all
            L1-icache accesses  (92.96%)
      2,203,364,632      dTLB-loads                # 1940.385 M/sec
      (92.96%)
             17,195      dTLB-load-misses          #    0.00% of all
             dTLB cache accesses  (92.95%)
        418,198,365      iTLB-loads                #  368.285 M/sec
        (92.96%)
                 79      iTLB-load-misses          #    0.00% of all
                 iTLB cache accesses  (85.34%)

        1.137015760 seconds time elapsed

        0.000000000 seconds user
        1.131266000 seconds sys

The IPC improved a lot，less LLC-loads and more L1-dcache-loads， but
this depends on the implementation of the microarchitecture.

Anyway, we need to avoid (or reduce at least) the pure memory clearing
performance.  Have you double checked whether process_huge_page() is
inlined?  Perf-profile result can be used to check this too.

Yes, I'm sure the process_huge_page() is inlined.

When you say from start to end, you mean to use clear_gigantic_page()
directly, or change process_huge_page() to clear page from start to end?

Using clear_gigantic_page() and changing process_huge_page() to clear 
page from start to end are both good for performance when sequential 
clearing, but no random test so far.

1) Will test some rand test to check the different of performance as
David suggested.

2) Hope the LKP to run more tests since it is very useful(more test
set and different machines)

I'm starting to use LKP to test.

Greet.