Re: [PATCH v2 0/9] x86/clear_huge_page: multi-page clearing

On 8/31/2023 12:19 AM, Ankur Arora wrote:
> This series adds a multi-page clearing primitive, clear_pages(),
> which enables more effective use of x86 string instructions by
> advertising the real region-size to be cleared.
>
> Region-size can be used as a hint by uarchs to optimize the
> clearing.
>
> Also add allow_resched(), which marks a code-section as allowing
> rescheduling in the irqentry_exit path. This allows clear_pages()
> to get by without having to call cond_resched() periodically.
> (preempt_model_full() already handles this via
> irqentry_exit_cond_resched(), so we handle this similarly for
> preempt_model_none() and preempt_model_voluntary().)



Hello Ankur,
Thanks for the patches.

I tried the patches; the improvements look similar to v1 (even without
the circuitous chunk optimizations). We still see a similar 50-60%
improvement for the 2M and 1G page sizes.
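
To make it concrete for anyone skimming the thread: my mental model of
the primitive is a single string store over the whole extent, instead of
a per-4K memset loop, along these lines. This is an illustrative sketch
only, not code from the series; clear_pages_sketch() and the hardcoded
4096 page size are mine.

	/*
	 * Illustrative sketch (not from this series): clear an extent of
	 * pages with one REP STOSB, so the CPU sees the full region length
	 * up front and can pick a wider clearing strategy internally.
	 */
	static inline void clear_pages_sketch(void *page, unsigned long npages)
	{
		void *dest = page;
		unsigned long count = npages * 4096;	/* bytes to clear */

		asm volatile("rep stosb"
			     : "+D" (dest), "+c" (count)	/* RDI, RCX */
			     : "a" (0)				/* AL = 0 */
			     : "memory");
	}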
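Likewise for the preemption side, as the cover letter describes it: the
long clear is bracketed so that irqentry_exit can reschedule the task,
rather than chunking the work around cond_resched(). A rough usage
sketch follows; allow_resched() is named in the cover letter, while
clear_contig_region() and the paired disallow_resched() marker are my
guesses, not necessarily what the patches use.

	/*
	 * Usage pattern as described in the cover letter. allow_resched()
	 * comes from the series; the paired disallow_resched() name is a
	 * hypothetical end-of-section marker.
	 */
	static void clear_contig_region(void *addr, unsigned long npages)
	{
		allow_resched();	/* irqentry_exit may now reschedule us */
		clear_pages(addr, npages);
		disallow_resched();	/* hypothetical: close the section */
	}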


SUT: Bergamo
    CPU family:          25
    Model:               160
    Thread(s) per core:  2
    Core(s) per socket:  128
    Socket(s):           2

NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-127,256-383
  NUMA node1 CPU(s):     128-255,384-511

Test: use mmap(MAP_HUGETLB) to demand-fault a 64GB region (on NUMA
node0), for both base hugepage sizes, 2M and 1G. The current results are
with thp=always, but madvise also did not make much difference.

perf stat -r 10 -d -d  numactl -m 0 -N 0 <test>
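
For anyone who wants to reproduce: the test is essentially of the
following shape. This is a minimal sketch under my assumptions, not the
exact mm/map_hugetlb source, and it assumes enough 2M hugepages have
been reserved (e.g. via /proc/sys/vm/nr_hugepages).

	/* Minimal sketch of the 2M demand-fault test (not the exact source). */
	#include <stdio.h>
	#include <sys/mman.h>

	#define MAP_LEN  (64UL << 30)		/* 64GB region */
	#define HPAGE_2M (2UL << 20)

	#ifndef MAP_HUGE_2MB
	#define MAP_HUGE_2MB (21 << 26)		/* log2(2M) << MAP_HUGE_SHIFT */
	#endif

	int main(void)
	{
		unsigned long off;
		char *p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
			       -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* One write per huge page: each store demand-faults the page,
		 * and the kernel clears the whole 2M page in the fault path. */
		for (off = 0; off < MAP_LEN; off += HPAGE_2M)
			p[off] = 1;

		munmap(p, MAP_LEN);
		return 0;
	}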

Time elapsed in seconds (average of 10 runs; lower is better).

Result:
base:    mm/clear_huge_page
patched: x86/clear_huge_page

page-size  base (s)   patched (s)  Improvement %
2M         5.0779     2.50623      50.64
1G         2.51890    1.012439     59.81

(Improvement % = (base - patched) / base * 100.)

More details:

 Performance counter stats for 'mm/map_hugetlb' (10 runs):

          5,058.71 msec task-clock                #    0.996 CPUs utilized            ( +-  0.26% )
                 8      context-switches          #    1.576 /sec                     ( +-  7.23% )
                 0      cpu-migrations            #    0.000 /sec
            32,917      page-faults               #    6.484 K/sec                    ( +-  0.00% )
    15,797,804,067      cycles                    #    3.112 GHz                      ( +-  0.26% )  (35.70%)
         2,073,754      stalled-cycles-frontend   #    0.01% frontend cycles idle     ( +-  1.25% )  (35.71%)
        27,508,977      stalled-cycles-backend    #    0.17% backend cycles idle      ( +-  9.48% )  (35.74%)
     1,143,710,651      instructions              #    0.07  insn per cycle
                                                  #    0.03  stalled cycles per insn  ( +-  0.15% )  (35.76%)
       243,817,330      branches                  #   48.028 M/sec                    ( +-  0.12% )  (35.78%)
           357,760      branch-misses             #    0.15% of all branches          ( +-  1.52% )  (35.75%)
     2,540,733,497      L1-dcache-loads           #  500.483 M/sec                    ( +-  0.04% )  (35.74%)
     1,093,660,557      L1-dcache-load-misses     #   42.98% of all L1-dcache accesses ( +-  0.03% )  (35.71%)
        73,335,478      L1-icache-loads           #   14.446 M/sec                    ( +-  0.08% )  (35.70%)
           878,378      L1-icache-load-misses     #    1.19% of all L1-icache accesses ( +-  2.65% )  (35.68%)
         1,025,714      dTLB-loads                #  202.049 K/sec                    ( +-  2.70% )  (35.69%)
           405,407      dTLB-load-misses          #   37.35% of all dTLB cache accesses ( +-  1.59% )  (35.68%)
                 2      iTLB-loads                #    0.394 /sec                     ( +- 41.63% )  (35.68%)
            40,356      iTLB-load-misses          # 1552153.85% of all iTLB cache accesses ( +-  7.18% )  (35.68%)

            5.0779 +- 0.0132 seconds time elapsed  ( +-  0.26% )

 Performance counter stats for 'numactl -m 0 -N 0 x86/map_hugetlb' (10 runs):

          2,538.40 msec task-clock                #    1.013 CPUs utilized            ( +-  0.27% )
                 4      context-switches          #    1.597 /sec                     ( +-  6.51% )
                 1      cpu-migrations            #    0.399 /sec
            32,916      page-faults               #   13.140 K/sec                    ( +-  0.00% )
     7,901,830,782      cycles                    #    3.154 GHz                      ( +-  0.27% )  (35.67%)
         6,590,473      stalled-cycles-frontend   #    0.08% frontend cycles idle     ( +- 10.31% )  (35.71%)
       329,970,288      stalled-cycles-backend    #    4.23% backend cycles idle      ( +- 13.65% )  (35.74%)
       725,811,962      instructions              #    0.09  insn per cycle
                                                  #    0.80  stalled cycles per insn  ( +-  0.37% )  (35.78%)
       132,182,704      branches                  #   52.767 M/sec                    ( +-  0.26% )  (35.82%)
           254,163      branch-misses             #    0.19% of all branches          ( +-  2.47% )  (35.81%)
     2,382,927,453      L1-dcache-loads           #  951.262 M/sec                    ( +-  0.04% )  (35.77%)
     1,082,022,067      L1-dcache-load-misses     #   45.41% of all L1-dcache accesses ( +-  0.02% )  (35.74%)
        47,164,491      L1-icache-loads           #   18.828 M/sec                    ( +-  0.37% )  (35.70%)
           474,535      L1-icache-load-misses     #    0.99% of all L1-icache accesses ( +-  2.93% )  (35.66%)
         1,477,334      dTLB-loads                #  589.750 K/sec                    ( +-  5.12% )  (35.65%)
           624,125      dTLB-load-misses          #   56.24% of all dTLB cache accesses ( +-  5.66% )  (35.65%)
                 0      iTLB-loads                #    0.000 /sec                     (35.65%)
             1,626      iTLB-load-misses          # 7069.57% of all iTLB cache accesses ( +-283.51% )  (35.65%)

           2.50623 +- 0.00691 seconds time elapsed  ( +-  0.28% )


 Performance counter stats for 'numactl -m 0 -N 0 mm/map_hugetlb_1G' (10 runs):

          2,506.50 msec task-clock                #    0.995 CPUs utilized            ( +-  0.17% )
                 4      context-switches          #    1.589 /sec                     ( +-  9.28% )
                 0      cpu-migrations            #    0.000 /sec
               214      page-faults               #   84.997 /sec                     ( +-  0.13% )
     7,821,519,053      cycles                    #    3.107 GHz                      ( +-  0.17% )  (35.72%)
         2,037,744      stalled-cycles-frontend   #    0.03% frontend cycles idle     ( +- 25.62% )  (35.73%)
         6,578,899      stalled-cycles-backend    #    0.08% backend cycles idle      ( +-  2.65% )  (35.73%)
       468,648,780      instructions              #    0.06  insn per cycle
                                                  #    0.01  stalled cycles per insn  ( +-  0.10% )  (35.73%)
       116,267,370      branches                  #   46.179 M/sec                    ( +-  0.08% )  (35.73%)
           111,966      branch-misses             #    0.10% of all branches          ( +-  2.98% )  (35.72%)
     2,294,727,165      L1-dcache-loads           #  911.424 M/sec                    ( +-  0.02% )  (35.71%)
     1,076,156,463      L1-dcache-load-misses     #   46.88% of all L1-dcache accesses ( +-  0.01% )  (35.70%)
        26,093,151      L1-icache-loads           #   10.364 M/sec                    ( +-  0.21% )  (35.71%)
           132,944      L1-icache-load-misses     #    0.51% of all L1-icache accesses ( +-  0.55% )  (35.70%)
            30,925      dTLB-loads                #   12.283 K/sec                    ( +-  5.70% )  (35.71%)
            27,437      dTLB-load-misses          #   86.22% of all dTLB cache accesses ( +-  1.98% )  (35.70%)
                 0      iTLB-loads                #    0.000 /sec                     (35.71%)
                11      iTLB-load-misses          #   62.50% of all iTLB cache accesses ( +-140.21% )  (35.70%)

           2.51890 +- 0.00433 seconds time elapsed  ( +-  0.17% )

 Performance counter stats for 'numactl -m 0 -N 0 x86/map_hugetlb_1G' (10 runs):

          1,013.59 msec task-clock                #    1.001 CPUs utilized            ( +-  0.07% )
                 2      context-switches          #    1.978 /sec                     ( +- 12.91% )
                 1      cpu-migrations            #    0.989 /sec
               213      page-faults               #  210.634 /sec                     ( +-  0.17% )
     3,169,391,694      cycles                    #    3.134 GHz                      ( +-  0.07% )  (35.53%)
           109,925      stalled-cycles-frontend   #    0.00% frontend cycles idle     ( +-  5.56% )  (35.63%)
       950,638,913      stalled-cycles-backend    #   30.06% backend cycles idle      ( +-  5.06% )  (35.73%)
        51,189,571      instructions              #    0.02  insn per cycle
                                                  #   21.03  stalled cycles per insn  ( +-  1.22% )  (35.82%)
         9,545,941      branches                  #    9.440 M/sec                    ( +-  1.50% )  (35.92%)
            86,836      branch-misses             #    0.88% of all branches          ( +-  3.74% )  (36.00%)
        46,109,587      L1-dcache-loads           #   45.597 M/sec                    ( +-  3.92% )  (35.96%)
        13,796,172      L1-dcache-load-misses     #   41.77% of all L1-dcache accesses ( +-  4.81% )  (35.85%)
         1,179,166      L1-icache-loads           #    1.166 M/sec                    ( +-  1.22% )  (35.77%)
            21,528      L1-icache-load-misses     #    1.90% of all L1-icache accesses ( +-  1.85% )  (35.66%)
            14,529      dTLB-loads                #   14.368 K/sec                    ( +-  4.65% )  (35.57%)
             8,505      dTLB-load-misses          #   67.88% of all dTLB cache accesses ( +-  5.61% )  (35.52%)
                 0      iTLB-loads                #    0.000 /sec                     (35.52%)
                 8      iTLB-load-misses          #    0.00% of all iTLB cache accesses ( +-267.99% )  (35.52%)

          1.012439 +- 0.000723 seconds time elapsed  ( +-  0.07% )


Please feel free to carry, for any minor changes:

Tested-by: Raghavendra K T <raghavendra.kt@xxxxxxx>

Thanks and Regards
- Raghu



