Re: [PATCH 0/9] x86/clear_huge_page: multi-page clearing

Raghavendra K T <raghavendra.kt@xxxxxxx> · Thu, 6 Apr 2023 01:18:41 +0530

On 4/3/2023 10:52 AM, Ankur Arora wrote:
This series introduces multi-page clearing for hugepages.

This is a follow up of some of the ideas discussed at:
   https://lore.kernel.org/lkml/CAHk-=wj9En-BC4t7J9xFZOws5ShwaR9yor7FxHZr8CTVyEP_+Q@xxxxxxxxxxxxxx/

On x86, page clearing is typically done via string instructions. These,
unlike a MOV loop, allow us to explicitly advertise the region size to
the processor, which could serve as a hint to current (and/or
future) uarchs to elide cacheline allocation.

Among current-generation processors, Milan (and presumably other Zen
variants) uses the hint to elide cacheline allocation (for
region-size > LLC-size).

An additional reason for doing this is that string instructions are typically
microcoded, and clearing in bigger chunks than the current page-at-a-
time logic amortizes some of the cost.
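
To make the page-at-a-time versus chunked distinction concrete, here is a minimal user-space sketch. It is not the code from this series: clear_region(), clear_pages_one_by_one() and clear_pages_chunked() are hypothetical names, PAGE_SZ assumes 4K base pages, and the inline asm path assumes an x86-64 GCC/Clang build (other builds fall back to memset()).

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 4096UL	/* assumption: 4K base pages */

/* Zero len bytes at dst, using one string instruction where available. */
static void clear_region(void *dst, size_t len)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
	/* REP STOSB: RDI = destination, RCX = byte count, AL = fill value. */
	__asm__ volatile("rep stosb"
			 : "+D"(dst), "+c"(len)
			 : "a"(0)
			 : "memory");
#else
	memset(dst, 0, len);
#endif
}

/* Page-at-a-time: the processor only ever sees a 4K extent. */
static void clear_pages_one_by_one(void *base, size_t npages)
{
	for (size_t i = 0; i < npages; i++)
		clear_region((char *)base + i * PAGE_SZ, PAGE_SZ);
}

/* Chunked: a single instruction advertises the whole region size. */
static void clear_pages_chunked(void *base, size_t npages)
{
	clear_region(base, npages * PAGE_SZ);
}
```

Both variants zero the same bytes; the only difference is the extent passed to each string instruction, which is what the region-size hint and the microcode amortization argument depend on.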

All uarchs tested (Milan, Icelakex, Skylakex) showed improved performance.

There are, however, some problems:

1. Extended zeroing periods mean increased latency due to
    the now-missing preemption points.

    That's handled in patches 7, 8, 9:
      "sched: define TIF_ALLOW_RESCHED"
      "irqentry: define irqentry_exit_allow_resched()"
      "x86/clear_huge_page: make clear_contig_region() preemptible"
    by the context marking itself reschedulable, and rescheduling in
    irqexit context if needed (for PREEMPTION_NONE/_VOLUNTARY.)

2. The current page-at-a-time clearing logic does left-right narrowing
    towards the faulting page, which maintains cache locality for
    workloads with a sequential access pattern. Clearing in large
    chunks loses that.

    Some (but not all) of that could be ameliorated by something like
    this patch:
    https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@xxxxxxxxxx/

    But, before doing that I'd like some comments on whether that is
    worth doing for this specific use case?
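
For readers unfamiliar with the left-right narrowing mentioned in point 2, the following is a simplified sketch of the ordering. It is modeled on my understanding of process_huge_page() in mm/memory.c and is not copied from it; narrowing_order() is a hypothetical helper that only records subpage indices instead of clearing anything, and it assumes target < nr.

```c
#include <stddef.h>

/*
 * Fill order[] with the sequence in which the nr subpages of a huge page
 * would be cleared when narrowing towards subpage `target`.
 * Returns the number of entries written (always nr).
 */
static size_t narrowing_order(size_t nr, size_t target, size_t *order)
{
	size_t out = 0, base, l, i;

	if (2 * target <= nr) {
		/* Target in the first half: clear the far tail first, right to left. */
		base = 0;
		l = target;
		for (i = nr; i > 2 * target; i--)
			order[out++] = i - 1;
	} else {
		/* Target in the second half: clear the far head first, left to right. */
		base = nr - 2 * (nr - target);
		l = nr - target;
		for (i = 0; i < base; i++)
			order[out++] = i;
	}

	/* Converge on the target from both sides; the target is written last. */
	for (i = 0; i < l; i++) {
		order[out++] = base + i;
		order[out++] = base + 2 * l - 1 - i;
	}
	return out;
}
```

With nr=8 and target=2 this yields 7, 6, 5, 4, 0, 3, 1, 2, so the faulting subpage and its neighbours are the most recently written lines when the fault returns. A single large clear has no such ordering, which is the locality loss described above.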

Rest of the series:
   Patches 1, 2, 3:
     "huge_pages: get rid of process_huge_page()"
     "huge_page: get rid of {clear,copy}_subpage()"
     "huge_page: allow arch override for clear/copy_huge_page()"
   are mechanical and they simplify some of the current clear_huge_page()
   logic.

   Patches 4, 5:
   "x86/clear_page: parameterize clear_page*() to specify length"
   "x86/clear_pages: add clear_pages()"

   add clear_pages() and helpers.

   Patch 6: "mm/clear_huge_page: use multi-page clearing" adds the
   chunked x86 clear_huge_page() implementation.

Performance
==

Demand fault performance gets a decent boost:

   *Icelakex*  mm/clear_huge_page   x86/clear_huge_page   change
                           (GB/s)                (GB/s)

   pg-sz=2MB                 8.76                 11.82   +34.93%
   pg-sz=1GB                 8.99                 12.18   +35.48%

   *Milan*     mm/clear_huge_page   x86/clear_huge_page   change
                           (GB/s)                (GB/s)

   pg-sz=2MB                12.24                 17.54    +43.30%
   pg-sz=1GB                17.98                 37.24   +107.11%

vm-scalability/case-anon-w-seq-hugetlb gains in stime but performs
worse when user space touches those pages:

   *Icelakex*                  mm/clear_huge_page   x86/clear_huge_page   change
   (mem=4GB/task, tasks=128)

   stime                           293.02 +- .49%        239.39 +- .83%   -18.30%
   utime                           440.11 +- .28%        508.74 +- .60%   +15.59%
   wall-clock                        5.96 +- .33%          6.27 +-2.23%   + 5.20%

   *Milan*                     mm/clear_huge_page   x86/clear_huge_page   change
   (mem=1GB/task, tasks=512)

   stime                          490.95 +- 3.55%       466.90 +- 4.79%   - 4.89%
   utime                          276.43 +- 2.85%       311.97 +- 5.15%   +12.85%
   wall-clock                       3.74 +- 6.41%         3.58 +- 7.82%   - 4.27%

Also at:
   github.com/terminus/linux clear-pages.v1

Comments appreciated!

Hello Ankur,

I was able to test your patches. To summarize, I am seeing roughly a 2x-2.6x
perf improvement for the 2M and 1GB base hugepage sizes.

SUT: Genoa AMD EPYC
   Thread(s) per core:  2
   Core(s) per socket:  128
   Socket(s):           2

NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-127,256-383
  NUMA node1 CPU(s):     128-255,384-511

Test:  Use mmap(MAP_HUGETLB) to demand-fault a 64GB region (NUMA node0),
for both base-hugepage-size=2M and 1GB.

perf stat -r 10 -d -d  numactl -m 0 -N 0 <test>
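
The test programs themselves are not included in this mail. A minimal sketch of the kind of demand-fault loop involved could look like the following; fault_in_region() is a hypothetical helper, not the actual map_hugetlb_2M/map_hugetlb_1G source, and the caller is assumed to choose the length, stride and mmap flags.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory with extra_flags OR-ed in (for example
 * MAP_HUGETLB), then write one byte per stride so that every page in the
 * region is demand-faulted and therefore zeroed by the kernel.
 * Returns the number of writes performed, or -1 if mmap() fails.
 */
static long fault_in_region(size_t len, size_t stride, int extra_flags)
{
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS | extra_flags,
				-1, 0);
	long touched = 0;

	if (p == MAP_FAILED)
		return -1;

	for (size_t off = 0; off < len; off += stride) {
		p[off] = 1;	/* the first write to each page takes the fault */
		touched++;
	}

	munmap(p, len);
	return touched;
}
```

For a run like the one above, this would be called with len = 64GB, extra_flags = MAP_HUGETLB and a stride equal to the hugepage size, with enough hugepages reserved beforehand, and the binary wrapped in numactl and perf stat as shown. Without reserved hugepages the mmap() call fails and the helper returns -1.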

time in seconds elapsed (average of 10 runs) (lower = better)

Result:
page-size  mm/clear_huge_page   x86/clear_huge_page     change %
2M              5.4567          2.6774                  -50.93
1G              2.64452         1.011281                -61.76
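
For reference, the change % column is the relative difference in elapsed time against the mm/clear_huge_page baseline. A small sketch of that arithmetic (percent_change() and speedup() are illustrative helpers, not part of any tool used here):

```c
/* Relative change of `after` against `before`, in percent. */
static double percent_change(double before, double after)
{
	return (after - before) / before * 100.0;
}

/* Ratio of baseline time to new time; above 1.0 means faster. */
static double speedup(double before, double after)
{
	return before / after;
}
```

Plugging in the elapsed times gives about -50.93% (about 2.04x) for 2M and about -61.76% (about 2.62x) for 1G.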

Full perf stat info:

 page size = 2M mm/clear_huge_page

 Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):

          5,434.71 msec task-clock                #    0.996 CPUs utilized            ( +-  0.55% )
                 8      context-switches          #    1.466 /sec                     ( +-  4.66% )
                 0      cpu-migrations            #    0.000 /sec
            32,918      page-faults               #    6.034 K/sec                    ( +-  0.00% )
    16,977,242,482      cycles                    #    3.112 GHz                      ( +-  0.04% )  (35.70%)
         1,961,724      stalled-cycles-frontend   #    0.01% frontend cycles idle     ( +-  1.09% )  (35.72%)
        35,685,674      stalled-cycles-backend    #    0.21% backend cycles idle      ( +-  3.48% )  (35.74%)
     1,038,327,182      instructions              #    0.06  insn per cycle
                                                  #    0.04  stalled cycles per insn  ( +-  0.38% )  (35.75%)
       221,409,216      branches                  #   40.584 M/sec                    ( +-  0.36% )  (35.75%)
           350,730      branch-misses             #    0.16% of all branches          ( +-  1.18% )  (35.75%)
     2,520,888,779      L1-dcache-loads           #  462.077 M/sec                    ( +-  0.03% )  (35.73%)
     1,094,178,209      L1-dcache-load-misses     #   43.46% of all L1-dcache accesses  ( +-  0.02% )  (35.71%)
        67,751,730      L1-icache-loads           #   12.419 M/sec                    ( +-  0.11% )  (35.70%)
           271,118      L1-icache-load-misses     #    0.40% of all L1-icache accesses  ( +-  2.55% )  (35.70%)
           506,635      dTLB-loads                #   92.866 K/sec                    ( +-  3.31% )  (35.70%)
           237,385      dTLB-load-misses          #   43.64% of all dTLB cache accesses  ( +-  7.00% )  (35.69%)
               268      iTLB-load-misses          # 6700.00% of all iTLB cache accesses  ( +- 13.86% )  (35.70%)

            5.4567 +- 0.0300 seconds time elapsed  ( +-  0.55% )

 page size = 2M x86/clear_huge_page
 Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_2M' (10 runs):

          2,780.69 msec task-clock                #    1.039 CPUs utilized            ( +-  1.03% )
                 3      context-switches          #    1.121 /sec                     ( +- 21.34% )
                 0      cpu-migrations            #    0.000 /sec
            32,918      page-faults               #   12.301 K/sec                    ( +-  0.00% )
     8,143,619,771      cycles                    #    3.043 GHz                      ( +-  0.25% )  (35.62%)
         2,024,872      stalled-cycles-frontend   #    0.02% frontend cycles idle     ( +-320.93% )  (35.66%)
       717,198,728      stalled-cycles-backend    #    8.82% backend cycles idle      ( +-  8.26% )  (35.69%)
       606,549,334      instructions              #    0.07  insn per cycle
                                                  #    1.39  stalled cycles per insn  ( +-  0.23% )  (35.73%)
       108,856,550      branches                  #   40.677 M/sec                    ( +-  0.24% )  (35.76%)
           202,490      branch-misses             #    0.18% of all branches          ( +-  3.58% )  (35.78%)
     2,348,818,806      L1-dcache-loads           #  877.701 M/sec                    ( +-  0.03% )  (35.78%)
     1,081,562,988      L1-dcache-load-misses     #   46.04% of all L1-dcache accesses  ( +-  0.01% )  (35.78%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
        43,411,167      L1-icache-loads           #   16.222 M/sec                    ( +-  0.19% )  (35.77%)
           273,042      L1-icache-load-misses     #    0.64% of all L1-icache accesses  ( +-  4.94% )  (35.76%)
           834,482      dTLB-loads                #  311.827 K/sec                    ( +-  9.73% )  (35.72%)
           437,343      dTLB-load-misses          #   65.86% of all dTLB cache accesses  ( +-  8.56% )  (35.68%)
                 0      iTLB-loads                #    0.000 /sec                     (35.65%)
               160      iTLB-load-misses          # 1777.78% of all iTLB cache accesses  ( +- 15.82% )  (35.62%)

            2.6774 +- 0.0287 seconds time elapsed  ( +-  1.07% )

 page size = 1G mm/clear_huge_page
 Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

          2,625.24 msec task-clock                #    0.993 CPUs utilized            ( +-  0.23% )
                 4      context-switches          #    1.513 /sec                     ( +-  4.49% )
                 1      cpu-migrations            #    0.378 /sec
               214      page-faults               #   80.965 /sec                     ( +-  0.13% )
     8,178,624,349      cycles                    #    3.094 GHz                      ( +-  0.23% )  (35.65%)
         2,942,576      stalled-cycles-frontend   #    0.04% frontend cycles idle     ( +- 75.22% )  (35.69%)
         7,117,425      stalled-cycles-backend    #    0.09% backend cycles idle      ( +-  3.79% )  (35.73%)
       454,521,647      instructions              #    0.06  insn per cycle
                                                  #    0.02  stalled cycles per insn  ( +-  0.10% )  (35.77%)
       113,223,853      branches                  #   42.837 M/sec                    ( +-  0.08% )  (35.80%)
            84,766      branch-misses             #    0.07% of all branches          ( +-  5.37% )  (35.80%)
     2,294,528,890      L1-dcache-loads           #  868.111 M/sec                    ( +-  0.02% )  (35.81%)
     1,075,907,551      L1-dcache-load-misses     #   46.88% of all L1-dcache accesses  ( +-  0.02% )  (35.78%)
        26,167,323      L1-icache-loads           #    9.900 M/sec                    ( +-  0.24% )  (35.74%)
           139,675      L1-icache-load-misses     #    0.54% of all L1-icache accesses  ( +-  0.37% )  (35.70%)
             3,459      dTLB-loads                #    1.309 K/sec                    ( +- 12.75% )  (35.67%)
               732      dTLB-load-misses          #   19.71% of all dTLB cache accesses  ( +- 26.61% )  (35.62%)
                11      iTLB-load-misses          #  192.98% of all iTLB cache accesses  ( +-238.28% )  (35.62%)

           2.64452 +- 0.00600 seconds time elapsed  ( +-  0.23% )

 page size = 1G x86/clear_huge_page
 Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

          1,009.09 msec task-clock                #    0.998 CPUs utilized            ( +-  0.06% )
                 2      context-switches          #    1.980 /sec                     ( +- 23.63% )
                 1      cpu-migrations            #    0.990 /sec
               214      page-faults               #  211.887 /sec                     ( +-  0.16% )
     3,154,980,463      cycles                    #    3.124 GHz                      ( +-  0.06% )  (35.77%)
           145,051      stalled-cycles-frontend   #    0.00% frontend cycles idle     ( +-  6.26% )  (35.78%)
       730,087,143      stalled-cycles-backend    #   23.12% backend cycles idle      ( +-  9.75% )  (35.78%)
        45,813,391      instructions              #    0.01  insn per cycle
                                                  #   18.51  stalled cycles per insn  ( +-  1.00% )  (35.78%)
         8,498,282      branches                  #    8.414 M/sec                    ( +-  1.54% )  (35.78%)
            63,351      branch-misses             #    0.74% of all branches          ( +-  6.70% )  (35.69%)
        29,135,863      L1-dcache-loads           #   28.848 M/sec                    ( +-  5.67% )  (35.68%)
         8,537,280      L1-dcache-load-misses     #   28.66% of all L1-dcache accesses  ( +- 10.15% )  (35.68%)
         1,040,087      L1-icache-loads           #    1.030 M/sec                    ( +-  1.60% )  (35.68%)
             9,147      L1-icache-load-misses     #    0.85% of all L1-icache accesses  ( +-  6.50% )  (35.67%)
             1,084      dTLB-loads                #    1.073 K/sec                    ( +- 12.05% )  (35.68%)
               431      dTLB-load-misses          #   40.28% of all dTLB cache accesses  ( +- 43.46% )  (35.68%)
                16      iTLB-load-misses          #    0.00% of all iTLB cache accesses  ( +- 40.54% )  (35.68%)

          1.011281 +- 0.000624 seconds time elapsed  ( +-  0.06% )

Please feel free to add

Tested-by: Raghavendra K T <raghavendra.kt@xxxxxxx>

I will come back with further observations on the patches and performance, if any.

Thanks and Regards