Re: [PATCH v2 0/9] x86/clear_huge_page: multi-page clearing

On 8/31/2023 12:19 AM, Ankur Arora wrote:
> This series adds a multi-page clearing primitive, clear_pages(),
> which enables more effective use of x86 string instructions by
> advertising the real region-size to be cleared.
>
> Region-size can be used as a hint by uarchs to optimize the
> clearing.
>
> Also add allow_resched(), which marks a code-section as allowing
> rescheduling in the irqentry_exit path. This allows clear_pages()
> to get by without having to call cond_resched() periodically.
> (preempt_model_full() already handles this via
> irqentry_exit_cond_resched(), so we handle this similarly for
> preempt_model_none() and preempt_model_voluntary().)



Hello Ankur,
Thanks for the patches.

I tried the patches; the improvements look similar to v1 (even without
the circuitous chunk optimizations). We still see a similar 50-60%
improvement for the 2M and 1G page sizes.
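
To make it concrete for anyone skimming the thread: my mental model of
the primitive is a single string store over the whole extent, instead of
a per-4K memset loop, along these lines. This is an illustrative sketch
only, not code from the series; clear_pages_sketch() and the hardcoded
4096 page size are mine.

	/*
	 * Illustrative sketch (not from this series): clear an extent of
	 * pages with one REP STOSB, so the CPU sees the full region length
	 * up front and can pick a wider clearing strategy internally.
	 */
	static inline void clear_pages_sketch(void *page, unsigned long npages)
	{
		void *dest = page;
		unsigned long count = npages * 4096;	/* bytes to clear */

		asm volatile("rep stosb"
			     : "+D" (dest), "+c" (count)	/* RDI, RCX */
			     : "a" (0)				/* AL = 0 */
			     : "memory");
	}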
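Likewise for the preemption side, as the cover letter describes it: the
long clear is bracketed so that irqentry_exit can reschedule the task,
rather than chunking the work around cond_resched(). A rough usage
sketch follows; allow_resched() is named in the cover letter, while
clear_contig_region() and the paired disallow_resched() marker are my
guesses, not necessarily what the patches use.

	/*
	 * Usage pattern as described in the cover letter. allow_resched()
	 * comes from the series; the paired disallow_resched() name is a
	 * hypothetical end-of-section marker.
	 */
	static void clear_contig_region(void *addr, unsigned long npages)
	{
		allow_resched();	/* irqentry_exit may now reschedule us */
		clear_pages(addr, npages);
		disallow_resched();	/* hypothetical: close the section */
	}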


SUT: Bergamo
    CPU family:          25
    Model:               160
    Thread(s) per core:  2
    Core(s) per socket:  128
    Socket(s):           2

NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-127,256-383
  NUMA node1 CPU(s):     128-255,384-511

Test: use mmap(MAP_HUGETLB) to demand-fault a 64GB region (on NUMA
node0), for both base hugepage sizes, 2M and 1G. The current results are
with thp=always, but madvise also did not make much difference.

perf stat -r 10 -d -d  numactl -m 0 -N 0 <test>
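
For anyone who wants to reproduce: the test is essentially of the
following shape. This is a minimal sketch under my assumptions, not the
exact mm/map_hugetlb source, and it assumes enough 2M hugepages have
been reserved (e.g. via /proc/sys/vm/nr_hugepages).

	/* Minimal sketch of the 2M demand-fault test (not the exact source). */
	#include <stdio.h>
	#include <sys/mman.h>

	#define MAP_LEN  (64UL << 30)		/* 64GB region */
	#define HPAGE_2M (2UL << 20)

	#ifndef MAP_HUGE_2MB
	#define MAP_HUGE_2MB (21 << 26)		/* log2(2M) << MAP_HUGE_SHIFT */
	#endif

	int main(void)
	{
		unsigned long off;
		char *p = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_2MB,
			       -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* One write per huge page: each store demand-faults the page,
		 * and the kernel clears the whole 2M page in the fault path. */
		for (off = 0; off < MAP_LEN; off += HPAGE_2M)
			p[off] = 1;

		munmap(p, MAP_LEN);
		return 0;
	}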

Time elapsed in seconds (average of 10 runs; lower is better).

Result:
base:    mm/clear_huge_page
patched: x86/clear_huge_page

page-size  base (s)   patched (s)  Improvement %
2M         5.0779     2.50623      50.64
1G         2.51890    1.012439     59.81

(Improvement % = (base - patched) / base * 100.)

More details:

 Performance counter stats for 'mm/map_hugetlb' (10 runs):

          5,058.71 msec task-clock                #    0.996 CPUs utilized            ( +-  0.26% )
                 8      context-switches          #    1.576 /sec                     ( +-  7.23% )
                 0      cpu-migrations            #    0.000 /sec
            32,917      page-faults               #    6.484 K/sec                    ( +-  0.00% )
    15,797,804,067      cycles                    #    3.112 GHz                      ( +-  0.26% )  (35.70%)
         2,073,754      stalled-cycles-frontend   #    0.01% frontend cycles idle     ( +-  1.25% )  (35.71%)
        27,508,977      stalled-cycles-backend    #    0.17% backend cycles idle      ( +-  9.48% )  (35.74%)
     1,143,710,651      instructions              #    0.07  insn per cycle
                                                  #    0.03  stalled cycles per insn  ( +-  0.15% )  (35.76%)
       243,817,330      branches                  #   48.028 M/sec                    ( +-  0.12% )  (35.78%)
           357,760      branch-misses             #    0.15% of all branches          ( +-  1.52% )  (35.75%)
     2,540,733,497      L1-dcache-loads           #  500.483 M/sec                    ( +-  0.04% )  (35.74%)
     1,093,660,557      L1-dcache-load-misses     #   42.98% of all L1-dcache accesses ( +-  0.03% )  (35.71%)
        73,335,478      L1-icache-loads           #   14.446 M/sec                    ( +-  0.08% )  (35.70%)
           878,378      L1-icache-load-misses     #    1.19% of all L1-icache accesses ( +-  2.65% )  (35.68%)
         1,025,714      dTLB-loads                #  202.049 K/sec                    ( +-  2.70% )  (35.69%)
           405,407      dTLB-load-misses          #   37.35% of all dTLB cache accesses ( +-  1.59% )  (35.68%)
                 2      iTLB-loads                #    0.394 /sec                     ( +- 41.63% )  (35.68%)
            40,356      iTLB-load-misses          # 1552153.85% of all iTLB cache accesses ( +-  7.18% )  (35.68%)

            5.0779 +- 0.0132 seconds time elapsed  ( +-  0.26% )

 Performance counter stats for 'numactl -m 0 -N 0 x86/map_hugetlb' (10 runs):

          2,538.40 msec task-clock                #    1.013 CPUs utilized            ( +-  0.27% )
                 4      context-switches          #    1.597 /sec                     ( +-  6.51% )
                 1      cpu-migrations            #    0.399 /sec
            32,916      page-faults               #   13.140 K/sec                    ( +-  0.00% )
     7,901,830,782      cycles                    #    3.154 GHz                      ( +-  0.27% )  (35.67%)
         6,590,473      stalled-cycles-frontend   #    0.08% frontend cycles idle     ( +- 10.31% )  (35.71%)
       329,970,288      stalled-cycles-backend    #    4.23% backend cycles idle      ( +- 13.65% )  (35.74%)
       725,811,962      instructions              #    0.09  insn per cycle
                                                  #    0.80  stalled cycles per insn  ( +-  0.37% )  (35.78%)
       132,182,704      branches                  #   52.767 M/sec                    ( +-  0.26% )  (35.82%)
           254,163      branch-misses             #    0.19% of all branches          ( +-  2.47% )  (35.81%)
     2,382,927,453      L1-dcache-loads           #  951.262 M/sec                    ( +-  0.04% )  (35.77%)
     1,082,022,067      L1-dcache-load-misses     #   45.41% of all L1-dcache accesses ( +-  0.02% )  (35.74%)
        47,164,491      L1-icache-loads           #   18.828 M/sec                    ( +-  0.37% )  (35.70%)
           474,535      L1-icache-load-misses     #    0.99% of all L1-icache accesses ( +-  2.93% )  (35.66%)
         1,477,334      dTLB-loads                #  589.750 K/sec                    ( +-  5.12% )  (35.65%)
           624,125      dTLB-load-misses          #   56.24% of all dTLB cache accesses ( +-  5.66% )  (35.65%)
                 0      iTLB-loads                #    0.000 /sec                     (35.65%)
             1,626      iTLB-load-misses          # 7069.57% of all iTLB cache accesses ( +-283.51% )  (35.65%)

           2.50623 +- 0.00691 seconds time elapsed  ( +-  0.28% )


 Performance counter stats for 'numactl -m 0 -N 0 mm/map_hugetlb_1G' (10 runs):

          2,506.50 msec task-clock                #    0.995 CPUs utilized            ( +-  0.17% )
                 4      context-switches          #    1.589 /sec                     ( +-  9.28% )
                 0      cpu-migrations            #    0.000 /sec
               214      page-faults               #   84.997 /sec                     ( +-  0.13% )
     7,821,519,053      cycles                    #    3.107 GHz                      ( +-  0.17% )  (35.72%)
         2,037,744      stalled-cycles-frontend   #    0.03% frontend cycles idle     ( +- 25.62% )  (35.73%)
         6,578,899      stalled-cycles-backend    #    0.08% backend cycles idle      ( +-  2.65% )  (35.73%)
       468,648,780      instructions              #    0.06  insn per cycle
                                                  #    0.01  stalled cycles per insn  ( +-  0.10% )  (35.73%)
       116,267,370      branches                  #   46.179 M/sec                    ( +-  0.08% )  (35.73%)
           111,966      branch-misses             #    0.10% of all branches          ( +-  2.98% )  (35.72%)
     2,294,727,165      L1-dcache-loads           #  911.424 M/sec                    ( +-  0.02% )  (35.71%)
     1,076,156,463      L1-dcache-load-misses     #   46.88% of all L1-dcache accesses ( +-  0.01% )  (35.70%)
        26,093,151      L1-icache-loads           #   10.364 M/sec                    ( +-  0.21% )  (35.71%)
           132,944      L1-icache-load-misses     #    0.51% of all L1-icache accesses ( +-  0.55% )  (35.70%)
            30,925      dTLB-loads                #   12.283 K/sec                    ( +-  5.70% )  (35.71%)
            27,437      dTLB-load-misses          #   86.22% of all dTLB cache accesses ( +-  1.98% )  (35.70%)
                 0      iTLB-loads                #    0.000 /sec                     (35.71%)
                11      iTLB-load-misses          #   62.50% of all iTLB cache accesses ( +-140.21% )  (35.70%)

           2.51890 +- 0.00433 seconds time elapsed  ( +-  0.17% )

 Performance counter stats for 'numactl -m 0 -N 0 x86/map_hugetlb_1G' (10 runs):

          1,013.59 msec task-clock                #    1.001 CPUs utilized            ( +-  0.07% )
                 2      context-switches          #    1.978 /sec                     ( +- 12.91% )
                 1      cpu-migrations            #    0.989 /sec
               213      page-faults               #  210.634 /sec                     ( +-  0.17% )
     3,169,391,694      cycles                    #    3.134 GHz                      ( +-  0.07% )  (35.53%)
           109,925      stalled-cycles-frontend   #    0.00% frontend cycles idle     ( +-  5.56% )  (35.63%)
       950,638,913      stalled-cycles-backend    #   30.06% backend cycles idle      ( +-  5.06% )  (35.73%)
        51,189,571      instructions              #    0.02  insn per cycle
                                                  #   21.03  stalled cycles per insn  ( +-  1.22% )  (35.82%)
         9,545,941      branches                  #    9.440 M/sec                    ( +-  1.50% )  (35.92%)
            86,836      branch-misses             #    0.88% of all branches          ( +-  3.74% )  (36.00%)
        46,109,587      L1-dcache-loads           #   45.597 M/sec                    ( +-  3.92% )  (35.96%)
        13,796,172      L1-dcache-load-misses     #   41.77% of all L1-dcache accesses ( +-  4.81% )  (35.85%)
         1,179,166      L1-icache-loads           #    1.166 M/sec                    ( +-  1.22% )  (35.77%)
            21,528      L1-icache-load-misses     #    1.90% of all L1-icache accesses ( +-  1.85% )  (35.66%)
            14,529      dTLB-loads                #   14.368 K/sec                    ( +-  4.65% )  (35.57%)
             8,505      dTLB-load-misses          #   67.88% of all dTLB cache accesses ( +-  5.61% )  (35.52%)
                 0      iTLB-loads                #    0.000 /sec                     (35.52%)
                 8      iTLB-load-misses          #    0.00% of all iTLB cache accesses ( +-267.99% )  (35.52%)

          1.012439 +- 0.000723 seconds time elapsed  ( +-  0.07% )


Please feel free to carry, for any minor changes:

Tested-by: Raghavendra K T <raghavendra.kt@xxxxxxx>

Thanks and Regards
- Raghu



