On 8/31/2023 12:19 AM, Ankur Arora wrote:
> This series adds a multi-page clearing primitive, clear_pages(),
> which enables more effective use of x86 string instructions by
> advertising the real region-size to be cleared.
>
> Region-size can be used as a hint by uarchs to optimize the
> clearing.
>
> Also add allow_resched() which marks a code-section as allowing
> rescheduling in the irqentry_exit path. This allows clear_pages()
> to get by without having to call cond_resched() periodically.
> (preempt_model_full() already handles this via
> irqentry_exit_cond_resched(), so we handle this similarly for
> preempt_model_none() and preempt_model_voluntary().)
Hello Ankur,

Thanks for the patches.

I tried the patches. The improvements look similar to V1 (even without
the circuitous chunk optimizations): we still see a 50-60% improvement
for both the 2M and 1G page sizes.
SUT: Bergamo
CPU family: 25
Model: 160
Thread(s) per core: 2
Core(s) per socket: 128
Socket(s): 2
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-127,256-383
NUMA node1 CPU(s): 128-255,384-511
Test: Use mmap(MAP_HUGETLB) to demand-fault a 64GB region (NUMA
node0), for both base-hugepage-size=2M and 1GB.
The current results are with thp=always, but thp=madvise also did not
make much difference.
perf stat -r 10 -d -d numactl -m 0 -N 0 <test>
time in seconds elapsed (average of 10 runs) (lower = better)
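
For reference, the test described above can be approximated with a
sketch like the one below. This is not the actual map_hugetlb test
source; touch_pages() and fault_hugetlb() are hypothetical helper names,
and the sketch assumes huge pages of the requested size have already
been reserved (otherwise mmap() fails and the helper returns -1).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26	/* Linux UAPI value; fallback for older headers */
#endif

/*
 * Write one byte at the start of every page in [buf, buf + len) so that
 * each page is demand-faulted. Returns the number of pages touched.
 */
size_t touch_pages(volatile char *buf, size_t len, size_t page_size)
{
	size_t n = 0;

	for (size_t off = 0; off < len; off += page_size) {
		buf[off] = 1;
		n++;
	}
	return n;
}

/*
 * Map len bytes of anonymous hugetlb memory with page size
 * 2^page_shift (21 for 2M, 30 for 1G) and fault it all in.
 * Returns the number of pages touched, or -1 if mmap() fails.
 */
long fault_hugetlb(size_t len, int page_shift)
{
	int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
		    (int)((unsigned int)page_shift << MAP_HUGE_SHIFT);
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);
	long n;

	if (buf == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return -1;
	}
	n = (long)touch_pages(buf, len, (size_t)1 << page_shift);
	munmap(buf, len);
	return n;
}
```

A run comparable to the one measured here would call
fault_hugetlb((size_t)64 << 30, 21) for 2M pages or
fault_hugetlb((size_t)64 << 30, 30) for 1G pages from main(), under
numactl -m 0 -N 0 as in the command above.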
Result:
base: mm/clear_huge_page
patched: x86/clear_huge_page
page-size    base (s)    patched (s)    Improvement %
2M           5.0779      2.50623        50.64
1G           2.51890     1.012439       59.81
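
The Improvement % column is the relative reduction in elapsed time. A
minimal sketch of that arithmetic, assuming the "seconds time elapsed"
figures from the perf stat output are the inputs (improvement_pct() is
a name chosen here for illustration):

```c
/*
 * Percentage reduction in elapsed time of the patched run relative to
 * the base run: (base - patched) / base * 100.
 */
double improvement_pct(double base_sec, double patched_sec)
{
	return (base_sec - patched_sec) / base_sec * 100.0;
}
```

For the 2M case, improvement_pct(5.0779, 2.50623) evaluates to roughly
50.64.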
More details:
Performance counter stats for 'mm/map_hugetlb' (10 runs):
      5,058.71 msec task-clock              # 0.996 CPUs utilized  ( +- 0.26% )
             8      context-switches        # 1.576 /sec  ( +- 7.23% )
             0      cpu-migrations          # 0.000 /sec
        32,917      page-faults             # 6.484 K/sec  ( +- 0.00% )
15,797,804,067      cycles                  # 3.112 GHz  ( +- 0.26% )  (35.70%)
     2,073,754      stalled-cycles-frontend # 0.01% frontend cycles idle  ( +- 1.25% )  (35.71%)
    27,508,977      stalled-cycles-backend  # 0.17% backend cycles idle  ( +- 9.48% )  (35.74%)
 1,143,710,651      instructions            # 0.07 insn per cycle
                                            # 0.03 stalled cycles per insn  ( +- 0.15% )  (35.76%)
   243,817,330      branches                # 48.028 M/sec  ( +- 0.12% )  (35.78%)
       357,760      branch-misses           # 0.15% of all branches  ( +- 1.52% )  (35.75%)
 2,540,733,497      L1-dcache-loads         # 500.483 M/sec  ( +- 0.04% )  (35.74%)
 1,093,660,557      L1-dcache-load-misses   # 42.98% of all L1-dcache accesses  ( +- 0.03% )  (35.71%)
    73,335,478      L1-icache-loads         # 14.446 M/sec  ( +- 0.08% )  (35.70%)
       878,378      L1-icache-load-misses   # 1.19% of all L1-icache accesses  ( +- 2.65% )  (35.68%)
     1,025,714      dTLB-loads              # 202.049 K/sec  ( +- 2.70% )  (35.69%)
       405,407      dTLB-load-misses        # 37.35% of all dTLB cache accesses  ( +- 1.59% )  (35.68%)
             2      iTLB-loads              # 0.394 /sec  ( +- 41.63% )  (35.68%)
        40,356      iTLB-load-misses        # 1552153.85% of all iTLB cache accesses  ( +- 7.18% )  (35.68%)

        5.0779 +- 0.0132 seconds time elapsed  ( +- 0.26% )

Performance counter stats for 'numactl -m 0 -N 0 x86/map_hugetlb' (10 runs):
      2,538.40 msec task-clock              # 1.013 CPUs utilized  ( +- 0.27% )
             4      context-switches        # 1.597 /sec  ( +- 6.51% )
             1      cpu-migrations          # 0.399 /sec
        32,916      page-faults             # 13.140 K/sec  ( +- 0.00% )
 7,901,830,782      cycles                  # 3.154 GHz  ( +- 0.27% )  (35.67%)
     6,590,473      stalled-cycles-frontend # 0.08% frontend cycles idle  ( +- 10.31% )  (35.71%)
   329,970,288      stalled-cycles-backend  # 4.23% backend cycles idle  ( +- 13.65% )  (35.74%)
   725,811,962      instructions            # 0.09 insn per cycle
                                            # 0.80 stalled cycles per insn  ( +- 0.37% )  (35.78%)
   132,182,704      branches                # 52.767 M/sec  ( +- 0.26% )  (35.82%)
       254,163      branch-misses           # 0.19% of all branches  ( +- 2.47% )  (35.81%)
 2,382,927,453      L1-dcache-loads         # 951.262 M/sec  ( +- 0.04% )  (35.77%)
 1,082,022,067      L1-dcache-load-misses   # 45.41% of all L1-dcache accesses  ( +- 0.02% )  (35.74%)
    47,164,491      L1-icache-loads         # 18.828 M/sec  ( +- 0.37% )  (35.70%)
       474,535      L1-icache-load-misses   # 0.99% of all L1-icache accesses  ( +- 2.93% )  (35.66%)
     1,477,334      dTLB-loads              # 589.750 K/sec  ( +- 5.12% )  (35.65%)
       624,125      dTLB-load-misses        # 56.24% of all dTLB cache accesses  ( +- 5.66% )  (35.65%)
             0      iTLB-loads              # 0.000 /sec  (35.65%)
         1,626      iTLB-load-misses        # 7069.57% of all iTLB cache accesses  ( +-283.51% )  (35.65%)

       2.50623 +- 0.00691 seconds time elapsed  ( +- 0.28% )

Performance counter stats for 'numactl -m 0 -N 0 mm/map_hugetlb_1G' (10 runs):
      2,506.50 msec task-clock              # 0.995 CPUs utilized  ( +- 0.17% )
             4      context-switches        # 1.589 /sec  ( +- 9.28% )
             0      cpu-migrations          # 0.000 /sec
           214      page-faults             # 84.997 /sec  ( +- 0.13% )
 7,821,519,053      cycles                  # 3.107 GHz  ( +- 0.17% )  (35.72%)
     2,037,744      stalled-cycles-frontend # 0.03% frontend cycles idle  ( +- 25.62% )  (35.73%)
     6,578,899      stalled-cycles-backend  # 0.08% backend cycles idle  ( +- 2.65% )  (35.73%)
   468,648,780      instructions            # 0.06 insn per cycle
                                            # 0.01 stalled cycles per insn  ( +- 0.10% )  (35.73%)
   116,267,370      branches                # 46.179 M/sec  ( +- 0.08% )  (35.73%)
       111,966      branch-misses           # 0.10% of all branches  ( +- 2.98% )  (35.72%)
 2,294,727,165      L1-dcache-loads         # 911.424 M/sec  ( +- 0.02% )  (35.71%)
 1,076,156,463      L1-dcache-load-misses   # 46.88% of all L1-dcache accesses  ( +- 0.01% )  (35.70%)
    26,093,151      L1-icache-loads         # 10.364 M/sec  ( +- 0.21% )  (35.71%)
       132,944      L1-icache-load-misses   # 0.51% of all L1-icache accesses  ( +- 0.55% )  (35.70%)
        30,925      dTLB-loads              # 12.283 K/sec  ( +- 5.70% )  (35.71%)
        27,437      dTLB-load-misses        # 86.22% of all dTLB cache accesses  ( +- 1.98% )  (35.70%)
             0      iTLB-loads              # 0.000 /sec  (35.71%)
            11      iTLB-load-misses        # 62.50% of all iTLB cache accesses  ( +-140.21% )  (35.70%)

       2.51890 +- 0.00433 seconds time elapsed  ( +- 0.17% )

Performance counter stats for 'numactl -m 0 -N 0 x86/map_hugetlb_1G' (10 runs):
      1,013.59 msec task-clock              # 1.001 CPUs utilized  ( +- 0.07% )
             2      context-switches        # 1.978 /sec  ( +- 12.91% )
             1      cpu-migrations          # 0.989 /sec
           213      page-faults             # 210.634 /sec  ( +- 0.17% )
 3,169,391,694      cycles                  # 3.134 GHz  ( +- 0.07% )  (35.53%)
       109,925      stalled-cycles-frontend # 0.00% frontend cycles idle  ( +- 5.56% )  (35.63%)
   950,638,913      stalled-cycles-backend  # 30.06% backend cycles idle  ( +- 5.06% )  (35.73%)
    51,189,571      instructions            # 0.02 insn per cycle
                                            # 21.03 stalled cycles per insn  ( +- 1.22% )  (35.82%)
     9,545,941      branches                # 9.440 M/sec  ( +- 1.50% )  (35.92%)
        86,836      branch-misses           # 0.88% of all branches  ( +- 3.74% )  (36.00%)
    46,109,587      L1-dcache-loads         # 45.597 M/sec  ( +- 3.92% )  (35.96%)
    13,796,172      L1-dcache-load-misses   # 41.77% of all L1-dcache accesses  ( +- 4.81% )  (35.85%)
     1,179,166      L1-icache-loads         # 1.166 M/sec  ( +- 1.22% )  (35.77%)
        21,528      L1-icache-load-misses   # 1.90% of all L1-icache accesses  ( +- 1.85% )  (35.66%)
        14,529      dTLB-loads              # 14.368 K/sec  ( +- 4.65% )  (35.57%)
         8,505      dTLB-load-misses        # 67.88% of all dTLB cache accesses  ( +- 5.61% )  (35.52%)
             0      iTLB-loads              # 0.000 /sec  (35.52%)
             8      iTLB-load-misses        # 0.00% of all iTLB cache accesses  ( +-267.99% )  (35.52%)

      1.012439 +- 0.000723 seconds time elapsed  ( +- 0.07% )

Please feel free to carry:
Tested-by: Raghavendra K T <raghavendra.kt@xxxxxxx>
for any minor changes.
Thanks and Regards
- Raghu