On Tue, Dec 10, 2024 at 4:27 AM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote: > > On 2024/12/10 0:03, Alexander Duyck wrote: > > ... > > > > > Other than code size have you tried using perf to profile the > > benchmark before and after. I suspect that would be telling about > > which code changes are the most likely to be causing the issues. > > Overall I don't think the size has increased all that much. I suspect > > most of this is the fact that you are inlining more of the > > functionality. > > It seems the testing result is very sensitive to code changing and > reorganizing, as using the patch at the end to avoid the problem of > 'perf stat' not including data from the kernel thread seems to provide > more reasonable performance data. > > It seems the most obvious difference is 'insn per cycle' and I am not > sure how to interpret the difference of below data for the performance > degradation yet. > > With patch 1: > Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000': > > 5473.815250 task-clock (msec) # 0.984 CPUs utilized > 18 context-switches # 0.003 K/sec > 1 cpu-migrations # 0.000 K/sec > 122 page-faults # 0.022 K/sec > 14210894727 cycles # 2.596 GHz (92.78%) > 18903171767 instructions # 1.33 insn per cycle (92.82%) > 2997494420 branches # 547.606 M/sec (92.84%) > 7539978 branch-misses # 0.25% of all branches (92.84%) > 6291190031 L1-dcache-loads # 1149.325 M/sec (92.78%) > 29874701 L1-dcache-load-misses # 0.47% of all L1-dcache hits (92.82%) > 57979668 LLC-loads # 10.592 M/sec (92.79%) > 347822 LLC-load-misses # 0.01% of all LL-cache hits (92.90%) > 5946042629 L1-icache-loads # 1086.270 M/sec (92.91%) > 193877 L1-icache-load-misses (92.91%) > 6820220221 dTLB-loads # 1245.972 M/sec (92.91%) > 137999 dTLB-load-misses # 0.00% of all dTLB cache hits (92.91%) > 5947607438 iTLB-loads # 1086.556 M/sec (92.91%) > 210 iTLB-load-misses # 0.00% of all iTLB cache hits (85.66%) > <not supported> L1-dcache-prefetches > <not supported> L1-dcache-prefetch-misses > > 5.563068950 seconds time elapsed > > Without patch 1: > root@(none):/home# perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 > insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable > > Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000': > > 5306.644600 task-clock (msec) # 0.984 CPUs utilized > 15 context-switches # 0.003 K/sec > 1 cpu-migrations # 0.000 K/sec > 122 page-faults # 0.023 K/sec > 13776872322 cycles # 2.596 GHz (92.84%) > 13257649773 instructions # 0.96 insn per cycle (92.82%) > 2446901087 branches # 461.101 M/sec (92.91%) > 7172751 branch-misses # 0.29% of all branches (92.84%) > 5041456343 L1-dcache-loads # 950.027 M/sec (92.84%) > 38418414 L1-dcache-load-misses # 0.76% of all L1-dcache hits (92.76%) > 65486400 LLC-loads # 12.340 M/sec (92.82%) > 191497 LLC-load-misses # 0.01% of all LL-cache hits (92.79%) > 4906456833 L1-icache-loads # 924.587 M/sec (92.90%) > 175208 L1-icache-load-misses (92.91%) > 5539879607 dTLB-loads # 1043.952 M/sec (92.91%) > 140166 dTLB-load-misses # 0.00% of all dTLB cache hits (92.91%) > 4906685698 iTLB-loads # 924.631 M/sec (92.91%) > 170 iTLB-load-misses # 0.00% of all iTLB cache hits (85.66%) > <not supported> L1-dcache-prefetches > <not supported> L1-dcache-prefetch-misses > > 5.395104330 seconds time elapsed > > > Below is perf data for aligned API without patch 1, as above non-aligned > API also use test_alloc_len as 12, theoretically the performance data > should not be better than the non-aligned API as the aligned API will do > the aligning of fragsz basing on SMP_CACHE_BYTES, but the testing seems > to show otherwise and I am not sure how to interpret that too: > perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1 > insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable > > Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1': > > 2447.553100 task-clock (msec) # 0.965 CPUs utilized > 9 context-switches # 0.004 K/sec > 1 cpu-migrations # 0.000 K/sec > 122 page-faults # 0.050 K/sec > 6354149177 cycles # 2.596 GHz (92.81%) > 6467793726 instructions # 1.02 insn per cycle (92.76%) > 1120749183 branches # 457.906 M/sec (92.81%) > 7370402 branch-misses # 0.66% of all branches (92.81%) > 2847963759 L1-dcache-loads # 1163.596 M/sec (92.76%) > 39439592 L1-dcache-load-misses # 1.38% of all L1-dcache hits (92.77%) > 42553468 LLC-loads # 17.386 M/sec (92.71%) > 95960 LLC-load-misses # 0.01% of all LL-cache hits (92.94%) > 2554887203 L1-icache-loads # 1043.854 M/sec (92.97%) > 118902 L1-icache-load-misses (92.97%) > 3365755289 dTLB-loads # 1375.151 M/sec (92.97%) > 81401 dTLB-load-misses # 0.00% of all dTLB cache hits (92.97%) > 2554882937 iTLB-loads # 1043.852 M/sec (92.97%) > 159 iTLB-load-misses # 0.00% of all iTLB cache hits (85.58%) > <not supported> L1-dcache-prefetches > <not supported> L1-dcache-prefetch-misses > > 2.535085780 seconds time elapsed I'm not sure perf stat will tell us much as it is really too high level to give us much in the way of details. I would be more interested in the output from perf record -g followed by a perf report, or maybe even just a snapshot from perf top while the test is running. That should show us where the CPU is spending most of its time and what areas are hot in the before and after graphs.