On Mon, Oct 28, 2024 at 5:00 AM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
>
> This is part 1 of "Replace page_frag with page_frag_cache",
> which mainly contains refactoring and optimization of the
> page_frag API implementation before the replacement.
>
> As discussed in [1], it is better to target the net-next tree
> to get more testing, since all the callers of the page_frag API
> are in networking, and the chance of conflicting with the MM
> tree seems low as the page_frag API implementation is quite
> self-contained.
>
> After [2], there are still two page frag implementations:
>
> 1. mm/page_alloc.c: the net stack uses it on the rx side with
>    'struct page_frag_cache', the main API being
>    page_frag_alloc_align().
> 2. net/core/sock.c: the net stack uses it on the tx side with
>    'struct page_frag', the main API being
>    skb_page_frag_refill().
>
> This patchset tries to unify the page frag implementation by
> replacing page_frag with page_frag_cache for sk_page_frag()
> first. net_high_order_alloc_disable_key for the implementation
> in net/core/sock.c doesn't seem to matter that much now, as pcp
> also supports high-order pages:
> commit 44042b449872 ("mm/page_alloc: allow high-order pages to
> be stored on the per-cpu lists")
>
> As the related changes are mostly networking-related, this
> targets net-next. A follow-up patchset will try to replace the
> remaining page_frag users.
>
> After this patchset:
> 1. The page frag implementation is unified by taking the best of
>    the two existing implementations: we save some space for the
>    'page_frag_cache' API users and avoid 'get_page()' for the
>    old 'page_frag' API users.
> 2. Future bugfixes and performance work can be done in one
>    place, improving the maintainability of page_frag's
>    implementation.
>
> Kernel image size change:
> Linux kernel total | text      data      bss
> ------------------------------------------------------
> after  45250307    | 27274279  17209996  766032
> before 45254134    | 27278118  17209984  766032
> delta  -3827       | -3839     +12       +0
>
> Performance validation:
> 1. Using the micro-benchmark module added in patch 1 to test the
>    aligned and non-aligned API performance impact for the
>    existing users, there is no noticeable performance
>    degradation. Instead there seems to be a major performance
>    boost for both the aligned and non-aligned API after
>    switching to ptr_ring for testing, about 200% and 10%
>    improvement respectively on an arm64 server, as shown below.
>
> 2. Using the netcat test case below, there is also a minor
>    performance boost from replacing 'page_frag' with
>    'page_frag_cache' after this patchset.
> server: taskset -c 32 nc -l -k 1234 > /dev/null
> client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
>
> In order to avoid performance noise as much as possible, the
> testing is done on a system without any other load, with enough
> iterations to show that the data is stable. The complete testing
> log is below:
>
> perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000
> perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1
> taskset -c 32 nc -l -k 1234 > /dev/null
> perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
>
> *After* this patchset:
>
> Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):
>
>          17.758393      task-clock (msec)         #    0.004 CPUs utilized            ( +-  0.51% )
>                  5      context-switches          #    0.293 K/sec                    ( +-  0.65% )
>                  0      cpu-migrations            #    0.008 K/sec                    ( +- 17.21% )
>                 74      page-faults               #    0.004 M/sec                    ( +-  0.12% )
>           46128650      cycles                    #    2.598 GHz                      ( +-  0.51% )
>           60810511      instructions              #    1.32  insn per cycle           ( +-  0.04% )
>           14764914      branches                  #  831.433 M/sec                    ( +-  0.04% )
>              19281      branch-misses             #    0.13% of all branches          ( +-  0.13% )
>
>        4.240273854 seconds time elapsed                                          ( +-  0.13% )
>
> Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):
>
>          17.348690      task-clock (msec)         #    0.019 CPUs utilized            ( +-  0.66% )
>                  5      context-switches          #    0.310 K/sec                    ( +-  0.84% )
>                  0      cpu-migrations            #    0.009 K/sec                    ( +- 16.55% )
>                 74      page-faults               #    0.004 M/sec                    ( +-  0.11% )
>           45065287      cycles                    #    2.598 GHz                      ( +-  0.66% )
>           60755389      instructions              #    1.35  insn per cycle           ( +-  0.05% )
>           14747865      branches                  #  850.085 M/sec                    ( +-  0.05% )
>              19272      branch-misses             #    0.13% of all branches          ( +-  0.13% )
>
>        0.935251375 seconds time elapsed                                          ( +-  0.07% )
>
> Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):
>
>       16626.042731      task-clock (msec)         #    0.607 CPUs utilized            ( +-  0.03% )
>            3291020      context-switches          #    0.198 M/sec                    ( +-  0.05% )
>                  1      cpu-migrations            #    0.000 K/sec                    ( +-  0.50% )
>                 85      page-faults               #    0.005 K/sec                    ( +-  0.16% )
>        30581044838      cycles                    #    1.839 GHz                      ( +-  0.05% )
>        34962744631      instructions              #    1.14  insn per cycle           ( +-  0.01% )
>         6483883671      branches                  #  389.984 M/sec                    ( +-  0.02% )
>           99624551      branch-misses             #    1.54% of all branches          ( +-  0.17% )
>
>       27.370305077 seconds time elapsed                                          ( +-  0.01% )
>
>
> *Before* this patchset:
>
> Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):
>
>          21.587934      task-clock (msec)         #    0.005 CPUs utilized            ( +-  0.72% )
>                  6      context-switches          #    0.281 K/sec                    ( +-  0.28% )
>                  1      cpu-migrations            #    0.047 K/sec                    ( +-  0.50% )
>                 73      page-faults               #    0.003 M/sec                    ( +-  0.12% )
>           56080697      cycles                    #    2.598 GHz                      ( +-  0.72% )
>           61605150      instructions              #    1.10  insn per cycle           ( +-  0.05% )
>           14950196      branches                  #  692.526 M/sec                    ( +-  0.05% )
>              19410      branch-misses             #    0.13% of all branches          ( +-  0.18% )
>
>        4.603530546 seconds time elapsed                                          ( +-  0.11% )
>
> Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):
>
>          20.988297      task-clock (msec)         #    0.006 CPUs utilized            ( +-  0.81% )
>                  7      context-switches          #    0.316 K/sec                    ( +-  0.54% )
>                  1      cpu-migrations            #    0.048 K/sec                    ( +-  0.70% )
>                 73      page-faults               #    0.003 M/sec                    ( +-  0.11% )
>           54512166      cycles                    #    2.597 GHz                      ( +-  0.81% )
>           61440941      instructions              #    1.13  insn per cycle           ( +-  0.08% )
>           14906043      branches                  #  710.207 M/sec                    ( +-  0.08% )
>              19927      branch-misses             #    0.13% of all branches          ( +-  0.17% )
>
>        3.438041238 seconds time elapsed                                          ( +-  1.11% )
>
> Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):
>
>       17364.040855      task-clock (msec)         #    0.624 CPUs utilized            ( +-  0.02% )
>            3340375      context-switches          #    0.192 M/sec                    ( +-  0.06% )
>                  1      cpu-migrations            #    0.000 K/sec
>                 85      page-faults               #    0.005 K/sec                    ( +-  0.15% )
>        32077623335      cycles                    #    1.847 GHz                      ( +-  0.03% )
>        35121047596      instructions              #    1.09  insn per cycle           ( +-  0.01% )
>         6519872824      branches                  #  375.481 M/sec                    ( +-  0.02% )
>          101877022      branch-misses             #    1.56% of all branches          ( +-  0.14% )
>
>       27.842745343 seconds time elapsed                                          ( +-  0.02% )

Are these actually the numbers for this patch set? It looks like you have been using the same numbers for the last several releases. I can understand the "before" numbers being mostly the same, but since we have factored out the refactoring portion, the "after" numbers should have deviated from the previous patch set; I find it highly unlikely they are identical down to the nanosecond.

Also, it wouldn't hurt to have an explanation for the 3.4 -> 0.9 second change in the aligned test, as the samples don't seem to match up with the elapsed-time data.
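
As an aside, for anyone following along, here is a rough, untested sketch of the two call patterns being unified. The helpers test_rx_alloc() and test_tx_fill() are made up for illustration, and the exact signatures can differ between kernel versions:

#include <linux/gfp.h>
#include <net/sock.h>

/* Rx-side pattern today (mm/page_alloc.c implementation): carve a
 * fragment out of a per-caller 'struct page_frag_cache'.
 */
static void *test_rx_alloc(struct page_frag_cache *nc, unsigned int sz)
{
	/* the aligned variant, page_frag_alloc_align(), additionally
	 * takes an alignment argument
	 */
	return page_frag_alloc(nc, sz, GFP_ATOMIC);
}

/* Tx-side pattern today (net/core/sock.c implementation): refill the
 * socket's 'struct page_frag' and advance the offset after copying.
 */
static int test_tx_fill(struct sock *sk, unsigned int sz)
{
	struct page_frag *pfrag = sk_page_frag(sk);

	if (!skb_page_frag_refill(sz, pfrag, sk->sk_allocation))
		return -ENOMEM;

	/* the caller copies sz bytes into pfrag->page at pfrag->offset,
	 * taking its own page reference as needed; that extra reference
	 * is the get_page() the cover letter says the series avoids
	 */
	pfrag->offset += sz;
	return 0;
}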