On Mon, Oct 28, 2024 at 5:00 AM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
>
> This is part 1 of "Replace page_frag with page_frag_cache",
> which mainly contains refactoring and optimization of the
> page_frag API implementation before the replacement.
>
> As discussed in [1], it is better to target the net-next tree
> to get more testing, since all the callers of the page_frag API
> are in networking, and the chance of conflicting with the MM
> tree seems low as the page_frag API implementation is quite
> self-contained.
>
> After [2], there are still two page frag implementations:
>
> 1. mm/page_alloc.c: the net stack uses it on the rx side with
>    'struct page_frag_cache', the main API being
>    page_frag_alloc_align().
> 2. net/core/sock.c: the net stack uses it on the tx side with
>    'struct page_frag', the main API being
>    skb_page_frag_refill().
>
> This patchset tries to unify the page frag implementation by
> replacing page_frag with page_frag_cache for sk_page_frag()
> first. net_high_order_alloc_disable_key for the implementation
> in net/core/sock.c doesn't seem to matter that much now, as pcp
> also supports high-order pages:
> commit 44042b449872 ("mm/page_alloc: allow high-order pages to
> be stored on the per-cpu lists")
>
> As the related changes are mostly networking-related, this
> targets net-next. A follow-up patchset will try to replace the
> remaining page_frag users.
>
> After this patchset:
> 1. The page frag implementation is unified by taking the best of
>    the two existing implementations: we save some space for the
>    'page_frag_cache' API users and avoid 'get_page()' for the
>    old 'page_frag' API users.
> 2. Future bugfixes and performance work can be done in one
>    place, improving the maintainability of page_frag's
>    implementation.
>
> Kernel image size change:
> Linux kernel total | text      data      bss
> ------------------------------------------------------
> after  45250307    | 27274279  17209996  766032
> before 45254134    | 27278118  17209984  766032
> delta  -3827       | -3839     +12       +0
>
> Performance validation:
> 1. Using the micro-benchmark module added in patch 1 to test the
>    aligned and non-aligned API performance impact for the
>    existing users, there is no noticeable performance
>    degradation. Instead there seems to be a major performance
>    boost for both the aligned and non-aligned API after
>    switching to ptr_ring for testing, about 200% and 10%
>    improvement respectively on an arm64 server, as shown below.
>
> 2. Using the netcat test case below, there is also a minor
>    performance boost from replacing 'page_frag' with
>    'page_frag_cache' after this patchset.
> server: taskset -c 32 nc -l -k 1234 > /dev/null
> client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
>
> In order to avoid performance noise as much as possible, the
> testing is done on a system without any other load, with enough
> iterations to show that the data is stable. The complete testing
> log is below:
>
> perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000
> perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1
> taskset -c 32 nc -l -k 1234 > /dev/null
> perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
>
> *After* this patchset:
>
> Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):
>
>          17.758393      task-clock (msec)         #    0.004 CPUs utilized            ( +-  0.51% )
>                  5      context-switches          #    0.293 K/sec                    ( +-  0.65% )
>                  0      cpu-migrations            #    0.008 K/sec                    ( +- 17.21% )
>                 74      page-faults               #    0.004 M/sec                    ( +-  0.12% )
>           46128650      cycles                    #    2.598 GHz                      ( +-  0.51% )
>           60810511      instructions              #    1.32  insn per cycle           ( +-  0.04% )
>           14764914      branches                  #  831.433 M/sec                    ( +-  0.04% )
>              19281      branch-misses             #    0.13% of all branches          ( +-  0.13% )
>
>        4.240273854 seconds time elapsed                                          ( +-  0.13% )
>
> Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):
>
>          17.348690      task-clock (msec)         #    0.019 CPUs utilized            ( +-  0.66% )
>                  5      context-switches          #    0.310 K/sec                    ( +-  0.84% )
>                  0      cpu-migrations            #    0.009 K/sec                    ( +- 16.55% )
>                 74      page-faults               #    0.004 M/sec                    ( +-  0.11% )
>           45065287      cycles                    #    2.598 GHz                      ( +-  0.66% )
>           60755389      instructions              #    1.35  insn per cycle           ( +-  0.05% )
>           14747865      branches                  #  850.085 M/sec                    ( +-  0.05% )
>              19272      branch-misses             #    0.13% of all branches          ( +-  0.13% )
>
>        0.935251375 seconds time elapsed                                          ( +-  0.07% )
>
> Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):
>
>       16626.042731      task-clock (msec)         #    0.607 CPUs utilized            ( +-  0.03% )
>            3291020      context-switches          #    0.198 M/sec                    ( +-  0.05% )
>                  1      cpu-migrations            #    0.000 K/sec                    ( +-  0.50% )
>                 85      page-faults               #    0.005 K/sec                    ( +-  0.16% )
>        30581044838      cycles                    #    1.839 GHz                      ( +-  0.05% )
>        34962744631      instructions              #    1.14  insn per cycle           ( +-  0.01% )
>         6483883671      branches                  #  389.984 M/sec                    ( +-  0.02% )
>           99624551      branch-misses             #    1.54% of all branches          ( +-  0.17% )
>
>       27.370305077 seconds time elapsed                                          ( +-  0.01% )
>
>
> *Before* this patchset:
>
> Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):
>
>          21.587934      task-clock (msec)         #    0.005 CPUs utilized            ( +-  0.72% )
>                  6      context-switches          #    0.281 K/sec                    ( +-  0.28% )
>                  1      cpu-migrations            #    0.047 K/sec                    ( +-  0.50% )
>                 73      page-faults               #    0.003 M/sec                    ( +-  0.12% )
>           56080697      cycles                    #    2.598 GHz                      ( +-  0.72% )
>           61605150      instructions              #    1.10  insn per cycle           ( +-  0.05% )
>           14950196      branches                  #  692.526 M/sec                    ( +-  0.05% )
>              19410      branch-misses             #    0.13% of all branches          ( +-  0.18% )
>
>        4.603530546 seconds time elapsed                                          ( +-  0.11% )
>
> Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):
>
>          20.988297      task-clock (msec)         #    0.006 CPUs utilized            ( +-  0.81% )
>                  7      context-switches          #    0.316 K/sec                    ( +-  0.54% )
>                  1      cpu-migrations            #    0.048 K/sec                    ( +-  0.70% )
>                 73      page-faults               #    0.003 M/sec                    ( +-  0.11% )
>           54512166      cycles                    #    2.597 GHz                      ( +-  0.81% )
>           61440941      instructions              #    1.13  insn per cycle           ( +-  0.08% )
>           14906043      branches                  #  710.207 M/sec                    ( +-  0.08% )
>              19927      branch-misses             #    0.13% of all branches          ( +-  0.17% )
>
>        3.438041238 seconds time elapsed                                          ( +-  1.11% )
>
> Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):
>
>       17364.040855      task-clock (msec)         #    0.624 CPUs utilized            ( +-  0.02% )
>            3340375      context-switches          #    0.192 M/sec                    ( +-  0.06% )
>                  1      cpu-migrations            #    0.000 K/sec
>                 85      page-faults               #    0.005 K/sec                    ( +-  0.15% )
>        32077623335      cycles                    #    1.847 GHz                      ( +-  0.03% )
>        35121047596      instructions              #    1.09  insn per cycle           ( +-  0.01% )
>         6519872824      branches                  #  375.481 M/sec                    ( +-  0.02% )
>          101877022      branch-misses             #    1.56% of all branches          ( +-  0.14% )
>
>       27.842745343 seconds time elapsed                                          ( +-  0.02% )

Are these actually the numbers for this patch set? It looks like you have been using the same numbers for the last several releases. I can understand the "before" numbers being mostly the same, but since we have factored out the refactoring portion, the "after" numbers should have deviated from the previous patch set; I find it highly unlikely they are identical down to the nanosecond.

Also, it wouldn't hurt to have an explanation for the 3.4 -> 0.9 second change in the aligned test, as the samples don't seem to match up with the elapsed-time data.
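
As an aside, for anyone following along, here is a rough, untested sketch of the two call patterns being unified. The helpers test_rx_alloc() and test_tx_fill() are made up for illustration, and the exact signatures can differ between kernel versions:

#include <linux/gfp.h>
#include <net/sock.h>

/* Rx-side pattern today (mm/page_alloc.c implementation): carve a
 * fragment out of a per-caller 'struct page_frag_cache'.
 */
static void *test_rx_alloc(struct page_frag_cache *nc, unsigned int sz)
{
	/* the aligned variant, page_frag_alloc_align(), additionally
	 * takes an alignment argument
	 */
	return page_frag_alloc(nc, sz, GFP_ATOMIC);
}

/* Tx-side pattern today (net/core/sock.c implementation): refill the
 * socket's 'struct page_frag' and advance the offset after copying.
 */
static int test_tx_fill(struct sock *sk, unsigned int sz)
{
	struct page_frag *pfrag = sk_page_frag(sk);

	if (!skb_page_frag_refill(sz, pfrag, sk->sk_allocation))
		return -ENOMEM;

	/* the caller copies sz bytes into pfrag->page at pfrag->offset,
	 * taking its own page reference as needed; that extra reference
	 * is the get_page() the cover letter says the series avoids
	 */
	pfrag->offset += sz;
	return 0;
}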