Re: [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2)

On Mon, Dec 9, 2024 at 3:42 AM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
>
> On 2024/12/9 5:34, Alexander Duyck wrote:
>
> ...
>
> >>
> >> Performance validation for part2:
> >> 1. Using micro-benchmark ko added in patch 1 to test aligned and
> >>    non-aligned API performance impact for the existing users, there
> >>    seems to be about 20% performance degradation for refactoring
> >>    page_frag to support the new API, which seems to nullify most of
> >>    the performance gain in [3] of part1.
> >
> > So if I am understanding correctly, this is showing a 20%
> > performance degradation with this patchset. I would argue that it is
> > significant enough to be a blocking factor for this patch set. I
> > would suggest bisecting the patch set to identify where the
> > performance degradation was introduced and seeing what we can do to
> > resolve it, and if nothing else, documenting it in that patch so we
> > can identify the root cause of the slowdown.
>
> The only patch in this patchset that affects the performance of the
> existing API seems to be patch 1; applying only patch 1 shows roughly
> the same ~20% performance degradation as applying the whole patchset does:
> mm: page_frag: some minor refactoring before adding new API
>
> And the cause seems to be the binary size increase shown below, as the
> performance degradation didn't seem to change much when I tried inlining
> __page_frag_cache_commit_noref() by moving it to the header file:
>
> ./scripts/bloat-o-meter vmlinux_orig vmlinux
> add/remove: 3/2 grow/shrink: 5/0 up/down: 920/-500 (420)
> Function                                     old     new   delta
> __page_frag_cache_prepare                      -     500    +500
> __napi_alloc_frag_align                       68     180    +112
> __netdev_alloc_skb                           488     596    +108
> napi_alloc_skb                               556     624     +68
> __netdev_alloc_frag_align                    196     252     +56
> svc_tcp_sendmsg                              340     376     +36
> __page_frag_cache_commit_noref                 -      32     +32
> e843419@09a6_0000bd47_30                       -       8      +8
> e843419@0369_000044ee_684                      8       -      -8
> __page_frag_alloc_align                      492       -    -492
> Total: Before=34719207, After=34719627, chg +0.00%
>
> ./scripts/bloat-o-meter page_frag_test_orig.ko page_frag_test.ko
> add/remove: 0/0 grow/shrink: 2/0 up/down: 78/0 (78)
> Function                                     old     new   delta
> page_frag_push_thread                        508     580     +72
> __UNIQUE_ID_vermagic367                       67      73      +6
> Total: Before=4582, After=4660, chg +1.70%

Other than code size, have you tried using perf to profile the
benchmark before and after? I suspect that would tell us which code
changes are most likely causing the issues. Overall I don't think the
size has increased all that much; I suspect most of this comes from
the fact that you are inlining more of the functionality.
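
Something along these lines might be enough to narrow it down (rough
sketch; run it once on the old kernel and once with just patch 1
applied, while the page_frag_test benchmark from patch 1 is running,
and adjust the duration to your setup):

  perf record -a -g -- sleep 10
  perf report --stdio
  # the previous profile is kept as perf.data.old, so the two runs
  # can be compared directly
  perf diff perf.data.old perf.data

perf annotate on the hot symbols (e.g. page_frag_push_thread) should
then show whether the extra cost is in the new prepare/commit split or
just in the code that ended up inlined into the callers.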

> Patch 1 is about refactoring the common code out of __page_frag_alloc_va_align()
> into __page_frag_cache_prepare() and __page_frag_cache_commit(), so that the
> new API can reuse it as much as possible.
>
> Is there a better way to reuse the common code while avoiding the
> performance degradation as much as possible?
>
> >
> >> 2. Using the below netcat test case, there seems to be some minor
> >>    performance gain from replacing 'page_frag' with 'page_frag_cache'
> >>    using the new page_frag API after this patchset.
> >>    server: taskset -c 32 nc -l -k 1234 > /dev/null
> >>    client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
> >
> > This test would barely touch the page pool. The fact is most of the
>
> I am guessing you meant page_frag here?
>
> > overhead for this would likely be things like TCP latency and data
> > copy, much more than the page allocation. As such, fluctuations here
> > are likely not related to your changes.
>
> But it does tell us that the replacement does not seem to cause an
> obvious regression, right?

Not really. The fragment allocator is such a small portion of this
test that we could probably double the cost for it and it would still
be negligible.

> I tried using a smaller MTU to amplify the impact of page allocation,
> and it seemed to give a similar result.

Not surprising. However, the network is likely only a small part of
this. I suspect that if you ran a profile here it would show the same
thing.
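
If you want to check, something like the following during the transfer
should make the breakdown obvious (rough sketch, system-wide sampling
while the nc client above is running):

  perf record -a -g -- sleep 30
  perf report --stdio | grep -B2 -A2 page_frag

I would expect the data copy and TCP paths to dwarf anything with
page_frag in its name, which is why this test can't really tell the
before and after cases apart.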




