Re: [RFC v11 00/14] Replace page_frag with page_frag_cache for sk_page_frag()

Alexander Duyck <alexander.duyck@xxxxxxxxx> · Sun, 21 Jul 2024 16:49:02 -0700

On Fri, Jul 19, 2024 at 2:36 AM Yunsheng Lin <linyunsheng@xxxxxxxxxx> wrote:
>
> After [1], there are still two implementations for page frag:
>
> 1. mm/page_alloc.c: net stack seems to be using it in the
>    rx part with 'struct page_frag_cache' and the main API
>    being page_frag_alloc_align().
> 2. net/core/sock.c: net stack seems to be using it in the
>    tx part with 'struct page_frag' and the main API being
>    skb_page_frag_refill().
>
> This patchset tries to unfiy the page frag implementation
> by replacing page_frag with page_frag_cache for sk_page_frag()
> first. net_high_order_alloc_disable_key for the implementation
> in net/core/sock.c doesn't seems matter that much now as pcp
> is also supported for high-order pages:
> commit 44042b449872 ("mm/page_alloc: allow high-order pages to
> be stored on the per-cpu lists")
>
> As the related change is mostly related to networking, so
> targeting the net-next. And will try to replace the rest
> of page_frag in the follow patchset.

So in reality I would say something like the first 4 patches are
probably more applicable to mm than they are to the net-next tree.
Especially given that we are having to deal with the mm_task_types.h
in order to sort out the include order issues.

Given that I think it might make more sense to look at breaking this
into 2 or more patch sets with the first being more mm focused since
the test module and pulling the code out of page_alloc.c, gfp.h, and
mm_types.h would be pretty impactful on mm more than it is on the
networking stack. After those changes then I would agree that we are
mostly just impacting the network stack.

> After this patchset:
> 1. Unify the page frag implementation by taking the best out of
>    two the existing implementations: we are able to save some space
>    for the 'page_frag_cache' API user, and avoid 'get_page()' for
>    the old 'page_frag' API user.
> 2. Future bugfix and performance can be done in one place, hence
>    improving maintainability of page_frag's implementation.
>
> Kernel Image changing:
>     Linux Kernel   total |      text      data        bss
>     ------------------------------------------------------
>     after     45250307 |   27274279   17209996     766032
>     before    45254134 |   27278118   17209984     766032
>     delta        -3827 |      -3839        +12         +0

It might be interesting if we can track some of this on a per patch
basis for the signficant patches. For example I would be curious how
much of this is just from patch 9 that replaces alloc_pages_node with
__alloc_pages.

> Performance validation:
> 1. Using micro-benchmark ko added in patch 1 to test aligned and
>    non-aligned API performance impact for the existing users, there
>    is no notiable performance degradation. Instead we seems to some
>    minor performance boot for both aligned and non-aligned API after
>    this patchset as below.
>
> 2. Use the below netcat test case, we also have some minor
>    performance boot for repalcing 'page_frag' with 'page_frag_cache'
>    after this patchset.
>    server: taskset -c 32 nc -l -k 1234 > /dev/null
>    client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
>
> In order to avoid performance noise as much as possible, the testing
> is done in system without any other laod and have enough iterations to
> prove the data is stable enogh, complete log for testing is below:

Please go through and run a spell check over your patches. I have seen
multiple spelling errors in a few of the patch descriptions as well.

> taskset -c 32 nc -l -k 1234 > /dev/null
> perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17
> perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_align=1
> perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
>
> *After* this patchset:
>
>  Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17' (200 runs):
...
>       23.856064365 seconds time elapsed                                          ( +-  0.08% )
>
>  Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_align=1' (200 runs):
...
>       23.443927103 seconds time elapsed                                          ( +-  0.05% )
>
>  Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):
...
>       27.370305077 seconds time elapsed                                          ( +-  0.01% )
>
>
> *Before* this patchset:
>
> Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17' (200 runs):
...
>       23.918006193 seconds time elapsed                                          ( +-  0.10% )
>
>  Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_align=1' (200 runs):
...
>       23.876306259 seconds time elapsed                                          ( +-  0.13% )
>
>  Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):
...
>       27.842745343 seconds time elapsed                                          ( +-  0.02% )

I'm not seeing any sort of really significant improvement here. If I
am understanding correctly the change amounts to:
23.9 -> 23.9 (-0.26%)
23.9 -> 23.4 (-1.8%)
27.8 -> 27.4 (-1.6%)

The trade off is that we are adding a bunch more complexity to the
code, and refactoring it for reuse, but somehow in the process this
patch set increases the total code footprint in terms of total lines
of code and will make maintenance that much harder as the code seems
much more tangled up and brittle than it was previously.

I think we need to think about splitting this patch set into at least
2. Basically an mm side of the patch set to move the code and do most
of your refactoring work. The second half of the set being to add the
refactoring of the network side to reuse the existing code. I think it
will make it easier to review and might get you more engagement as I
am pretty sure several of these patches likely wouldn't fly with some
of the maintainers.

So specifically I would like to see patches 1 (refactored as
selftest), 2, 3, 5, 7, 8, 13 (current APIs), and 14 done as more of an
mm focused set since many of the issues you seem to have are problems
building due to mm build issues, dependencies, and the like. That is
the foundation for this patch set and it seems like we keep seeing
issues there so that needs to be solid before we can do the new API
work. If focused on mm you might get more eyes on it as not many
networking folks are that familiar with the memory management side of
things.

As for the other patches, specifically 10, 11, 12, and 13 (prepare,
probe, commit API), they could then be spun up as a netdev centered
set. I took a brief look at them but they need some serious refactor
as I think they are providing page as a return value in several cases
where they don't need to.

In my opinion with a small bit of refactoring patch 4 can just be
dropped. I don't think the renaming is necessary and it just adds
noise to the commit logs for the impacted drivers. It will require
tweaks to the other patches but I think it will be better that way in
the long run.

Looking at patch 6 I am left scratching my head and wondering if you
have another build issue of some sort that you haven't mentioned. I
really don't think it belongs in this patch set and should probably be
a fix on its own if you have some reason to justify it. Otherwise you
might also just look at refactoring it to take
"__builtin_constant_p(size)" into account by copying/pasting the first
bits from the generic version into the function since I am assuming
there is a performance benefit to doing it in assembler. It should be
a net win if you just add the accounting for constants.

Patch 9 could probably be a standalone patch or included in the more
mm centered set. However it would need to be redone to fix the
underlying issue rather than working around it by changing the
function called rather than fixing the function. No point in improving
it for one case when you can cover multiple cases with a single
change.