Hi Olek,

Here are the results.

On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> 
> 
> On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> > From: Alexander Lobakin <aleksander.lobakin@xxxxxxxxx>
> > Date: Tue, 22 Oct 2024 17:51:43 +0200
> > 
> >> From: Alexander Lobakin <aleksander.lobakin@xxxxxxxxx>
> >> Date: Wed, 9 Oct 2024 14:50:42 +0200
> >> 
> >>> From: Lorenzo Bianconi <lorenzo@xxxxxxxxxx>
> >>> Date: Wed, 9 Oct 2024 14:47:58 +0200
> >>> 
> >>>>> From: Lorenzo Bianconi <lorenzo@xxxxxxxxxx>
> >>>>> Date: Wed, 9 Oct 2024 12:46:00 +0200
> >>>>> 
> >>>>>>> Hi Lorenzo,
> >>>>>>> 
> >>>>>>> On Mon, Sep 16, 2024 at 12:13:42PM GMT, Lorenzo Bianconi wrote:
> >>>>>>>> Add GRO support to cpumap codebase moving the cpu_map_entry kthread
> >>>>>>>> to a NAPI-kthread pinned on the selected cpu.
> >>>>>>>> 
> >>>>>>>> Changes in rfc v2:
> >>>>>>>> - get rid of dummy netdev dependency
> >>>>>>>> 
> >>>>>>>> Lorenzo Bianconi (3):
> >>>>>>>>   net: Add napi_init_for_gro routine
> >>>>>>>>   net: add napi_threaded_poll to netdevice.h
> >>>>>>>>   bpf: cpumap: Add gro support
> >>>>>>>> 
> >>>>>>>>  include/linux/netdevice.h |   3 +
> >>>>>>>>  kernel/bpf/cpumap.c       | 123 ++++++++++++++++----------------
> >>>>>>>>  net/core/dev.c            |  27 ++++++---
> >>>>>>>>  3 files changed, 73 insertions(+), 80 deletions(-)
> >>>>>>>> 
> >>>>>>>> -- 
> >>>>>>>> 2.46.0
> >>>>>>>> 
> >>>>>>> 
> >>>>>>> Sorry about the long delay - finally caught up to everything after
> >>>>>>> conferences.
> >>>>>>> 
> >>>>>>> I re-ran my synthetic tests (including baseline). v2 is somehow showing
> >>>>>>> 2x bigger gains than v1 (~30% vs ~14%) for tcp_stream. Again, the only
> >>>>>>> variable I changed is kernel version - steering prog is active for both.
> >>>>>>> 
> >>>>>>> 
> >>>>>>> Baseline (again)
> >>>>>>> 
> >>>>>>> ./tcp_rr -c -H $TASK_IP -p 50,90,99 -T4 -F8 -l30
> >>>>>>> ./tcp_stream -c -H $TASK_IP -T8 -F16 -l30
> >>>>>>> 
> >>>>>>>          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)           Throughput (Mbit/s)
> >>>>>>> Run 1    2560252       0.00009087       0.00010495       0.00011647       Run 1    15479.31
> >>>>>>> Run 2    2665517       0.00008575       0.00010239       0.00013311       Run 2    15162.48
> >>>>>>> Run 3    2755939       0.00008191       0.00010367       0.00012287       Run 3    14709.04
> >>>>>>> Run 4    2595680       0.00008575       0.00011263       0.00012671       Run 4    15373.06
> >>>>>>> Run 5    2841865       0.00007999       0.00009471       0.00012799       Run 5    15234.91
> >>>>>>> Average  2683850.6     0.000084854      0.00010367       0.00012543       Average  15191.76
> >>>>>>> 
> >>>>>>> cpumap NAPI patches v2
> >>>>>>> 
> >>>>>>>          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)           Throughput (Mbit/s)
> >>>>>>> Run 1    2577838       0.00008575       0.00012031       0.00013695       Run 1    19914.56
> >>>>>>> Run 2    2729237       0.00007551       0.00013311       0.00017663       Run 2    20140.92
> >>>>>>> Run 3    2689442       0.00008319       0.00010495       0.00013311       Run 3    19887.48
> >>>>>>> Run 4    2862366       0.00008127       0.00009471       0.00010623       Run 4    19374.49
> >>>>>>> Run 5    2700538       0.00008319       0.00010367       0.00012799       Run 5    19784.49
> >>>>>>> Average  2711884.2     0.000081782      0.00011135       0.000136182      Average  19820.388
> >>>>>>> Delta    1.04%         -3.62%           7.41%            8.57%                     30.47%
> >>>>>>> 
> >>>>>>> Thanks,
> >>>>>>> Daniel
> >>>>>> 
> >>>>>> Hi Daniel,
> >>>>>> 
> >>>>>> cool, thx for testing it.
> >>>>>> 
> >>>>>> @Olek: how do we want to proceed on it? Are you still working on it or
> >>>>>> do you want me to send a regular patch for it?
> >>>>> 
> >>>>> Hi,
> >>>>> 
> >>>>> I had a small vacation, sorry. I'm starting working on it again today.
> >>>> 
> >>>> ack, no worries. Are you going to rebase the other patches on top of it
> >>>> or are you going to try a different approach?
> >>> 
> >>> I'll try the approach without NAPI as Kuba asks and let Daniel test it,
> >>> then we'll see.
> >> 
> >> For now, I have the same results without NAPI as with your series, so
> >> I'll push it soon and let Daniel test.
> >> 
> >> (I simply decoupled GRO and NAPI and used the former in cpumap, but the
> >> kthread logic didn't change)
> >> 
> >>> 
> >>> BTW I'm curious how he got this boost on v2, from what I see you didn't
> >>> change the implementation that much?
> > 
> > Hi Daniel,
> > 
> > Sorry for the delay. Please test [0].
> > 
> > [0] https://github.com/alobakin/linux/commits/cpumap-old
> > 
> > Thanks,
> > Olek
> 
> Ack. Will do probably early next week.
> 

Baseline (again)

         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)           Throughput (Mbit/s)
Run 1    3169917       0.00007295       0.00007871       0.00009343       Run 1    21749.43
Run 2    3228290       0.00007103       0.00007679       0.00009215       Run 2    21897.17
Run 3    3226746       0.00007231       0.00007871       0.00009087       Run 3    21906.82
Run 4    3191258       0.00007231       0.00007743       0.00009087       Run 4    21155.15
Run 5    3235653       0.00007231       0.00007743       0.00008703       Run 5    21397.06
Average  3210372.8     0.000072182      0.000077814      0.00009087       Average  21621.126

cpumap v2 Olek

         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)           Throughput (Mbit/s)
Run 1    3253651       0.00007167       0.00007807       0.00009343       Run 1    13497.57
Run 2    3221492       0.00007231       0.00007743       0.00009087       Run 2    12115.53
Run 3    3296453       0.00007039       0.00007807       0.00009087       Run 3    12323.38
Run 4    3254460       0.00007167       0.00007807       0.00009087       Run 4    12901.88
Run 5    3173327       0.00007295       0.00007871       0.00009215       Run 5    12593.22
Average  3239876.6     0.000071798      0.00007807       0.000091638      Average  12686.316
Delta    0.92%         -0.53%           0.33%            0.85%                     -41.32%

It's very interesting that we see -40% tput w/ the patches. I went back
and double checked and it seems the numbers are right.

Here's some output from some profiles I took with:

perf record -e cycles:k -a -- sleep 10
perf --no-pager diff perf.data.baseline perf.data.withpatches > ...
# Event 'cycles:k'
#
# Baseline  Delta Abs  Shared Object                       Symbol
     6.13%     -3.60%  [kernel.kallsyms]                   [k] _copy_to_iter
     3.57%     -2.56%  bpf_prog_954ab9c8c8b5e42f_latency   [k] bpf_prog_954ab9c8c8b5e42f_latency
               +2.22%  bpf_prog_5c74b34eb24d5c9b_steering  [k] bpf_prog_5c74b34eb24d5c9b_steering
     2.61%     -1.88%  [kernel.kallsyms]                   [k] __skb_datagram_iter
     0.55%     +1.53%  [kernel.kallsyms]                   [k] acpi_processor_ffh_cstate_enter
     4.52%     -1.46%  [kernel.kallsyms]                   [k] read_tsc
     0.34%     +1.42%  [kernel.kallsyms]                   [k] __slab_free
     0.97%     +1.18%  [kernel.kallsyms]                   [k] do_idle
     1.35%     +1.17%  [kernel.kallsyms]                   [k] cpuidle_enter_state
     1.89%     -1.15%  [kernel.kallsyms]                   [k] tcp_ack
     2.08%     +1.14%  [kernel.kallsyms]                   [k] _raw_spin_lock
               +1.13%  <redacted>
     0.22%     +1.02%  [kernel.kallsyms]                   [k] __sock_wfree
     2.23%     -1.02%  [kernel.kallsyms]                   [k] bpf_dynptr_slice
     0.00%     +0.98%  [kernel.kallsyms]                   [k] tcp6_gro_receive
     2.91%     -0.98%  [kernel.kallsyms]                   [k] csum_partial
     0.62%     +0.94%  [kernel.kallsyms]                   [k] skb_release_data
               +0.81%  [kernel.kallsyms]                   [k] memset
     0.16%     +0.74%  [kernel.kallsyms]                   [k] bnxt_tx_int
     0.00%     +0.74%  [kernel.kallsyms]                   [k] dev_gro_receive
     0.36%     +0.74%  [kernel.kallsyms]                   [k] __tcp_transmit_skb
               +0.72%  [kernel.kallsyms]                   [k] tcp_gro_receive
     1.10%     -0.66%  [kernel.kallsyms]                   [k] ep_poll_callback
     1.52%     -0.65%  [kernel.kallsyms]                   [k] page_pool_put_unrefed_netmem
     0.75%     -0.57%  [kernel.kallsyms]                   [k] bnxt_rx_pkt
     1.10%     +0.56%  [kernel.kallsyms]                   [k] native_sched_clock
     0.16%     +0.53%  <redacted>
     0.83%     -0.53%  [kernel.kallsyms]                   [k] skb_try_coalesce
     0.60%     +0.53%  [kernel.kallsyms]                   [k] eth_type_trans
     1.65%     -0.51%  [kernel.kallsyms]                   [k] _raw_spin_lock_irqsave
     0.14%     +0.50%  [kernel.kallsyms]                   [k] bnxt_start_xmit
     0.54%     -0.48%  [kernel.kallsyms]                   [k] __skb_frag_unref
     0.91%     +0.48%  [cls_bpf]                           [k] 0x0000000000000010
     0.00%     +0.47%  [kernel.kallsyms]                   [k] ipv6_gro_receive
     0.76%     -0.45%  [kernel.kallsyms]                   [k] tcp_rcv_established
     0.94%     -0.45%  [kernel.kallsyms]                   [k] __inet6_lookup_established
     0.31%     +0.43%  [kernel.kallsyms]                   [k] __sched_text_start
     0.21%     +0.43%  [kernel.kallsyms]                   [k] poll_idle
     0.91%     -0.42%  [kernel.kallsyms]                   [k] tcp_try_coalesce
     0.91%     -0.42%  [kernel.kallsyms]                   [k] kmem_cache_free
     1.13%     +0.42%  [kernel.kallsyms]                   [k] __bnxt_poll_work
     0.48%     -0.41%  [kernel.kallsyms]                   [k] tcp_urg
               +0.39%  [kernel.kallsyms]                   [k] memcpy
     0.51%     -0.38%  [kernel.kallsyms]                   [k] _raw_read_unlock_irqrestore
               +0.38%  [kernel.kallsyms]                   [k] __skb_gro_checksum_complete
               +0.37%  [kernel.kallsyms]                   [k] irq_entries_start
     0.16%     +0.36%  [kernel.kallsyms]                   [k] bpf_sk_storage_get
     0.62%     -0.36%  [kernel.kallsyms]                   [k] page_pool_refill_alloc_cache
     0.08%     +0.35%  [kernel.kallsyms]                   [k] ip6_finish_output2
     0.14%     +0.34%  [kernel.kallsyms]                   [k] bnxt_poll_p5
     0.06%     +0.33%  [sch_fq]                            [k] 0x0000000000000020
     0.04%     +0.32%  [kernel.kallsyms]                   [k] __dev_queue_xmit
     0.75%     -0.32%  [kernel.kallsyms]                   [k] __xdp_build_skb_from_frame
     0.67%     -0.31%  [kernel.kallsyms]                   [k] sock_def_readable
     0.05%     +0.31%  [kernel.kallsyms]                   [k] netif_skb_features
               +0.30%  [kernel.kallsyms]                   [k] tcp_gro_pull_header
     0.49%     -0.29%  [kernel.kallsyms]                   [k] napi_pp_put_page
     0.18%     +0.29%  [kernel.kallsyms]                   [k] call_function_single_prep_ipi
     0.40%     -0.28%  [kernel.kallsyms]                   [k] _raw_read_lock_irqsave
     0.11%     +0.27%  [kernel.kallsyms]                   [k] raw6_local_deliver
     0.18%     +0.26%  [kernel.kallsyms]                   [k] ip6_dst_check
     0.42%     -0.26%  [kernel.kallsyms]                   [k] netif_receive_skb_list_internal
     0.05%     +0.26%  [kernel.kallsyms]                   [k] __qdisc_run
     0.75%     +0.25%  [kernel.kallsyms]                   [k] __build_skb_around
     0.05%     +0.25%  [kernel.kallsyms]                   [k] htab_map_hash
     0.09%     +0.24%  [kernel.kallsyms]                   [k] net_rx_action
     0.07%     +0.23%  <redacted>
     0.45%     -0.23%  [kernel.kallsyms]                   [k] migrate_enable
     0.48%     -0.23%  [kernel.kallsyms]                   [k] mem_cgroup_charge_skmem
     0.26%     +0.23%  [kernel.kallsyms]                   [k] __switch_to
     0.15%     +0.22%  [kernel.kallsyms]                   [k] sock_rfree
     0.30%     -0.22%  [kernel.kallsyms]                   [k] tcp_add_backlog
<snip>
     5.68%             bpf_prog_17fea1bb6503ed98_steering  [k] bpf_prog_17fea1bb6503ed98_steering
     2.10%             [kernel.kallsyms]                   [k] __skb_checksum_complete
     0.71%             [kernel.kallsyms]                   [k] __memset
     0.54%             [kernel.kallsyms]                   [k] __memcpy
     0.18%             [kernel.kallsyms]                   [k] __irqentry_text_start
<snip>

Please let me know if you want me to collect any other data.

Thanks,
Daniel
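P.S. The Delta rows in the tables above are plain percentage changes of the five-run averages relative to baseline. A quick sketch to recompute the two throughput deltas (the helper name is mine, not part of any tooling in this thread):

```python
def pct_delta(baseline_avg, patched_avg):
    """Percentage change of the patched average vs. the baseline average."""
    return (patched_avg - baseline_avg) / baseline_avg * 100.0

# Throughput (Mbit/s) averages taken from the tables above.
print(round(pct_delta(15191.76, 19820.388), 2))   # cpumap NAPI v2 vs. baseline: 30.47
print(round(pct_delta(21621.126, 12686.316), 2))  # cpumap v2 Olek vs. baseline: -41.32
```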