Hey folks,

we have a node with 4 dual-port MT28908 Mellanox cards installed. We want to redirect the incoming traffic from an ingress port to an egress port on the same NIC using the bpf_redirect() helper function. Each ingress port receives roughly 25 Gbit/s of traffic with a packet size of about 1500 bytes. It is almost exclusively UDP multicast traffic: four multicast groups per bridge, coming from 64 different source IPs.

I raised the rx ring buffer size of the ingress ports to 8192 and set up interrupt coalescing:

ethtool -C ingress_port_i rx-frames 512
ethtool -C ingress_port_i rx-usecs 16

I still need to check whether these numbers make sense. The CPU utilization is moderate, around 20-30%.

On top of that we would like to record the traffic streams using the bpf_ringbuf_output() helper. For now I only write into the ring buffer to make the data available in user space. I have 16 ring buffers, one per CPU. I am currently not sure how to enforce that a ring buffer sits on the right NUMA node. Is there a way? numastat tells me that I have zero NUMA misses, so that is possibly OK.

In user space I start 16 threads, pinned to the cores, each running a handler that processes the content of its ring buffer. Currently I only count packets. With this in place all cores are fully utilized and I lose packets.

perf top tells me (only CPU 8):

   PerfTop:    3868 irqs/sec  kernel:95.9%  exact: 97.6%  lost: 0/0  drop: 0/0 [4000Hz cycles],  (all, CPU: 8)
---------------------------------------------------------------------------------------------------

    25.72%  11804  [kernel]  [k] xdp_do_redirect
     8.73%   4008  [kernel]  [k] memcpy
     7.19%   3297  [kernel]  [k] bq_enqueue
     7.16%   3287  [kernel]  [k] check_preemption_disabled
     4.28%   1959  [kernel]  [k] bpf_ringbuf_output
     4.20%   1925  [kernel]  [k] mlx5e_xdp_handle

and in perf record I indeed find the memcpy issue plus some load for mlx5e_napi_poll:

    --45.35%--__napi_poll
              mlx5e_napi_poll
              |
              |--36.80%--mlx5e_poll_rx_cq
              |          |
              |          |--35.64%--mlx5e_handle_rx_cqe_mpwrq
              |          |          |
              |          |           --35.56%--mlx5e_skb_from_cqe_mpwrq_linear
              |          |                     |
              |          |                      --34.57%--mlx5e_xdp_handle
              |          |                                |
              |          |                                |--32.15%--bpf_prog_82775e2abf7feec0_xdp_tap_ingress_prog
              |          |                                |          |
              |          |                                |           --31.20%--bpf_ringbuf_output
              |          |                                |                     |
              |          |                                |                      --30.69%--memcpy
              |          |                                |
              |          |                                 --0.77%--asm_common_interrupt
              |          |                                           common_interrupt
              |          |                                           |
              |          |                                            --0.72%--__common_interrupt
              |          |                                                      handle_edge_irq

Is there any chance to improve the ring buffer output? Or could I get the packets onto disk in some other way using BPF helper functions? Do I need to gather more or other information?

Thank you for your help.

Cheers,
Henning
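
P.S. In case it helps, here is a stripped-down sketch of the kernel side. It is not the exact production code: the map and constant names (tap_rbs, EGRESS_IFINDEX, SAMPLE_LEN) are illustrative, the real program copies the full frame rather than a fixed-size sample, and the 16 inner ring buffers are created and inserted from user space.

/* SPDX-License-Identifier: GPL-2.0 */
/* Simplified sketch of the ingress program: redirect every frame to the
 * paired egress port and copy a sample of it into the ring buffer that
 * belongs to the current CPU.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define NUM_CPUS       16
#define EGRESS_IFINDEX 5        /* placeholder ifindex of the egress port */
#define SAMPLE_LEN     256      /* fixed copy size keeps this sketch verifier-friendly */

struct ringbuf_map {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 1 << 24);   /* 16 MiB per CPU */
};

/* outer array: one inner ring buffer per CPU, populated from user space */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
	__uint(max_entries, NUM_CPUS);
	__type(key, int);
	__array(values, struct ringbuf_map);
} tap_rbs SEC(".maps");

SEC("xdp")
int xdp_tap_ingress_prog(struct xdp_md *ctx)
{
	void *data     = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	int cpu = bpf_get_smp_processor_id();
	void *rb;

	/* copy (a sample of) the frame into this CPU's ring buffer */
	rb = bpf_map_lookup_elem(&tap_rbs, &cpu);
	if (rb && data + SAMPLE_LEN <= data_end)
		bpf_ringbuf_output(rb, data, SAMPLE_LEN, 0);

	/* forward the frame to the egress port on the same NIC */
	return bpf_redirect(EGRESS_IFINDEX, 0);
}

char LICENSE[] SEC("license") = "GPL";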
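
And a correspondingly simplified sketch of the user-space side: one consumer thread per CPU, pinned to that CPU, which for now only counts packets. Loading and attaching the XDP object, creating the 16 inner ring buffers and populating tap_rbs are omitted; outer_map_fd stands in for the fd of tap_rbs, and error handling is stripped.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

#define NUM_CPUS 16

static int outer_map_fd;                    /* fd of tap_rbs, set after load */
static unsigned long pkt_count[NUM_CPUS];   /* one counter per consumer thread */

static int handle_sample(void *ctx, void *data, size_t len)
{
	long cpu = (long)ctx;

	pkt_count[cpu]++;                   /* currently we only count packets */
	return 0;
}

static void *consumer(void *arg)
{
	long cpu = (long)arg;
	__u32 key = cpu, inner_id = 0;
	struct ring_buffer *rb;
	cpu_set_t set;
	int inner_fd;

	/* pin this thread to the CPU whose ring buffer it drains */
	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	/* the outer lookup returns the inner map id; turn it into an fd */
	bpf_map_lookup_elem(outer_map_fd, &key, &inner_id);
	inner_fd = bpf_map_get_fd_by_id(inner_id);

	rb = ring_buffer__new(inner_fd, handle_sample, (void *)cpu, NULL);
	while (ring_buffer__poll(rb, 100 /* ms */) >= 0)
		;

	ring_buffer__free(rb);
	return NULL;
}

int main(void)
{
	pthread_t tid[NUM_CPUS];
	long cpu;

	/* ... open/load/attach the XDP object, create the inner ring
	 * buffers, populate tap_rbs, set outer_map_fd ... */

	for (cpu = 0; cpu < NUM_CPUS; cpu++)
		pthread_create(&tid[cpu], NULL, consumer, (void *)cpu);
	for (cpu = 0; cpu < NUM_CPUS; cpu++)
		pthread_join(tid[cpu], NULL);

	for (cpu = 0; cpu < NUM_CPUS; cpu++)
		printf("cpu %ld: %lu packets\n", cpu, pkt_count[cpu]);
	return 0;
}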