On Tue, Jul 12, 2022 at 11:19:11PM +0200, Toke Høiland-Jørgensen wrote:
> Adam Smith <hrotsvit@xxxxxxxxx> writes:
>
> > Hello,
> >
> > I have a question regarding bpf_redirect/bpf_redirect_map and latency
> > that we are seeing in a test. The environment is as follows:
> >
> > - Debian Bullseye, running the 5.18.0-0.bpo.1-amd64 kernel from
> >   bullseye-backports (also tested on 5.16)
> > - Intel Xeon X3430 @ 2.40 GHz, 4 cores, no HT
> > - Intel X710-DA2 using the i40e driver included with the kernel
> > - Both interfaces (enp1s0f0 and enp1s0f1) in a simple netfilter bridge
> > - Ring parameters for rx/tx are both set to the maximum of 4096, with
> >   no other NIC-specific parameters changed
> >
> > Each interface has 4 combined IRQs, pinned via set_irq_affinity.
> > `irqbalance` is not installed.
> >
> > Traffic is generated by a directly attached machine via iperf3 3.9
> > (`iperf3 -c 192.168.1.3 -t 0 --bidir`) to a directly attached server
> > on the other side.
> >
> > The server in question does nothing more than forward packets as a
> > transparent bridge.
> >
> > An XDP program is installed on f0 to redirect to f1, and on f1 to
> > redirect to f0. I have tried programs that simply call
> > `bpf_redirect()`, as well as programs that share a device map and
> > call `bpf_redirect_map()`, with identical results.
> >
> > When the channel parameters for each interface are reduced to a
> > single IRQ via `ethtool -L enp1s0f0 combined 1`, and both interfaces'
> > IRQs are bound to the same CPU core via smp_affinity, XDP produces an
> > improved bitrate with reduced CPU utilization compared to the non-XDP
> > tests:
> >
> > - Stock netfilter bridge: 9.11 Gbps in both directions at 98%
> >   utilization of the pinned core.
> > - XDP: approximately 9.18 Gbps in both directions at 50% utilization
> >   of the pinned core.
> >
> > However, when multiple cores are engaged (combined 4, with
> > set_irq_affinity), XDP processes markedly fewer packets per second
> > (950,000 vs. approximately 1.6 million). iperf3 also shows a large
> > number of retransmissions in its output regardless of CPU engagement
> > (approximately 6,500 with XDP over 2 minutes vs. 850 with the
> > single-core tests).
> >
> > This is a sample taken from the linux/samples xdp_monitor tool
> > showing redirection and transmission of packets with XDP engaged:
> >
> > Summary                    944,508 redir/s     0 err,drop/s   944,506 xmit/s
> >   kthread                        0 pkt/s       0 drop/s            0 sched
> > redirect total             944,508 redir/s
> >   cpu:0                    470,148 redir/s
> >   cpu:2                     15,078 redir/s
> >   cpu:3                    459,282 redir/s
> > redirect_err                     0 error/s
> > xdp_exception                    0 hit/s
> > devmap_xmit total          944,506 xmit/s      0 drop/s            0 drv_err/s
> >   cpu:0                    470,148 xmit/s      0 drop/s            0 drv_err/s
> >   cpu:2                     15,078 xmit/s      0 drop/s            0 drv_err/s
> >   cpu:3                    459,280 xmit/s      0 drop/s            0 drv_err/s
> > xmit enp1s0f0->enp1s0f1    485,249 xmit/s      0 drop/s            0 drv_err/s
> >   cpu:0                    470,172 xmit/s      0 drop/s            0 drv_err/s
> >   cpu:2                     15,078 xmit/s      0 drop/s            0 drv_err/s
> > xmit enp1s0f1->enp1s0f0    459,263 xmit/s      0 drop/s            0 drv_err/s
> >   cpu:3                    459,263 xmit/s      0 drop/s            0 drv_err/s
> >
> > Our current hypothesis is that this is a CPU affinity issue. We
> > believe a different core is being used for transmission. In an effort
> > to prove this, how can we measure whether bpf_redirect() is causing
> > packets to be transmitted by a different core than the one they were
> > received on? We are still trying to understand how bpf_redirect()
> > selects which core/IRQ to transmit on and would appreciate any
> > insight or follow-up material to research.
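For illustration, a minimal pair of programs along the lines Adam describes could look like the sketch below. This is only a sketch: the ifindex value, the devmap layout, and the program names are assumptions made for the example, not taken from the actual setup.

// SPDX-License-Identifier: GPL-2.0
/* Sketch of the two redirect approaches described above. The ifindex and
 * map contents here are placeholders, not the poster's real values. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Device map shared by both interfaces; a loader would populate slot 0
 * with enp1s0f0's ifindex and slot 1 with enp1s0f1's ifindex. */
struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(max_entries, 2);
	__type(key, __u32);
	__type(value, __u32);
} tx_port SEC(".maps");

/* Variant 1: plain bpf_redirect() to a hard-coded ifindex. */
SEC("xdp")
int xdp_redirect_f0_to_f1(struct xdp_md *ctx)
{
	/* 3 is a placeholder for enp1s0f1's ifindex. */
	return bpf_redirect(3, 0);
}

/* Variant 2: bpf_redirect_map() through the shared devmap. */
SEC("xdp")
int xdp_redirect_map_f0_to_f1(struct xdp_md *ctx)
{
	/* Slot 1 is assumed to hold enp1s0f1. */
	return bpf_redirect_map(&tx_port, 1, 0);
}

char _license[] SEC("license") = "GPL";

Attaching the first (or second) program to enp1s0f0, and a mirror of it that targets the other ifindex/map slot to enp1s0f1, gives the f0<->f1 forwarding setup described above.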
>
> There is no mechanism in bpf_redirect() to switch CPUs (outside of
> cpumap). When you call XDP_REDIRECT, the frame will be added to a
> per-device, per-CPU flush list, which will be flushed on that same CPU.
> The i40e driver allocates separate rings for XDP, though, and I'm not
> sure how it does that, so maybe those are what's missing. You should be
> able to see drops in the output if that's what's going on; and the
> packets should still be processed by XDP.
>
> So it sounds more like the hardware configuration is causing packet
> loss before it even hits XDP. Do you see anything in the ethtool stats
> that might explain where packets are being dropped?

I don't know exactly which CPUs the IRQs end up bound to, but most
probably this is a driver issue, as Toke says. i40e_xdp_xmit() uses
smp_processor_id() as the index into the XDP rings array, so if you limit
the queue count to 4 and bind an IRQ to, say, CPU 10, the transmit
returns -ENXIO because queue_index will be >= vsi->num_queue_pairs.

I believe such issues have been addressed in the ice driver: there, the
XDP rings array is sized to num_possible_cpus() regardless of the user's
queue count setting, so smp_processor_id() can be used safely.

Adam, could you skip the `ethtool -L $IFACE combined 4` step and work
with your 4 flows to see if there is any difference?

Maciej

>
> -Toke
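For reference, the check described above sits in the i40e driver's ndo_xdp_xmit path. The snippet below is an abridged, paraphrased sketch of that part of drivers/net/ethernet/intel/i40e/i40e_txrx.c, not a verbatim or complete copy, and the surrounding code may differ between kernel versions.

/* Abridged sketch of i40e_xdp_xmit(); error handling and the actual
 * transmit loop are omitted. */
int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
		  u32 flags)
{
	struct i40e_netdev_priv *np = netdev_priv(dev);
	unsigned int queue_index = smp_processor_id(); /* CPU id picks the XDP Tx ring */
	struct i40e_vsi *vsi = np->vsi;

	/* ... */

	/* If the redirect runs on a CPU with no matching XDP Tx ring
	 * (queue count limited below that CPU's number), the transmit
	 * fails and the frames are dropped. */
	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
		return -ENXIO;

	/* ... frames[0..n-1] are queued on vsi->xdp_rings[queue_index] ... */
}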