Hi, Maciej - in this particular situation, `combined 4` was selected
because the CPU being used only has 4 cores, and 4 is what the driver
auto-selects upon boot as well.

Toke - we are seeing drops from `port.rx_dropped` on both interfaces:

1 IRQ, same CPU, no XDP       -- port.rx_dropped: 0 pps / interface
1 IRQ, same CPU, XDP_REDIRECT -- port.rx_dropped: approx. 50-75 pps / interface
4 IRQ, no XDP                 -- port.rx_dropped: approx. 25-50 pps / interface
4 IRQ, XDP_REDIRECT           -- port.rx_dropped: approx. 2000 pps / interface

`rx_dropped` remains 0 in all cases.

Of note, when XDP is not used in the 4 IRQ setup, CPU load shows up on 2
cores, corresponding to the IRQs handling the 2 primary traffic flows
generated by bidirectional iperf3 (a byproduct of RSS). When XDP is
used, the load on those two cores drops significantly, but we see an
increased load on a 3rd core.

Thanks!

Adam

On Wed, Jul 13, 2022 at 5:17 AM Maciej Fijalkowski
<maciej.fijalkowski@xxxxxxxxx> wrote:
>
> On Tue, Jul 12, 2022 at 11:19:11PM +0200, Toke Høiland-Jørgensen wrote:
> > Adam Smith <hrotsvit@xxxxxxxxx> writes:
> >
> > > Hello,
> > >
> > > I have a question regarding bpf_redirect/bpf_redirect_map and latency
> > > that we are seeing in a test. The environment is as follows:
> > >
> > > - Debian Bullseye, running the 5.18.0-0.bpo.1-amd64 kernel from
> > > Bullseye-backports (also tested on 5.16)
> > > - Intel Xeon X3430 @ 2.40GHz, 4 cores, no HT
> > > - Intel X710-DA2 using the i40e driver included with the kernel
> > > - Both interfaces (enp1s0f0 and enp1s0f1) in a simple netfilter bridge
> > > - Ring parameters for rx/tx are both set to the maximum of 4096, with
> > > no other NIC-specific parameters changed
> > >
> > > Each interface has 4 combined IRQs, pinned per set_irq_affinity.
> > > `irqbalance` is not installed.
> > >
> > > Traffic is generated by another directly attached machine via iperf3
> > > 3.9 (`iperf3 -c 192.168.1.3 -t 0 --bidir`) to a directly attached
> > > server on the other side.
> > >
> > > The server in question does nothing more than forward packets as a
> > > transparent bridge.
> > >
> > > An XDP program is installed on f0 to redirect to f1, and on f1 to
> > > redirect to f0. I have tried programs that simply call
> > > `bpf_redirect()`, as well as programs that share a device map and
> > > call `bpf_redirect_map()`, with identical results.
> > >
> > > When the channel parameters for each interface are reduced to a
> > > single IRQ via `ethtool -L enp1s0f0 combined 1`, and both interface
> > > IRQs are bound to the same CPU core via smp_affinity, XDP produces
> > > improved bitrate with reduced CPU utilization over non-XDP tests:
> > > - Stock netfilter bridge: 9.11 Gbps in both directions at 98%
> > > utilization of the pinned core.
> > > - XDP: approximately 9.18 Gbps in both directions at 50% utilization
> > > of the pinned core.
> > >
> > > However, when multiple cores are engaged (combined 4, with
> > > set_irq_affinity), XDP processes markedly fewer packets per second
> > > (950,000 vs approximately 1.6 million). iperf3 also shows a large
> > > number of retransmissions in its output regardless of CPU engagement
> > > (approximately 6,500 with XDP over 2 minutes vs 850 with single-core
> > > tests).
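> > >
> > > For reference, both program variants follow the usual minimal
> > > redirect pattern. A rough sketch of the shared-devmap flavour is
> > > below (illustrative only -- the map name, sizing, and the choice of
> > > keying a devmap hash by ingress ifindex are placeholders, not our
> > > exact program):
> > >
> > > #include <linux/bpf.h>
> > > #include <bpf/bpf_helpers.h>
> > >
> > > /* Shared by both attachments; userspace inserts one entry per
> > >  * interface, mapping its ifindex to the peer's ifindex.
> > >  */
> > > struct {
> > >         __uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
> > >         __uint(max_entries, 8);
> > >         __type(key, __u32);
> > >         __type(value, __u32);
> > > } tx_ports SEC(".maps");
> > >
> > > SEC("xdp")
> > > int xdp_redirect_to_peer(struct xdp_md *ctx)
> > > {
> > >         /* Look up the peer interface for the ingress device and
> > >          * hand the frame to the devmap redirect path.
> > >          */
> > >         return bpf_redirect_map(&tx_ports, ctx->ingress_ifindex, 0);
> > > }
> > >
> > > char _license[] SEC("license") = "GPL";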
> > >
> > > This is a sample taken from the linux/samples xdp_monitor tool,
> > > showing redirection and transmission of packets with XDP engaged:
> > >
> > > Summary                      944,508 redir/s        0 err,drop/s    944,506 xmit/s
> > >   kthread                          0 pkt/s          0 drop/s              0 sched
> > > redirect total               944,508 redir/s
> > >   cpu:0                      470,148 redir/s
> > >   cpu:2                       15,078 redir/s
> > >   cpu:3                      459,282 redir/s
> > > redirect_err                       0 error/s
> > > xdp_exception                      0 hit/s
> > > devmap_xmit total            944,506 xmit/s         0 drop/s              0 drv_err/s
> > >   cpu:0                      470,148 xmit/s         0 drop/s              0 drv_err/s
> > >   cpu:2                       15,078 xmit/s         0 drop/s              0 drv_err/s
> > >   cpu:3                      459,280 xmit/s         0 drop/s              0 drv_err/s
> > > xmit enp1s0f0->enp1s0f1      485,249 xmit/s         0 drop/s              0 drv_err/s
> > >   cpu:0                      470,172 xmit/s         0 drop/s              0 drv_err/s
> > >   cpu:2                       15,078 xmit/s         0 drop/s              0 drv_err/s
> > > xmit enp1s0f1->enp1s0f0      459,263 xmit/s         0 drop/s              0 drv_err/s
> > >   cpu:3                      459,263 xmit/s         0 drop/s              0 drv_err/s
> > >
> > > Our current hypothesis is that this is a CPU affinity issue: we
> > > believe a different core is being used for transmission than for
> > > reception. To try to prove this, how can we measure whether
> > > bpf_redirect() is causing packets to be transmitted by a different
> > > core than they were received on? We are still trying to understand
> > > how bpf_redirect() selects which core/IRQ to transmit on and would
> > > appreciate any insight or follow-up material to research.
> >
> > There is no mechanism in bpf_redirect() to switch CPUs (outside of
> > cpumap). When you call XDP_REDIRECT, the frame will be added to a
> > per-device per-CPU flush list, which will be flushed on that same CPU.
> > The i40e driver allocates separate rings for XDP, though, and I'm not
> > sure how it does that, so maybe those are what's missing. You should
> > be able to see drops in the output if that's what's going on; and the
> > packets should still be processed by XDP.
> >
> > So this sounds more like the hardware configuration is causing packet
> > loss before it even hits XDP. Do you see anything in the ethtool stats
> > that might explain where packets are being dropped?
>
> I don't know exactly how the IRQs are bound to which CPUs, but most
> probably this is a driver issue, as Toke is saying.
>
> i40e_xdp_xmit() uses smp_processor_id() as an index into the XDP rings
> array, so if you limit the queue count to 4 and bind an IRQ to, say,
> CPU 10, you'll return with -ENXIO, as queue_index will be >=
> vsi->num_queue_pairs.
>
> I believe that such issues were addressed in the ice driver. There, the
> XDP rings array is sized to num_possible_cpus() regardless of the
> user's queue count setting, so smp_processor_id() can be used safely.
>
> Adam, could you skip the `ethtool -L $IFACE combined 4` and work with
> your 4 flows to see if there is any difference?
>
> Maciej
>
> >
> > -Toke
> >
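
P.S. For anyone following the thread: the check Maciej refers to sits at
the top of the driver's ndo_xdp_xmit implementation. A simplified sketch
of the logic as I read it (paraphrased, not the verbatim i40e source):

/* The XDP TX ring is chosen purely by the CPU running the flush, so a
 * CPU whose id is >= the configured number of queue pairs has no ring
 * to transmit on and the redirected frames are rejected with -ENXIO.
 */
int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
                  u32 flags)
{
        struct i40e_netdev_priv *np = netdev_priv(dev);
        struct i40e_vsi *vsi = np->vsi;
        unsigned int queue_index = smp_processor_id();

        if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
                return -ENXIO;  /* no XDP ring for this CPU */

        /* ... otherwise queue the frames on vsi->xdp_rings[queue_index] ... */
        return n;
}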