(answered inline, below)

On Tue, Sep 20, 2022 at 3:17 AM Jesper Dangaard Brouer
<jbrouer@xxxxxxxxxx> wrote:
>
>
> (answered inline, below)
>
> On 19/09/2022 22.55, Adam Smith wrote:
> > Hello,
> >
> > In trying to understand the differences in IRQ utilization and
> > throughput when performing XDP_REDIRECT in a simple netfilter bridge
> > on the Intel i40e, we have encountered behavior we are unable to
> > explain and we would like advice on where to investigate next.
> >
> > The two questions we are seeking guidance for are:
> > 1) Why does XDP in the i40e driver handle interrupts on multiple IRQs,
> > while the same flows are serviced by a single IRQ without XDP
> > (netfilter bridge)?
> >
>
> Remember IRQ smp affinity is configurable via /proc/irq/ files.
> Below bash code simply uses the queue number as the assigned CPU number.
>
> echo " --- Align IRQs: i40e ---"
> # i40e have driver name as starting prefix, making it easier to "catch"
> for F in /proc/irq/*/i40e*-TxRx-*/../smp_affinity_list; do
>     # Extract irqname e.g. "i40e-eth2-TxRx-1"
>     irqname=$(basename $(dirname $(dirname $F))) ;
>     # Substring pattern removal to extract Q-number
>     hwq_nr=${irqname#*-*-*-}
>     echo $hwq_nr > $F
>     #grep . -H $F;
> done
>
> Thus we get this one-to-one mapping of Q-to-CPU number:
>
> $ grep -H . /proc/irq/*/i40e*-TxRx-*/../smp_affinity_list
> /proc/irq/218/i40e-i40e1-TxRx-0/../smp_affinity_list:0
> /proc/irq/219/i40e-i40e1-TxRx-1/../smp_affinity_list:1
> /proc/irq/220/i40e-i40e1-TxRx-2/../smp_affinity_list:2
> /proc/irq/221/i40e-i40e1-TxRx-3/../smp_affinity_list:3
> /proc/irq/222/i40e-i40e1-TxRx-4/../smp_affinity_list:4
> /proc/irq/223/i40e-i40e1-TxRx-5/../smp_affinity_list:5
> /proc/irq/224/i40e-0000:04:00.0:fdir-TxRx-0/../smp_affinity_list:0

Apologies, I should have mentioned that IRQ affinity was already
pinned via the recommended set_irq_affinity script from Intel driver
tools.

> > 2) Why does the i40e driver with XDP under load seemingly get faster
> > when tracing is attached to functions inside the driver’s napi_poll
> > loop?
>
> My theory is: Because you keep the CPU from going into sleep states.
>
> > Our working theory is that the i40e driver is not as efficient in
> > interrupt handling when XDP is enabled. Something in napi_poll is
> > looping too aggressively, and, when artificially slowed by attaching
> > to various kprobes and tracepoints, the slightly delayed code becomes
> > more efficient.
> >
> > Testing setup:
> >
>
> So, the test setup is basically a forwarding scenario using bridging.
> (It reminds me, we should add BPF bridge FIB lookup helpers... Cc lorenzo)
>
> > Without XDP, our iperf3 test utilizes almost 100% CPU on a single core
> > to achieve approximately 9.42 Gbits/sec. Total hard IRQs over 10
> > seconds is as follows:
> > i40e-enp1s0f1-TxRx-1 127k
> > Iperf3 retransmissions are roughly 0.
>
> The key here is that your test utilizes almost 100% CPU on a single
> core. From this info I know that the CPU isn't going into deep sleep
> states.
>
>
> > With simple XDP_REDIRECT programs installed on both interfaces, CPU
> > usage drops to ~43% on two different cores (one significantly higher
> > than the other), and hard IRQs over 10 seconds is as follows:
> > i40e-enp1s0f0-TxRx-1 169k
> > i40e-enp1s0f0-TxRx-2 82k
>
> To avoid the jumping between IRQs, you should configure the smp_affinity
> as described above, BUT it will not solve the drop issue.

As stated above, IRQs were pinned, which is what led us to question
the difference between XDP & Linux bridge.

> > i40e-enp1s0f1-TxRx-1 147k
> > i40e-enp1s0f1-TxRx-2 235k
> > Throughput in this case is only ~8.75 Gbits/sec, and iperf3
> > retransmissions number between 1k and 3k consistently.
>
> The XDP redirect is so fast that the CPU is bored and decides to dive
> into deep sleep state levels.
> If the time it takes to wake up again +
> overhead of starting NAPI (hardirq->softirq) is too long, then packets
> will be dropped due to overflowing hardware RX-queue.
>
> You can directly see the time/latency it takes to wake up from these
> sleep states on your hardware from this grep command:
>
> $ grep -H . /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
> /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:2
> /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:10
> /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:40
> /sys/devices/system/cpu/cpu0/cpuidle/state4/latency:133
>
> As explained in [1] you can calculate back how many bytes are able to
> arrive at a given link speed when sleeping e.g. 133 usec, and then based
> on the expected packet size figure out if the default 512 slots RX-queue
> for i40e is large enough.
>
> [1]
> https://github.com/torvalds/linux/blob/v6.0-rc6/samples/bpf/xdp_redirect_cpu_user.c#L331-L346

RX-queue size was set to 4096 for our tests, which is the maximum
available on the X710.

> > When we use bpftrace to attach multiple BPF programs to i40e functions
> > involved in XDP (e.g., `bpftrace -e 'tracepoint:i40e:i40e_clean_rx_irq
> > {} kprobe:i40e_xmit_xdp_ring {}'`), retransmissions drop to 0,
> > throughput increases to 9.4 Gbits/sec, and CPU utilization on the
> > busier CPU increases to ~73%. Hard IRQs are similar to the
> > XDP_REDIRECT IRQs above.
> >
> > Attaching traces should not logically result in a throughput increase.
> >
> > Any insight or guidance would be greatly appreciated!
>
> Solution#1: Sysadm can configure the system to avoid deep-sleep via:
>
> # tuned-adm profile network-latency
>
> Solution#2: Can be combined with increasing RX-queue size via:
>
> # ethtool -G i40e1 rx 2048
>
> --Jesper

Thank you very much! Changing CPU sleep behaviors explained our 2nd
issue from above with retransmissions and slower speeds without
profiling attached.
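
For reference, the back-calculation Jesper describes above can be
sketched in a few lines of shell. This is only an illustration and not
the code from [1]: the function names are made up here, and the on-wire
frame sizes (1538 bytes for a full 1500-byte MTU frame, 84 bytes for a
minimum-size frame, both counting preamble and inter-frame gap) are
assumptions about the traffic mix.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: how much traffic arrives while a CPU is still
# waking up from an idle state. Integer arithmetic only.

# $1 = link speed in Gbit/s, $2 = wake-up latency in microseconds
bytes_during_sleep() {
    local gbit=$1 usec=$2
    # 1 Gbit/s is 1000 bits per microsecond; divide by 8 for bytes.
    echo $(( gbit * 1000 * usec / 8 ))
}

# $3 = assumed on-wire frame size in bytes
# (frame + preamble + inter-frame gap)
frames_during_sleep() {
    local bytes
    bytes=$(bytes_during_sleep "$1" "$2")
    # Whole frames only, so round down.
    echo $(( bytes / $3 ))
}

bytes_during_sleep 10 133          # bytes arriving in 133 usec
frames_during_sleep 10 133 1538    # assumed full-size frames
frames_during_sleep 10 133 84      # assumed minimum-size frames
```

With the 133 usec state from the cpuidle listing, about 166 KB can
arrive at 10 Gbit/s during one wake-up: on the order of a hundred
full-size frames, or close to two thousand minimum-size ones. That
count is what has to fit in the configured RX ring, and it does not
include the extra hardirq->softirq start-up time mentioned above.
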
We are still at a loss as to the differences in number of IRQs used
between XDP & bridge mode, but performance is now aligned with our
expectations.

In rechecking these numbers after tuning the CPU with tuned-adm, we
did notice that XDP generates roughly 10x the number of hard irqs
compared to non-XDP bridge mode, but only on one interrupt/core. See:

Non-XDP Bridge
$ sudo hardirqs -C 10 1
Tracing hard irq events... Hit Ctrl-C to end.

HARDIRQ                    TOTAL_count
[...]
i40e-enp1s0f1-TxRx-1            118820

XDP (same network flow)
$ sudo hardirqs -C 10 1
Tracing hard irq events... Hit Ctrl-C to end.

HARDIRQ                    TOTAL_count
[...]
i40e-enp1s0f0-TxRx-2             79071
i40e-enp1s0f0-TxRx-1            106929
i40e-enp1s0f1-TxRx-2            993162
i40e-enp1s0f1-TxRx-1            108362

Is it possible that we are seeing hard interrupts from both the RX &
TX packets under XDP? In non-XDP, we notice that we are only seeing
one network interface producing hard interrupts and we are assuming
that the other interface must be serviced fully by polling.

Thank you again!
Adam Smith