Questions about IRQ utilization and throughput with XDP_REDIRECT on Intel i40e

Hello,

While trying to understand the differences in IRQ utilization and
throughput between a simple netfilter bridge and an equivalent
XDP_REDIRECT setup on the Intel i40e, we have encountered behavior we
are unable to explain, and we would like advice on where to
investigate next.

The two questions we are seeking guidance on are:
1) Why does XDP in the i40e driver handle interrupts on multiple IRQs,
while the same flows are serviced by a single IRQ without XDP
(netfilter bridge)?

2) Why does the i40e driver with XDP under load seemingly get faster
when tracing is attached to functions inside the driver’s napi_poll
loop?

Our working theory is that the i40e driver's interrupt handling is
less efficient when XDP is enabled: something in the napi_poll loop is
spinning too aggressively, and when it is artificially slowed by
attached kprobes and tracepoints, the slightly delayed code path
becomes more efficient.
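
To make that theory testable, we intend to histogram how much work
each poll reports, via a kretprobe on i40e_napi_poll. A minimal
libbpf-style sketch follows (not code we have run yet; the map name is
ours, and we are assuming the poll function's return value is the work
completed, or the full budget when the driver asks to be re-polled):

/* poll_work.bpf.c - distribution of work reported per i40e_napi_poll.
 * Build (assumed): clang -O2 -g -target bpf -D__TARGET_ARCH_x86 \
 *                  -c poll_work.bpf.c
 */
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 65);    /* one slot per work_done value 0..64 */
    __type(key, __u32);
    __type(value, __u64);
} poll_work SEC(".maps");

SEC("kretprobe/i40e_napi_poll")
int BPF_KRETPROBE(i40e_napi_poll_ret, int work_done)
{
    __u32 slot = work_done < 0 ? 0 : work_done;
    __u64 *cnt;

    if (slot > 64)
        slot = 64;              /* clamp to the budget slot */

    cnt = bpf_map_lookup_elem(&poll_work, &slot);
    if (cnt)
        __sync_fetch_and_add(cnt, 1);
    return 0;
}

char _license[] SEC("license") = "GPL";

A pile-up in the top slot (polls that consume the whole budget and ask
to be re-polled) or at zero (polls that find nothing) would both be
consistent with the loop spinning harder than it needs to.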

Testing setup:

Without XDP, our iperf3 test uses almost 100% of a single core and
achieves approximately 9.42 Gbits/sec. Total hard IRQs over 10
seconds are as follows:
i40e-enp1s0f1-TxRx-1            127k
iperf3 retransmissions are roughly zero.

With simple XDP_REDIRECT programs installed on both interfaces, CPU
usage drops to ~43% across two different cores (one significantly
busier than the other), and hard IRQs over 10 seconds are as follows:
i40e-enp1s0f0-TxRx-1            169k
i40e-enp1s0f0-TxRx-2             82k
i40e-enp1s0f1-TxRx-1            147k
i40e-enp1s0f1-TxRx-2            235k
Throughput in this case is only ~8.75 Gbits/sec, and iperf3
retransmissions consistently number between 1k and 3k.
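
For reference, the redirect programs are of roughly the following
shape. This is a minimal sketch rather than our exact code, and
PEER_IFINDEX is a hypothetical stand-in for however the egress port is
chosen (a devmap with bpf_redirect_map() would be the more common
pattern):

/* xdp_bridge.bpf.c - forward every frame to the opposite port. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define PEER_IFINDEX 4          /* hypothetical ifindex of the peer */

SEC("xdp")
int xdp_bridge(struct xdp_md *ctx)
{
    /* bpf_redirect() returns XDP_REDIRECT on success, so every
     * frame is queued for transmit on the peer interface as-is. */
    return bpf_redirect(PEER_IFINDEX, 0);
}

char _license[] SEC("license") = "GPL";

One copy is attached to each interface, e.g. `ip link set dev
enp1s0f0 xdp obj xdp_bridge.bpf.o sec xdp`, with the mirror image on
enp1s0f1.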

When we use bpftrace to attach multiple BPF programs to i40e functions
involved in XDP (e.g., `bpftrace -e 'tracepoint:i40e:i40e_clean_rx_irq
{} kprobe:i40e_xmit_xdp_ring {}'`), retransmissions drop to 0,
throughput increases to 9.4 Gbits/sec, and CPU utilization on the
busier CPU rises to ~73%. Hard IRQ counts are similar to the
XDP_REDIRECT numbers above.

Logically, attaching traces should not result in a throughput increase.

Any insight or guidance would be greatly appreciated!

Adam Smith
