Re: Questions about IRQ utilization and throughput with XDP_REDIRECT on Intel i40e

On 20/09/2022 19.37, Adam Smith wrote:
(answered inline, below)

On Tue, Sep 20, 2022 at 3:17 AM Jesper Dangaard Brouer
<jbrouer@xxxxxxxxxx> wrote:

(answered inline, below)

On 19/09/2022 22.55, Adam Smith wrote:
Hello,

In trying to understand the differences in IRQ utilization and
throughput when performing XDP_REDIRECT in a simple netfilter bridge
on the Intel i40e, we have encountered behavior we are unable to
explain and we would like advice on where to investigate next.

The two questions we are seeking guidance on are:
1) Why does XDP in the i40e driver handle interrupts on multiple IRQs,
while the same flows are serviced by a single IRQ without XDP
(netfilter bridge)?


Remember that IRQ SMP affinity is configurable via the /proc/irq/ files.
The bash code below simply uses the queue number as the assigned CPU number.

echo " --- Align IRQs: i40e ---"
# i40e IRQ names start with the driver name, making them easier to "catch"
for F in /proc/irq/*/i40e*-TxRx-*/../smp_affinity_list; do
     # Extract irqname e.g. "i40e-eth2-TxRx-1"
     irqname=$(basename $(dirname $(dirname $F))) ;
     # Substring pattern removal to extract Q-number
     hwq_nr=${irqname#*-*-*-}
     echo $hwq_nr > $F
     #grep . -H $F;
done

Thus we get this one-to-one mapping of Q-to-CPU number:

   $ grep -H . /proc/irq/*/i40e*-TxRx-*/../smp_affinity_list
   /proc/irq/218/i40e-i40e1-TxRx-0/../smp_affinity_list:0
   /proc/irq/219/i40e-i40e1-TxRx-1/../smp_affinity_list:1
   /proc/irq/220/i40e-i40e1-TxRx-2/../smp_affinity_list:2
   /proc/irq/221/i40e-i40e1-TxRx-3/../smp_affinity_list:3
   /proc/irq/222/i40e-i40e1-TxRx-4/../smp_affinity_list:4
   /proc/irq/223/i40e-i40e1-TxRx-5/../smp_affinity_list:5
   /proc/irq/224/i40e-0000:04:00.0:fdir-TxRx-0/../smp_affinity_list:0

Apologies, I should have mentioned that IRQ affinity was already
pinned via the recommended set_irq_affinity script from Intel driver
tools.

It sounds like a bug (Cc Magnus) if you have pinned the IRQ affinity and traffic still migrates around anyway.

There are two possible setup issues that could cause this:

(#1) When loading XDP, some drivers reset too much of the link
     config (like ixgbe), which could lead to smp_affinity getting reset
     to defaults. (Check the setting with the grep -H . above.)

(#2) The i40e NIC's hardware ATR/Flow-Director could be what moves
     traffic in unexpected ways.  You can disable it via ethtool by
     running this command:

   ethtool -K i40e1 ntuple-filters off
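
   (To verify the setting afterwards: ethtool -k i40e1 | grep ntuple)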

(Cc Federico as they mention ATR/Flow-Director in their paper)


2) Why does the i40e driver with XDP under load seemingly get faster
when tracing is attached to functions inside the driver’s napi_poll
loop?

My theory is: Because you keep the CPU from going into sleep states.

Our working theory is that the i40e driver is not as efficient in
interrupt handling when XDP is enabled. Something in napi_poll is
looping too aggressively, and, when artificially slowed by attaching
to various kprobes and tracepoints, the slightly delayed code becomes
more efficient.

Testing setup:


So, the test setup is basically a forwarding scenario using bridging.
(It reminds me, we should add BPF bridge FIB lookup helpers... Cc lorenzo)

Without XDP, our iperf3 test utilizes almost 100% CPU on a single core
to achieve approximately 9.42 Gbits/sec. Total hard IRQs over 10
seconds is as follows:
i40e-enp1s0f1-TxRx-1            127k
Iperf3 retransmissions are roughly 0.

The key here is that your test utilizes almost 100% CPU on a single
core.  From this info I know that the CPU isn't going into deep sleep
states.


With simple XDP_REDIRECT programs installed on both interfaces, CPU
usage drops to ~43% on two different cores (one significantly higher
than the other), and hard IRQs over 10 seconds is as follows:
i40e-enp1s0f0-TxRx-1            169k
i40e-enp1s0f0-TxRx-2              82k

To avoid the traffic jumping between IRQs, you should configure the smp_affinity
as described above, BUT it will not solve the drop issue.

As stated above, IRQs were pinned, which is what led us to question
the difference between XDP & Linux bridge.

i40e-enp1s0f1-TxRx-1            147k
i40e-enp1s0f1-TxRx-2            235k
Throughput in this case is only ~8.75 Gbits/sec, and iperf3
retransmissions number between 1k and 3k consistently.

The XDP redirect is so fast that the CPU gets bored and decides to dive
into deep sleep states.  If the time it takes to wake up again + the
overhead of starting NAPI (hardirq->softirq) is too long, then packets
will be dropped due to overflowing the hardware RX-queue.

You can directly see the time/latency it takes to wake up from these
sleep states on your hardware from this grep command:

   $ grep -H . /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
   /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
   /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:2
   /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:10
   /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:40
   /sys/devices/system/cpu/cpu0/cpuidle/state4/latency:133

As explained in [1], you can calculate how many bytes can arrive at a
given link speed while sleeping e.g. 133 usec, and then, based on the
expected packet size, figure out whether the default 512-slot RX-queue
for i40e is large enough.

[1]
https://github.com/torvalds/linux/blob/v6.0-rc6/samples/bpf/xdp_redirect_cpu_user.c#L331-L346
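
For example, a quick back-of-envelope sketch in bash (assumed numbers:
10G link, 133 usec wakeup latency, 1500-byte packets; adjust to your
own setup):

   link_bps=10000000000    # 10 Gbit/s
   sleep_usec=133          # deepest C-state exit latency from the grep above
   pkt_size=1500           # expected packet size in bytes
   bytes=$(( link_bps / 8 * sleep_usec / 1000000 ))   # ~166250 bytes
   pkts=$(( bytes / pkt_size ))                       # ~110 packets
   echo "bytes=$bytes pkts=$pkts"   # compare pkts against the RX-queue size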

RX-queue size was set to 4096 for our tests, which is the maximum
available on the X710.

When we use bpftrace to attach multiple BPF programs to i40e functions
involved in XDP (e.g., `bpftrace -e 'tracepoint:i40e:i40e_clean_rx_irq
{} kprobe:i40e_xmit_xdp_ring {}'`), retransmissions drop to 0,
throughput increases to 9.4 Gbits/sec, and CPU utilization on the
busier CPU increases to ~73%. Hard IRQs are similar to the
XDP_REDIRECT IRQs above.

Attaching traces should not logically result in a throughput increase.

Any insight or guidance would be greatly appreciated!

Solution #1: The sysadmin can configure the system to avoid deep sleep via:

   # tuned-adm profile network-latency

Solution #2: This can be combined with increasing the RX-queue size via:

   # ethtool -G i40e1 rx 2048
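
If tuned-adm is not available, Solution #1 can be approximated by
disabling the deep idle states directly via sysfs. A rough sketch
(state numbering and depth vary per platform, so check the latency
files above first):

   # Disable cpuidle states deeper than state1 on all CPUs
   for f in /sys/devices/system/cpu/cpu*/cpuidle/state[2-9]/disable; do
       echo 1 > $f
   done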

--Jesper

Thank you very much! Changing the CPU sleep behavior explained our 2nd
issue from above, the retransmissions and slower speeds when no
tracing/profiling was attached.

We are still at a loss as to the difference in the number of IRQs used
between XDP & bridge mode, but performance is now aligned with our
expectations. In rechecking these numbers after tuning the CPU with
tuned-adm, we did notice that XDP generates roughly 10x the number of
hard IRQs compared to non-XDP bridge mode, but only on one
interrupt/core. See:


Good to hear the tuned-adm trick worked for you.

Non-XDP Bridge
$ sudo hardirqs -C 10 1

Assume this is 1 sec sampling.

Tracing hard irq events... Hit Ctrl-C to end.
HARDIRQ                    TOTAL_count
[...]
i40e-enp1s0f1-TxRx-1     118820


Kind of weird that enp1s0f0 doesn't show up.

Processing 10 Gbit/s with 1500-byte MTU packets means 833,333 packets per sec will be arriving (10*10^9/8/1500).

Thus, NAPI poll must be doing some bulk processing, as your hardirqs are 118,820/sec.
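
Back-of-envelope, under the 1 sec sampling assumption (my arithmetic,
not measured):

   echo $(( 10**10 / 8 / 1500 ))   # ~833333 pkt/s at 10G with 1500B packets
   echo $(( 833333 / 118820 ))     # => ~7 packets handled per hardirq on average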

I have a number of "napi_monitor" tools (that I've implemented myself) to see what level of bulking you get.
The @napi_bulk output from this bpftrace script shows you bulking:

https://github.com/xdp-project/xdp-project/blob/master/areas/latency/napi_monitor.bt

The script does much more (latency measurements), so you likely just want to use this one-liner:

bpftrace -e 'tracepoint:napi:napi_poll { @napi_bulk=lhist(args->work, 0, 64, 4); }'



XDP (same network flow)
$ sudo hardirqs -C 10 1
Tracing hard irq events... Hit Ctrl-C to end.
HARDIRQ                    TOTAL_count
[...]
i40e-enp1s0f0-TxRx-2       79071
i40e-enp1s0f0-TxRx-1     106929
i40e-enp1s0f1-TxRx-2     993162
i40e-enp1s0f1-TxRx-1     108362


Ignoring the weird RX-queue spread.

Only looking at i40e-enp1s0f1-TxRx-2 with 993,162 IRQ/sec.
This is way too high, as we just learned that 833Kpps should be the packet rate. As this is a TCP flow going through the bridging box, we likely also need to account for the ACK packets flowing in the other direction, so we can get above the 833Kpps.

This could indicate that the NAPI loop only processes a single packet at a time, which is weird.

Could you check what NAPI is doing, again with the one-liner:

bpftrace -e 'tracepoint:napi:napi_poll { @napi_bulk=lhist(args->work, 0, 64, 4); }'


Is it possible that we are seeing hard interrupts from both the RX &
TX packets under XDP?

NIC drivers often try to do the work of the TX DMA completion
interrupts while processing RX packets anyhow, thus the TX IRQ count is
often significantly lower when this happens.

In non-XDP, we notice that we are only seeing
one network interface producing hard interrupts and we are assuming
that the other interface must be serviced fully by polling.

That is kind of weird, not seeing any IRQs for enp1s0f0.

You should use this improved version of the one-liner to also get ifnames:

sudo bpftrace -e 'tracepoint:napi:napi_poll { @napi_bulk[str(args->dev_name)]=lhist(args->work, 0, 64, 4); }'

--Jesper



