On Fri, Sep 23, 2022 at 11:55 AM Jesper Dangaard Brouer <jbrouer@xxxxxxxxxx> wrote:
>
>
> On 20/09/2022 19.37, Adam Smith wrote:
> > (answered inline, below)
> >
> > On Tue, Sep 20, 2022 at 3:17 AM Jesper Dangaard Brouer
> > <jbrouer@xxxxxxxxxx> wrote:
> >>
> >> (answered inline, below)
> >>
> >> On 19/09/2022 22.55, Adam Smith wrote:
> >>> Hello,
> >>>
> >>> In trying to understand the differences in IRQ utilization and
> >>> throughput when performing XDP_REDIRECT in a simple netfilter bridge
> >>> on the Intel i40e, we have encountered behavior we are unable to
> >>> explain and we would like advice on where to investigate next.
> >>>
> >>> The two questions we are seeking guidance on are:
> >>> 1) Why does XDP in the i40e driver handle interrupts on multiple IRQs,
> >>> while the same flows are serviced by a single IRQ without XDP
> >>> (netfilter bridge)?
> >>>
> >>
> >> Remember IRQ smp affinity is configurable via /proc/irq/ files.
> >> The bash code below simply uses the queue number as the assigned CPU
> >> number.
> >>
> >>  echo " --- Align IRQs: i40e ---"
> >>  # i40e uses the driver name as a starting prefix, making it easier to "catch"
> >>  for F in /proc/irq/*/i40e*-TxRx-*/../smp_affinity_list; do
> >>    # Extract irqname e.g. "i40e-eth2-TxRx-1"
> >>    irqname=$(basename $(dirname $(dirname $F))) ;
> >>    # Substring pattern removal to extract Q-number
> >>    hwq_nr=${irqname#*-*-*-}
> >>    echo $hwq_nr > $F
> >>    #grep . -H $F;
> >>  done
> >>
> >> Thus we get this one-to-one mapping of Q-to-CPU number:
> >>
> >>  $ grep -H . /proc/irq/*/i40e*-TxRx-*/../smp_affinity_list
> >>  /proc/irq/218/i40e-i40e1-TxRx-0/../smp_affinity_list:0
> >>  /proc/irq/219/i40e-i40e1-TxRx-1/../smp_affinity_list:1
> >>  /proc/irq/220/i40e-i40e1-TxRx-2/../smp_affinity_list:2
> >>  /proc/irq/221/i40e-i40e1-TxRx-3/../smp_affinity_list:3
> >>  /proc/irq/222/i40e-i40e1-TxRx-4/../smp_affinity_list:4
> >>  /proc/irq/223/i40e-i40e1-TxRx-5/../smp_affinity_list:5
> >>  /proc/irq/224/i40e-0000:04:00.0:fdir-TxRx-0/../smp_affinity_list:0
> >
> > Apologies, I should have mentioned that IRQ affinity was already
> > pinned via the recommended set_irq_affinity script from Intel driver
> > tools.
>
> It sounds like a bug (Cc Magnus) if you had pinned the IRQ affinity and
> traffic still migrates around anyway.
>
> There are two possible setup issues that could cause this:
>
>  (#1) When loading XDP some drivers reset too much of the link
>  config (like ixgbe), which could lead to smp_affinity getting reset
>  to defaults. (Check the setting with the grep -H . above.)

Verified that smp_affinity is not being reset.

>  (#2) The i40e NIC's hardware ATR/Flow-Director could be the one that
>  moves traffic in unexpected ways. You can disable it via ethtool by
>  running this command:
>
>   ethtool -K i40e1 ntuple-filters off
>
> (Cc Federico as they mention ATR/Flow-Director in their paper)

Disabling ntuple-filters did not change the behavior.
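On some i40e driver versions the ATR half of Flow Director is also exposed
as an ethtool private flag, separate from the ntuple feature bit, so it may
be worth ruling it out explicitly. A sketch; the flag name
"flow-director-atr" is an assumption and may differ between driver builds,
so list the flags first:

  # Show the driver's private flags, then clear ATR only if it is listed
  ethtool --show-priv-flags enp1s0f1
  ethtool --set-priv-flags enp1s0f1 flow-director-atr off

If the flag is absent or already off, ATR can likely be excluded as the
cause of the queue migration.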
> >>> 2) Why does the i40e driver with XDP under load seemingly get faster
> >>> when tracing is attached to functions inside the driver’s napi_poll
> >>> loop?
> >>
> >> My theory is: Because you keep the CPU from going into sleep states.
> >>
> >>> Our working theory is that the i40e driver is not as efficient in
> >>> interrupt handling when XDP is enabled. Something in napi_poll is
> >>> looping too aggressively, and, when artificially slowed by attaching
> >>> to various kprobes and tracepoints, the slightly delayed code becomes
> >>> more efficient.
> >>>
> >>> Testing setup:
> >>>
> >>
> >> So, the test setup is basically a forwarding scenario using bridging.
> >> (It reminds me, we should add BPF bridge FIB lookup helpers... Cc lorenzo)
> >>
> >>> Without XDP, our iperf3 test utilizes almost 100% CPU on a single core
> >>> to achieve approximately 9.42 Gbits/sec. Total hard IRQs over 10
> >>> seconds are as follows:
> >>>   i40e-enp1s0f1-TxRx-1   127k
> >>> Iperf3 retransmissions are roughly 0.
> >>
> >> The key here is that your test utilizes almost 100% CPU on a single
> >> core. From this info I know that the CPU isn't going into deep sleep
> >> states.
> >>
> >>> With simple XDP_REDIRECT programs installed on both interfaces, CPU
> >>> usage drops to ~43% on two different cores (one significantly higher
> >>> than the other), and hard IRQs over 10 seconds are as follows:
> >>>   i40e-enp1s0f0-TxRx-1   169k
> >>>   i40e-enp1s0f0-TxRx-2    82k
> >>
> >> To avoid the jumping between IRQs, you should configure the smp_affinity
> >> as described above, BUT it will not solve the drop issue.
> >
> > As stated above, IRQs were pinned, which is what led us to question
> > the difference between XDP & the Linux bridge.
> >
> >>>   i40e-enp1s0f1-TxRx-1   147k
> >>>   i40e-enp1s0f1-TxRx-2   235k
> >>> Throughput in this case is only ~8.75 Gbits/sec, and iperf3
> >>> retransmissions number between 1k and 3k consistently.
> >>
> >> The XDP redirect is so fast that the CPU is bored and decides to dive
> >> into deep sleep state levels. If the time it takes to wake up again +
> >> the overhead of starting NAPI (hardirq->softirq) is too long, then
> >> packets will be dropped due to overflowing the hardware RX-queue.
> >>
> >> You can directly see the time/latency it takes to wake up from these
> >> sleep states on your hardware with this grep command:
> >>
> >>  $ grep -H . /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
> >>  /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
> >>  /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:2
> >>  /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:10
> >>  /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:40
> >>  /sys/devices/system/cpu/cpu0/cpuidle/state4/latency:133
> >>
> >> As explained in [1] you can calculate back how many bytes are able to
> >> arrive at a given link speed when sleeping e.g. 133 usec, and then based
> >> on the expected packet size figure out if the default 512-slot RX-queue
> >> for i40e is large enough.
> >>
> >> [1]
> >> https://github.com/torvalds/linux/blob/v6.0-rc6/samples/bpf/xdp_redirect_cpu_user.c#L331-L346
> >
> > RX-queue size was set to 4096 for our tests, which is the maximum
> > available on the X710.
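A rough, hedged illustration of the calculation referenced in [1], using
the 133 usec exit latency shown above (the 10 Gbit/s line rate and the
per-frame wire overheads are assumptions for this example):

  # Bytes that can arrive while one CPU exits its deepest C-state
  link_bps=10000000000   # assumed 10 Gbit/s line rate
  wakeup_usec=133        # deepest C-state exit latency from the cpuidle output above
  bytes=$(( link_bps / 8 * wakeup_usec / 1000000 ))
  echo "bytes during wakeup:        $bytes"               # ~166250
  echo "MTU-sized frames (1538B):   $(( bytes / 1538 ))"  # ~108
  echo "minimum-sized frames (84B): $(( bytes / 84 ))"    # ~1979

With MTU-sized frames this fits comfortably in the default 512-slot ring;
with small packets it would overflow 512 slots, though not the 4096 used in
these tests.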
> >>> When we use bpftrace to attach multiple BPF programs to i40e functions
> >>> involved in XDP (e.g., `bpftrace -e 'tracepoint:i40e:i40e_clean_rx_irq
> >>> {} kprobe:i40e_xmit_xdp_ring {}'`), retransmissions drop to 0,
> >>> throughput increases to 9.4 Gbits/sec, and CPU utilization on the
> >>> busier CPU increases to ~73%. Hard IRQs are similar to the
> >>> XDP_REDIRECT IRQs above.
> >>>
> >>> Attaching traces should not logically result in a throughput increase.
> >>>
> >>> Any insight or guidance would be greatly appreciated!
> >>
> >> Solution #1: The sysadmin can configure the system to avoid deep sleep via:
> >>
> >>  # tuned-adm profile network-latency
> >>
> >> Solution #2: This can be combined with increasing the RX-queue size via:
> >>
> >>  # ethtool -G i40e1 rx 2048
> >>
> >> --Jesper
> >
> > Thank you very much! Changing CPU sleep behaviors explained our 2nd
> > issue from above with retransmissions and slower speeds without
> > profiling attached.
> >
> > We are still at a loss as to the differences in number of IRQs used
> > between XDP & bridge mode, but performance is now aligned with our
> > expectations. In rechecking these numbers after tuning the CPU with
> > tuned-adm, we did notice that XDP generates roughly 10x the number of
> > hard IRQs compared to non-XDP bridge mode, but only on one
> > interrupt/core. See:
> >
>
> Good to hear the tuned-adm trick worked for you.
>
> > Non-XDP Bridge
> > $ sudo hardirqs -C 10 1
>
> Assume this is 1 sec sampling.

This is a 10 second sample window printed 1 time.

> > Tracing hard irq events... Hit Ctrl-C to end.
> > HARDIRQ                    TOTAL_count
> > [...]
> > i40e-enp1s0f1-TxRx-1       118820
> >
>
> Kind of weird that enp1s0f0 doesn't show up.

It is worth mentioning that enp1s0f1 is the egress interface, suggesting
that these interrupts occur on transmit. I agree that it seems odd that
enp1s0f0 doesn't at least show similar transmit activity for the returning
ACKs.

> Processing 10Gbit/s with MTU 1500 byte packets means 833,333 packets per
> sec will be arriving (10*10^9/8/1500).
>
> Thus, the NAPI poll must be doing some bulk processing, as your hardirqs
> are 118,820/sec.

Correcting for the time factor, it's 11882/sec.

> I have a number of "napi_monitor" tools (I've implemented myself) to see
> what level of bulking you get.
> The @napi_bulk output from this bpftrace script shows you the bulking:
>
> https://github.com/xdp-project/xdp-project/blob/master/areas/latency/napi_monitor.bt
>
> The script does much more (latency measurements), so you likely just want
> to use this oneliner:
>
>  bpftrace -e 'tracepoint:napi:napi_poll { @napi_bulk=lhist(args->work,
> 0, 64, 4); }'
>
> > XDP (same network flow)
> > $ sudo hardirqs -C 10 1
> > Tracing hard irq events... Hit Ctrl-C to end.
> > HARDIRQ                    TOTAL_count
> > [...]
> > i40e-enp1s0f0-TxRx-2        79071
> > i40e-enp1s0f0-TxRx-1       106929
> > i40e-enp1s0f1-TxRx-2       993162
> > i40e-enp1s0f1-TxRx-1       108362
> >
>
> Ignoring the weird RX-queue spread.
>
> Only looking at i40e-enp1s0f1-TxRx-2 with 993,162 IRQ/sec.
> This is way too high, as we just learned 833Kpps should be the packet
> numbers. As this is a TCP flow going through the bridging box, we
> likely also need to account for the ACK packets flowing in the other
> direction, thus we can get above the 833Kpps.
>
> This could indicate that the NAPI loop only processes a single packet at
> a time, which is weird.
>
> Could you check what NAPI is doing, again with the oneliner:
>
>  bpftrace -e 'tracepoint:napi:napi_poll { @napi_bulk=lhist(args->work,
> 0, 64, 4); }'
>
> > Is it possible that we are seeing hard interrupts from both the RX &
> > TX packets under XDP?
>
> NIC drivers often try to do the work of the TX DMA completion interrupts
> while processing RX packets anyway, thus the TX IRQ count is often
> significantly lower if this happens.

As mentioned above, enp1s0f1 is the egress interface. We see approximately
a 10 to 1 ratio of transmitted packets to return ACK packets, which seems
to line up with the interrupt counters we are seeing.
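A hedged way to cross-check these per-queue hard IRQ rates per second,
independent of the hardirqs tool (assuming the irq:irq_handler_entry
tracepoint is available on this kernel):

  # Count hard IRQs by handler name and print/reset the counts once per second
  sudo bpftrace -e 'tracepoint:irq:irq_handler_entry { @hardirqs[str(args->name)] = count(); }
                    interval:s:1 { print(@hardirqs); clear(@hardirqs); }'

Comparing the per-second counts against the napi_poll bulking histogram
makes it easier to see whether a given queue really is taking close to one
interrupt per packet.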
Repeated from above for clarity, with added annotations:

XDP (same network flow)
$ sudo hardirqs -C 10 1
Tracing hard irq events... Hit Ctrl-C to end.
HARDIRQ                    TOTAL_count
[...]
i40e-enp1s0f0-TxRx-2        79071  *[should be data receive if XDP is
                                     respecting CPU affinity for transmit]
i40e-enp1s0f0-TxRx-1       106929  *[should be ACK transmitting if XDP is
                                     respecting CPU affinity for transmit]
i40e-enp1s0f1-TxRx-2       993162  *[assumed iperf data transmit]
i40e-enp1s0f1-TxRx-1       108362  *[assumed ACK responses]

TSO & GSO are both disabled, so we are unsure why there is asymmetry on
TxRx-2 between the two interfaces.

> > In non-XDP, we notice that we are only seeing
> > one network interface producing hard interrupts and we are assuming
> > that the other interface must be serviced fully by polling.
>
> That is kind of weird, not seeing any IRQs for enp1s0f0.
>
> You should use this improved version of the oneliner to also get ifnames:
>
>  sudo bpftrace -e 'tracepoint:napi:napi_poll {
> @napi_bulk[str(args->dev_name)]=lhist(args->work, 0, 64, 4); }'
>
> --Jesper

We ran the above bpftrace command both with and without XDP enabled
(bar charts removed):

XDP
------------------------------
@napi_bulk[enp1s0f0]:
[0, 4)     91335
[4, 8)        90
[8, 12)    43776
[12, 16)      31
[16, 20)   47939
[20, 24)      23
[24, 28)   10105
[28, 32)      28
[32, 36)    2350
[36, 40)      14
[40, 44)    1021
[44, 48)      12
[48, 52)     586
[52, 56)      24
[56, 60)    2285
[60, 64)      16
[64, 68)  110876

@napi_bulk[enp1s0f1]:
[0, 4)   1177723
[4, 8)        82
[8, 12)        7
[12, 16)       1
[16, 20)       3
[20, 24)       2
[24, 28)       1
[28, 32)       1
[32, 36)       0
[36, 40)       0
[40, 44)       0
[44, 48)       0
[48, 52)       1
[52, 56)       0
[56, 60)       0
[60, 64)       0
[64, 68)      41

No XDP (Linux bridge only)
-----------------------------
@napi_bulk[enp1s0f0]:
[64, 68)  122992

@napi_bulk[enp1s0f1]:
[0, 4)      2800
[4, 8)      1565
[8, 12)    91207
[12, 16)     820
[16, 20)   26599

*As a side note, @napi_bulk[str(args->dev_name)] wasn't recognized as a
valid arg by bpftrace, despite showing up in the format for the tracepoint,
so we used args->napi->dev->name for the key. I'm not sure what I'm missing
there, but it seemed worth mentioning.

Thanks!
Adam Smith
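For reference, the working variant of the oneliner mentioned in the side
note above was presumably along these lines (a sketch; it assumes bpftrace
can resolve struct napi_struct through BTF or kernel headers):

  # Same bulking histogram, keyed by the netdev name reached via the napi struct
  sudo bpftrace -e 'tracepoint:napi:napi_poll {
    @napi_bulk[str(args->napi->dev->name)] = lhist(args->work, 0, 64, 4); }'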