Re: XDP redirect throughput with multi-CPU i40e

Adam Smith <hrotsvit@xxxxxxxxx> writes:

> Hello,
>
> I have a question regarding bpf_redirect/bpf_redirect_map and latency
> that we are seeing in a test. The environment is as follows:
>
> - Debian Bullseye, running 5.18.0-0.bpo.1-amd64 kernel from
> Bullseye-backports (Also tested on 5.16)
> - Intel Xeon X3430 @ 2.40GHz. 4 cores, no HT
> - Intel X710-DA2 using i40e driver included with the kernel.
> - Both interfaces (enp1s0f0 and enp1s0f1) in a simple netfilter bridge.
> - Ring parameters for rx/tx are both set to the max of 4096, with no
> other nic-specific parameters changed.
>
> Each interface has 4 combined IRQs, pinned using the set_irq_affinity
> script. `irqbalance` is not installed.
>
> Traffic is generated with iperf3 3.9 (`iperf3 -c 192.168.1.3 -t 0
> --bidir`) between a directly attached client and a directly attached
> server on the other side of the bridge.
>
> The server in question does nothing more than forward packets as a
> transparent bridge.
>
> An XDP program is installed on f0 to redirect to f1, and another on f1
> to redirect to f0. I have tried programs that simply call
> `bpf_redirect()`, as well as programs that share a device map and call
> `bpf_redirect_map()`, with identical results.
>
> When channel parameters for each interface are reduced to a single IRQ
> via `ethtool -L enp1s0f0 combined 1`, and both interface IRQs are
> bound to the same CPU core via smp_affinity, XDP produces improved
> bitrate with reduced CPU utilization over non-XDP tests:
> - Stock netfilter bridge: 9.11 Gbps in both directions at 98%
> utilization of pinned core.
> - XDP: Approximately 9.18 Gbps in both directions at 50% utilization
> of pinned core.
>
> However, when multiple cores are engaged (combined 4, with
> set_irq_affinity), XDP processes markedly fewer packets per second
> (950,000 vs approximately 1.6 million). iperf3 also shows a large
> number of retransmissions in its output regardless of CPU engagement
> (approximately 6,500 with XDP over 2 minutes vs 850 with single core
> tests).
>
> This is a sample taken from the xdp_monitor tool in linux/samples,
> showing redirection and transmission of packets with XDP engaged:
>
> Summary                     944,508 redir/s      0 err,drop/s   944,506 xmit/s
>   kthread                         0 pkt/s        0 drop/s             0 sched
>   redirect total            944,508 redir/s
>     cpu:0                   470,148 redir/s
>     cpu:2                    15,078 redir/s
>     cpu:3                   459,282 redir/s
>   redirect_err                    0 error/s
>   xdp_exception                   0 hit/s
>   devmap_xmit total         944,506 xmit/s       0 drop/s       0 drv_err/s
>     cpu:0                   470,148 xmit/s       0 drop/s       0 drv_err/s
>     cpu:2                    15,078 xmit/s       0 drop/s       0 drv_err/s
>     cpu:3                   459,280 xmit/s       0 drop/s       0 drv_err/s
>   xmit enp1s0f0->enp1s0f1   485,249 xmit/s       0 drop/s       0 drv_err/s
>     cpu:0                   470,172 xmit/s       0 drop/s       0 drv_err/s
>     cpu:2                    15,078 xmit/s       0 drop/s       0 drv_err/s
>   xmit enp1s0f1->enp1s0f0   459,263 xmit/s       0 drop/s       0 drv_err/s
>     cpu:3                   459,263 xmit/s       0 drop/s       0 drv_err/s
>
> Our current hypothesis is that this is a CPU affinity issue: we believe
> a different core is being used for transmission than for reception. To
> prove or disprove this, how can we measure whether bpf_redirect() is
> causing packets to be transmitted by a different core than they were
> received on? We are still trying to understand how bpf_redirect()
> selects which core/IRQ to transmit on and would appreciate any insight
> or follow-up material to research.
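
Just to make sure we are talking about the same thing: a minimal
devmap-based program of the kind you describe might look something like
the sketch below. The map name, key layout and section names are
placeholders for illustration (not your actual program), and userspace
is assumed to populate key 0 with the ifindex of the egress interface
before attaching the program to the ingress one:

/* Sketch: redirect every packet to the interface stored at key 0 of a
 * devmap. Placeholder names only. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
        __uint(type, BPF_MAP_TYPE_DEVMAP);
        __type(key, __u32);
        __type(value, __u32);
        __uint(max_entries, 1);
} tx_port SEC(".maps");

SEC("xdp")
int xdp_redirect_devmap(struct xdp_md *ctx)
{
        /* Returns XDP_REDIRECT on success, XDP_ABORTED if the map
         * lookup fails (e.g. key 0 not populated). */
        return bpf_redirect_map(&tx_port, 0, 0);
}

char _license[] SEC("license") = "GPL";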

There is no mechanism in bpf_redirect() to switch CPUs (outside of
cpumap). When you call XDP_REDIRECT, the frame is added to a per-device,
per-CPU flush list, which is then flushed on that same CPU. The i40e
driver does allocate separate TX rings for XDP, though, and I'm not sure
exactly how it does that, so maybe those are what's missing here. If
that were the case, however, you should be able to see drops in the
xdp_monitor output, and the packets would still have been processed by
XDP.
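
If you want to verify that on your setup, one option (just a sketch; it
assumes bpftrace is installed and that your kernel exposes the xdp
tracepoints, which your xdp_monitor output suggests it does) is to count
redirects and devmap transmits per CPU and compare the two maps:

  bpftrace -e 'tracepoint:xdp:xdp_redirect    { @redir[cpu] = count(); }
               tracepoint:xdp:xdp_devmap_xmit { @xmit[cpu]  = count(); }'

(Depending on kernel version, map redirects may show up under
xdp:xdp_redirect_map instead.) If the per-CPU counts for @redir and
@xmit line up, receive and transmit are happening on the same core,
which is also what the per-CPU lines in your xdp_monitor output already
indicate.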

So this sounds more like the hardware configuration is causing packet
loss before the traffic even hits XDP. Do you see anything in the
ethtool stats that might explain where packets are being dropped?
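
Something like the command below (run on both interfaces while the test
is running) is usually enough to spot them; the exact counter names are
driver-specific, so treat the grep pattern as a starting point rather
than an exhaustive filter:

  watch -d "ethtool -S enp1s0f0 | grep -iE 'drop|miss|err'"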

-Toke



