On Tue, Jul 12, 2022 at 11:19:11PM +0200, Toke Høiland-Jørgensen wrote:
> Adam Smith <hrotsvit@xxxxxxxxx> writes:
>
> > Hello,
> >
> > I have a question regarding bpf_redirect/bpf_redirect_map and latency
> > that we are seeing in a test. The environment is as follows:
> >
> > - Debian Bullseye, running the 5.18.0-0.bpo.1-amd64 kernel from
> >   bullseye-backports (also tested on 5.16)
> > - Intel Xeon X3430 @ 2.40 GHz, 4 cores, no HT
> > - Intel X710-DA2 using the i40e driver included with the kernel
> > - Both interfaces (enp1s0f0 and enp1s0f1) in a simple netfilter bridge
> > - Ring parameters for rx/tx are both set to the maximum of 4096, with
> >   no other NIC-specific parameters changed
> >
> > Each interface has 4 combined IRQs, pinned via set_irq_affinity.
> > `irqbalance` is not installed.
> >
> > Traffic is generated by a directly attached machine via iperf3 3.9
> > (`iperf3 -c 192.168.1.3 -t 0 --bidir`) to a directly attached server
> > on the other side.
> >
> > The server in question does nothing more than forward packets as a
> > transparent bridge.
> >
> > An XDP program is installed on f0 to redirect to f1, and on f1 to
> > redirect to f0. I have tried programs that simply call
> > `bpf_redirect()`, as well as programs that share a device map and
> > call `bpf_redirect_map()`, with identical results.
> >
> > When the channel parameters for each interface are reduced to a
> > single IRQ via `ethtool -L enp1s0f0 combined 1`, and both interfaces'
> > IRQs are bound to the same CPU core via smp_affinity, XDP produces an
> > improved bitrate with reduced CPU utilization compared to the non-XDP
> > tests:
> >
> > - Stock netfilter bridge: 9.11 Gbps in both directions at 98%
> >   utilization of the pinned core.
> > - XDP: approximately 9.18 Gbps in both directions at 50% utilization
> >   of the pinned core.
> >
> > However, when multiple cores are engaged (combined 4, with
> > set_irq_affinity), XDP processes markedly fewer packets per second
> > (950,000 vs. approximately 1.6 million). iperf3 also shows a large
> > number of retransmissions in its output regardless of CPU engagement
> > (approximately 6,500 with XDP over 2 minutes vs. 850 with the
> > single-core tests).
> >
> > This is a sample taken from the linux/samples xdp_monitor tool
> > showing redirection and transmission of packets with XDP engaged:
> >
> > Summary                    944,508 redir/s     0 err,drop/s   944,506 xmit/s
> >   kthread                        0 pkt/s       0 drop/s            0 sched
> > redirect total             944,508 redir/s
> >   cpu:0                    470,148 redir/s
> >   cpu:2                     15,078 redir/s
> >   cpu:3                    459,282 redir/s
> > redirect_err                     0 error/s
> > xdp_exception                    0 hit/s
> > devmap_xmit total          944,506 xmit/s      0 drop/s            0 drv_err/s
> >   cpu:0                    470,148 xmit/s      0 drop/s            0 drv_err/s
> >   cpu:2                     15,078 xmit/s      0 drop/s            0 drv_err/s
> >   cpu:3                    459,280 xmit/s      0 drop/s            0 drv_err/s
> > xmit enp1s0f0->enp1s0f1    485,249 xmit/s      0 drop/s            0 drv_err/s
> >   cpu:0                    470,172 xmit/s      0 drop/s            0 drv_err/s
> >   cpu:2                     15,078 xmit/s      0 drop/s            0 drv_err/s
> > xmit enp1s0f1->enp1s0f0    459,263 xmit/s      0 drop/s            0 drv_err/s
> >   cpu:3                    459,263 xmit/s      0 drop/s            0 drv_err/s
> >
> > Our current hypothesis is that this is a CPU affinity issue. We
> > believe a different core is being used for transmission. In an effort
> > to prove this, how can we measure whether bpf_redirect() is causing
> > packets to be transmitted by a different core than the one they were
> > received on? We are still trying to understand how bpf_redirect()
> > selects which core/IRQ to transmit on and would appreciate any
> > insight or follow-up material to research.
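For illustration, a minimal pair of programs along the lines Adam describes could look like the sketch below. This is only a sketch: the ifindex value, the devmap layout, and the program names are assumptions made for the example, not taken from the actual setup.

// SPDX-License-Identifier: GPL-2.0
/* Sketch of the two redirect approaches described above. The ifindex and
 * map contents here are placeholders, not the poster's real values. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Device map shared by both interfaces; a loader would populate slot 0
 * with enp1s0f0's ifindex and slot 1 with enp1s0f1's ifindex. */
struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(max_entries, 2);
	__type(key, __u32);
	__type(value, __u32);
} tx_port SEC(".maps");

/* Variant 1: plain bpf_redirect() to a hard-coded ifindex. */
SEC("xdp")
int xdp_redirect_f0_to_f1(struct xdp_md *ctx)
{
	/* 3 is a placeholder for enp1s0f1's ifindex. */
	return bpf_redirect(3, 0);
}

/* Variant 2: bpf_redirect_map() through the shared devmap. */
SEC("xdp")
int xdp_redirect_map_f0_to_f1(struct xdp_md *ctx)
{
	/* Slot 1 is assumed to hold enp1s0f1. */
	return bpf_redirect_map(&tx_port, 1, 0);
}

char _license[] SEC("license") = "GPL";

Attaching the first (or second) program to enp1s0f0, and a mirror of it that targets the other ifindex/map slot to enp1s0f1, gives the f0<->f1 forwarding setup described above.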
>
> There is no mechanism in bpf_redirect() to switch CPUs (outside of
> cpumap). When you call XDP_REDIRECT, the frame will be added to a
> per-device, per-CPU flush list, which will be flushed on that same CPU.
> The i40e driver allocates separate rings for XDP, though, and I'm not
> sure how it does that, so maybe those are what's missing. You should be
> able to see drops in the output if that's what's going on; and the
> packets should still be processed by XDP.
>
> So it sounds more like the hardware configuration is causing packet
> loss before it even hits XDP. Do you see anything in the ethtool stats
> that might explain where packets are being dropped?

I don't know exactly which CPUs the IRQs end up bound to, but most
probably this is a driver issue, as Toke says. i40e_xdp_xmit() uses
smp_processor_id() as the index into the XDP rings array, so if you limit
the queue count to 4 and bind an IRQ to, say, CPU 10, the transmit
returns -ENXIO because queue_index will be >= vsi->num_queue_pairs.

I believe such issues have been addressed in the ice driver: there, the
XDP rings array is sized to num_possible_cpus() regardless of the user's
queue count setting, so smp_processor_id() can be used safely.

Adam, could you skip the `ethtool -L $IFACE combined 4` step and work
with your 4 flows to see if there is any difference?

Maciej

>
> -Toke
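For reference, the check described above sits in the i40e driver's ndo_xdp_xmit path. The snippet below is an abridged, paraphrased sketch of that part of drivers/net/ethernet/intel/i40e/i40e_txrx.c, not a verbatim or complete copy, and the surrounding code may differ between kernel versions.

/* Abridged sketch of i40e_xdp_xmit(); error handling and the actual
 * transmit loop are omitted. */
int i40e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
		  u32 flags)
{
	struct i40e_netdev_priv *np = netdev_priv(dev);
	unsigned int queue_index = smp_processor_id(); /* CPU id picks the XDP Tx ring */
	struct i40e_vsi *vsi = np->vsi;

	/* ... */

	/* If the redirect runs on a CPU with no matching XDP Tx ring
	 * (queue count limited below that CPU's number), the transmit
	 * fails and the frames are dropped. */
	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
		return -ENXIO;

	/* ... frames[0..n-1] are queued on vsi->xdp_rings[queue_index] ... */
}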