Re: XDP redirect throughput with multi-CPU i40e

Hi,

Maciej - in this particular situation, `combined 4` was selected
because the CPU being used only has 4 cores, and 4 is what the driver
auto-selects upon boot as well.

Toke - we are seeing drops from `port.rx_dropped` on both interfaces:

1 IRQ, same CPU, no XDP -- port.rx_dropped: 0 pps / interface
1 IRQ, same CPU, XDP_REDIRECT -- port.rx_dropped: appx. 50-75 pps / interface
4 IRQ, no XDP -- port.rx_dropped: appx. 25-50 pps / interface
4 IRQ, XDP_REDIRECT -- port.rx_dropped: appx. 2000 pps / interface

`rx_dropped` remains 0 in all cases.

Of note: with the 4-IRQ setup and no XDP, CPU load shows up on 2 cores,
corresponding to the IRQs handling the 2 primary traffic flows generated
by bidirectional iperf3 (a byproduct of RSS). With XDP, the load on those
two cores drops significantly, but we see increased load on a 3rd core.

Thanks!
Adam

On Wed, Jul 13, 2022 at 5:17 AM Maciej Fijalkowski
<maciej.fijalkowski@xxxxxxxxx> wrote:
>
> On Tue, Jul 12, 2022 at 11:19:11PM +0200, Toke Høiland-Jørgensen wrote:
> > Adam Smith <hrotsvit@xxxxxxxxx> writes:
> >
> > > Hello,
> > >
> > > I have a question regarding bpf_redirect/bpf_redirect_map and latency
> > > that we are seeing in a test. The environment is as follows:
> > >
> > > - Debian Bullseye, running 5.18.0-0.bpo.1-amd64 kernel from
> > > Bullseye-backports (Also tested on 5.16)
> > > - Intel Xeon X3430 @ 2.40GHz. 4 cores, no HT
> > > - Intel X710-DA2 using i40e driver included with the kernel.
> > > - Both interfaces (enp1s0f0 and enp1s0f1) in a simple netfilter bridge.
> > > - Ring parameters for rx/tx are both set to the max of 4096, with no
> > > other nic-specific parameters changed.
> > >
> > > Each interface has 4 combined IRQs, pinned per set_irq_affinity.
> > > `irqbalanced` is not installed.
> > >
> > > Traffic is generated by another directly attached machine via iperf3
> > > 3.9 (`iperf3 -c 192.168.1.3 -t 0 --bidir`) to a directly attached
> > > server on the other side.
> > >
> > > The server in question does nothing more than forward packets as a
> > > transparent bridge.
> > >
> > > An XDP program is installed on f0 to redirect to f1, and f1 to
> > > redirect to f0. I have tried programs that simply call
> > > `bpf_redirect()`, as well as programs that share a device map and call
> > > `bpf_redirect_map()`, with identical results.
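> > >
> > > For reference, the redirect program is essentially of the following
> > > shape. This is a minimal sketch rather than the exact program we run;
> > > the map name and the key/value layout (rx ifindex as key, peer tx
> > > ifindex as value, populated from user space) are illustrative:
> > >
> > > #include <linux/bpf.h>
> > > #include <bpf/bpf_helpers.h>
> > >
> > > /* devmap keyed by the receiving ifindex; the value is the ifindex
> > >  * of the peer interface to transmit on (filled in from user space) */
> > > struct {
> > >         __uint(type, BPF_MAP_TYPE_DEVMAP);
> > >         __uint(max_entries, 64);
> > >         __type(key, __u32);
> > >         __type(value, __u32);
> > > } tx_port SEC(".maps");
> > >
> > > SEC("xdp")
> > > int xdp_redirect_peer(struct xdp_md *ctx)
> > > {
> > >         /* look up the peer device by our own ifindex and redirect;
> > >          * a lookup miss aborts the packet */
> > >         return bpf_redirect_map(&tx_port, ctx->ingress_ifindex, 0);
> > > }
> > >
> > > char _license[] SEC("license") = "GPL";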
> > >
> > > When channel parameters for each interface are reduced to a single IRQ
> > > via `ethtool -L enp1s0f0 combined 1`, and both interface IRQs are
> > > bound to the same CPU core via smp_affinity, XDP produces improved
> > > bitrate with reduced CPU utilization over non-XDP tests:
> > > - Stock netfilter bridge: 9.11 Gbps in both directions at 98%
> > > utilization of pinned core.
> > > - XDP: Approximately 9.18 Gbps in both directions at 50% utilization
> > > of pinned core.
> > >
> > > However, when multiple cores are engaged (combined 4, with
> > > set_irq_affinity), XDP processes markedly fewer packets per second
> > > (950,000 vs approximately 1.6 million). iperf3 also shows a large
> > > number of retransmissions in its output regardless of CPU engagement
> > > (approximately 6,500 with XDP over 2 minutes vs 850 with single core
> > > tests).
> > >
> > > This is a sample taken from linux/samples xdp_monitor showing
> > > redirection and transmission of packets with XDP engaged:
> > >
> > > Summary                    944,508 redir/s    0 err,drop/s   944,506 xmit/s
> > >   kthread                        0 pkt/s      0 drop/s             0 sched
> > >   redirect total           944,508 redir/s
> > >     cpu:0                  470,148 redir/s
> > >     cpu:2                   15,078 redir/s
> > >     cpu:3                  459,282 redir/s
> > >   redirect_err                   0 error/s
> > >   xdp_exception                  0 hit/s
> > >   devmap_xmit total        944,506 xmit/s     0 drop/s       0 drv_err/s
> > >     cpu:0                  470,148 xmit/s     0 drop/s       0 drv_err/s
> > >     cpu:2                   15,078 xmit/s     0 drop/s       0 drv_err/s
> > >     cpu:3                  459,280 xmit/s     0 drop/s       0 drv_err/s
> > >   xmit enp1s0f0->enp1s0f1  485,249 xmit/s     0 drop/s       0 drv_err/s
> > >     cpu:0                  470,172 xmit/s     0 drop/s       0 drv_err/s
> > >     cpu:2                   15,078 xmit/s     0 drop/s       0 drv_err/s
> > >   xmit enp1s0f1->enp1s0f0  459,263 xmit/s     0 drop/s       0 drv_err/s
> > >     cpu:3                  459,263 xmit/s     0 drop/s       0 drv_err/s
> > >
> > > Our current hypothesis is that this is a CPU affinity issue. We
> > > believe a different core is being used for transmission. In an effort
> > > to prove this, how can we measure whether bpf_redirect() is
> > > causing packets to be transmitted by a different core than they were
> > > received by? We are still trying to understand how bpf_redirect()
> > > selects which core/IRQ to transmit on and would appreciate any insight
> > > or followup material to research.
> >
> > There is no mechanism in bpf_redirect() to switch CPUs (outside of
> > cpumap). When your program returns XDP_REDIRECT, the frame is added to
> > a per-device, per-CPU flush list, which is then flushed on that same
> > CPU. The i40e driver does allocate separate rings for XDP, though, and
> > I'm not sure exactly how it does that, so maybe those are what's
> > missing. You should be able to see drops in the output if that's
> > what's going on, and the packets should still be processed by XDP.
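> >
> > (Very roughly, the flow looks like the pseudo-code below. This is a
> > paraphrase, not actual kernel source, and the helper names are made up
> > for illustration; the point is just that enqueue and flush both happen
> > on the RX CPU:)
> >
> > /* pseudo-code sketch; helper names are invented for illustration */
> > static void napi_poll_sketch(struct rx_ring *ring)
> > {
> >         struct xdp_frame *frame;
> >
> >         while ((frame = receive_one_frame(ring))) {
> >                 if (run_xdp_prog(frame) == XDP_REDIRECT)
> >                         enqueue_on_this_cpus_flush_list(frame);
> >         }
> >
> >         /* still on the same CPU: drains the per-device per-CPU bulk
> >          * queues via the target device's ndo_xdp_xmit() */
> >         flush_this_cpus_redirect_list();
> > }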
> >
> > So it sounds more like the hardware configuration is causing packet loss
> > before it even hits XDP. Do you see anything in the ethtool stats that
> > might explain where packets are being dropped?
>
> I don't know exactly how the IRQs are bound to which CPUs, but most
> probably this is a driver issue, as Toke is saying.
>
> i40e_xdp_xmit() uses smp_processor_id() as the index into the XDP rings
> array, so if you limit the queue count to 4 and bind an IRQ to, say, CPU
> 10, the call returns -ENXIO because queue_index will be >=
> vsi->num_queue_pairs.
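>
> The relevant fragment looks roughly like this (paraphrased from
> i40e_txrx.c; the exact code varies by kernel version):
>
>         /* in i40e_xdp_xmit() */
>         unsigned int queue_index = smp_processor_id();
>
>         if (!i40e_enabled_xdp_vsi(vsi) ||
>             queue_index >= vsi->num_queue_pairs)
>                 return -ENXIO;  /* CPU id beyond the configured queues */
>
>         xdp_ring = vsi->xdp_rings[queue_index];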
>
> I believe such issues were addressed in the ice driver. There, the XDP
> rings array is sized to num_possible_cpus() regardless of the user's
> queue count setting, so smp_processor_id() can be used safely.
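>
> Roughly the idea (illustrative only, not the exact ice source):
>
>         /* allocate one XDP Tx ring per possible CPU instead of per
>          * configured queue pair, so smp_processor_id() always lands
>          * on a valid ring */
>         vsi->num_xdp_txq = num_possible_cpus();
>         vsi->xdp_rings = kcalloc(vsi->num_xdp_txq,
>                                  sizeof(*vsi->xdp_rings), GFP_KERNEL);
>
>         /* later, in the xmit path */
>         xdp_ring = vsi->xdp_rings[smp_processor_id()];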
>
> Adam, could you skip the `ethtool -L $IFACE combined 4` and work with your
> 4 flows to see if there is any difference?
>
> Maciej
>
> >
> > -Toke
> >



