Re: [PATCH v8 bpf-next 00/14] mvneta: introduce XDP multi-buffer support

On Wed, Apr 21, 2021 at 2:48 PM Jesper Dangaard Brouer
<brouer@xxxxxxxxxx> wrote:
>
> On Tue, 20 Apr 2021 15:49:44 +0200
> Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:
>
> > On Mon, Apr 19, 2021 at 8:56 AM Lorenzo Bianconi
> > <lorenzo.bianconi@xxxxxxxxxx> wrote:
> > >
> > > > On Sun, Apr 18, 2021 at 6:18 PM Jesper Dangaard Brouer
> > > > <brouer@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Fri, 16 Apr 2021 16:27:18 +0200
> > > > > Magnus Karlsson <magnus.karlsson@xxxxxxxxx> wrote:
> > > > >
> > > > > > On Thu, Apr 8, 2021 at 2:51 PM Lorenzo Bianconi <lorenzo@xxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > This series introduces XDP multi-buffer support. The mvneta driver is
> > > > > > > the first to support these new "non-linear" xdp_{buff,frame}. Reviewers,
> > > > > > > please focus on how these new types of xdp_{buff,frame} packets
> > > > > > > traverse the different layers, and on the layout design. The BPF-helpers
> > > > > > > are deliberately kept simple, as we don't want to expose the internal
> > > > > > > layout and thereby preclude later changes to it.
> > > > > > >
> > > > > > > For now, to keep the design simple and to maintain performance, the XDP
> > > > > > > BPF-prog (still) only has access to the first buffer. It is left for
> > > > > > > later (another patchset) to add payload access across multiple buffers.
> > > > > > > This patchset should still allow for these future extensions. The goal
> > > > > > > is to lift the MTU restriction that comes with XDP while maintaining
> > > > > > > the same performance as before.
> > > > > [...]
> > > > > > >
> > > > > > > [0] https://netdevconf.info/0x14/session.html?talk-the-path-to-tcp-4k-mtu-and-rx-zerocopy
> > > > > > > [1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
> > > > > > > [2] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver (XDPmulti-buffers section)
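[Editorial note: as a minimal sketch of one possible way such a non-linear
xdp_buff can be carried, assuming a skb_shared_info-style frags array placed
in the tailroom of the first buffer; the helper names below are illustrative
and not necessarily the series' actual API.]

/* Illustrative only: extra buffers described by frags stored in the
 * tailroom of the first buffer, so the single-buffer fast path is
 * left untouched. */
#include <linux/skbuff.h>
#include <net/xdp.h>

static struct skb_shared_info *xdp_mb_sinfo(struct xdp_buff *xdp)
{
	/* tailroom of the first buffer, after data_end */
	return (struct skb_shared_info *)xdp_data_hard_end(xdp);
}

static void xdp_mb_add_frag(struct xdp_buff *xdp, struct page *page,
			    unsigned int offset, unsigned int len)
{
	struct skb_shared_info *sinfo = xdp_mb_sinfo(xdp);
	skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags++];

	__skb_frag_set_page(frag, page);
	skb_frag_off_set(frag, offset);
	skb_frag_size_set(frag, len);
}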
> > > > > >
> > > > > > I took your patches for a test run with the AF_XDP sample xdpsock on an
> > > > > > i40e card, and the throughput degradation is between 2 and 6% depending
> > > > > > on the setup and on which microbenchmark within xdpsock is executed. And
> > > > > > this is without sending any multi-frame packets, just single-frame
> > > > > > ones. Tirtha made changes to the i40e driver to support this new
> > > > > > interface, so those are included in the measurements.
> > > > >
> > > > > Could you please share Tirtha's i40e support patch with me?
> > > >
> > > > We will post them on the list as an RFC. Tirtha also added AF_XDP
> > > > multi-frame support on top of Lorenzo's patches, so we will send that
> > > > out as well. I will also rerun my experiments, document them properly,
> > > > and send them out, just to be sure that I did not make any mistakes.
> > >
> > > ack, very cool, thx
> >
> > I have now run a new set of experiments on a Cascade Lake server at
> > 2.1 GHz with turbo boost disabled. Two NICs: i40e and ice. The
> > baseline is commit 5c507329000e ("libbpf: Clarify flags in ringbuf
> > helpers"), and Lorenzo's and Eelco's patch set is their v8. First some
> > runs with xdpsock (i.e. AF_XDP) in both 2-core mode (app on one core
> > and the driver on another) and 1-core mode using busy_poll.
> >
> > xdpsock rxdrop throughput change with the multi-buffer patches without
> > any driver changes:
> > 1-core i40e: -0.5 to 0%   2-cores i40e: -0.5%
> > 1-core ice: -2%   2-cores ice: -1 to -0.5%
> >
> > xdp_rxq_info -a XDP_DROP
> > i40e: -4%   ice: +8%
> >
> > xdp_rxq_info -a XDP_TX
> > i40e: -10%   ice: +9%
> >
> > The XDP results with xdp_rxq_info are just weird! I reran them three
> > times, rebuilt and rebooted in between, and I always get the same
> > results. I also checked that I am running on the correct NUMA node,
> > and so on. But I have a hard time believing them: nearly +10% and -10%
> > difference is too much in my book. Jesper, could you please run the
> > same tests and see what you get?
>
> We of course have to find the root cause of the ±10%, but let me drill
> into what the 10% represents time/cycle-wise.  Using a percentage
> difference is usually a really good idea, as it implies a comparative
> measure (something I always request people to do, as a single
> performance number means nothing by itself).
>
> For zoom-in benchmarks like these, where the amount of code executed
> is very small, the effect of removing or adding code can affect the
> measurement a lot.
>
> I can only do the tests for i40e, as I don't have ice hardware (but
> Intel is working on fixing that ;-)).
>
>  xdp_rxq_info -a XDP_DROP
>   i40e: 33,417,775 pps

Here I only get around 21 Mpps.

>  The CPU is 100% used, so we can calculate the nanoseconds spent per packet:
>   29.92 nanosec (1/33417775*10^9)
>   2.1 GHz CPU = approx 63 CPU-cycles
>
>  You lost 4% performance in this case.  This corresponds to being:
>   1.2 nanosec (29.92*0.04) slower
>   (This could be the cost of a single function call's overhead, approx 1.3 ns)
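
[Editorial note: for reference, the same arithmetic as a tiny stand-alone
helper; the pps figure and the 2.1 GHz clock are the numbers from the run
quoted above.]

/* Convert a packets-per-second figure into ns/packet and an
 * approximate cycle count at a given CPU clock. */
#include <stdio.h>

int main(void)
{
	double pps = 33417775.0;	/* XDP_DROP rate from the run above */
	double ghz = 2.1;		/* CPU clock used in these runs */
	double ns_per_pkt = 1e9 / pps;	/* = 29.92 ns */

	printf("%.2f ns/pkt, ~%.0f cycles/pkt\n",
	       ns_per_pkt, ns_per_pkt * ghz);
	return 0;
}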
>
> My measurement for XDP_TX:
>
>  xdp_rxq_info -a XDP_TX
>   28,278,722 pps
>   35.36 ns (1/28278722*10^9)

And here I am much lower, at around 8 Mpps. I do see correct packets
coming back on the cable for i40e, but not for ice! There is likely a
bug in the XDP_TX logic for ice, which might explain the weird results
I am getting. I will investigate.

But why do I get only a fraction of your performance? XDP_TX touches
the packet, so I would expect it to be far less than what you get, but
more than what I get. What CPU core do you run on? It actually looks
like your packet data gets prefetched successfully. If it had not
been, you would have taken an LLC access, which is much more expensive
than the drop you are seeing. If I run on the wrong NUMA node, I get
4 Mpps, so it is not that.

One interesting thing is that I get better results using the zero-copy
path in the driver. I start xdp_rxq_drop, then tie an AF_XDP socket to
the queue id that the XDP program gets its traffic from. The AF_XDP
socket will get no traffic in this case, but it will force the driver
to use the zero-copy path for its XDP processing. In this case I get
the following for i40e:

-0.5% for XDP_DROP and ±0% for XDP_TX.
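
[Editorial note: a rough sketch of that setup using libbpf's xsk API; the
interface name "i40e2" and queue id 2 are placeholders taken from the runs
quoted above, and error handling is trimmed.]

/* Bind an otherwise idle AF_XDP socket to a queue so the driver runs
 * its XDP processing on the zero-copy path, while the XDP program that
 * is already attached (e.g. xdp_rxq_info) keeps handling the traffic. */
#include <stdlib.h>
#include <unistd.h>
#include <linux/if_link.h>
#include <linux/if_xdp.h>
#include <bpf/xsk.h>

#define NUM_FRAMES 4096

int main(void)
{
	struct xsk_ring_prod fq, tx;
	struct xsk_ring_cons cq, rx;
	struct xsk_umem *umem;
	struct xsk_socket *xsk;
	void *bufs;

	/* Page-aligned memory backing the UMEM. */
	if (posix_memalign(&bufs, getpagesize(),
			   NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE))
		return 1;
	if (xsk_umem__create(&umem, bufs,
			     NUM_FRAMES * XSK_UMEM__DEFAULT_FRAME_SIZE,
			     &fq, &cq, NULL))
		return 1;

	struct xsk_socket_config cfg = {
		.rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
		/* Do not replace the XDP program that is already attached. */
		.libbpf_flags = XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD,
		.xdp_flags = XDP_FLAGS_DRV_MODE,
		.bind_flags = XDP_ZEROCOPY,	/* force the ZC data path */
	};

	/* "i40e2" / queue 2 match the runs quoted above; adjust as needed. */
	if (xsk_socket__create(&xsk, "i40e2", 2, umem, &rx, &tx, &cfg))
		return 1;

	pause();	/* socket never sees traffic; it just pins the ZC path */

	xsk_socket__delete(xsk);
	xsk_umem__delete(umem);
	free(bufs);
	return 0;
}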

>  You lost 10% performance in this case:
>   3.54 nanosec (35.36*0.10) slower
>
> In an XDP context, 3.54 nanosec is a lot; as you can see, it is 10% in
> this zoom-in benchmark.  We have to look at the details.
>
> One detail/issue with i40e doing XDP_TX is that I cannot verify that
> packets are actually transmitted... not via the exception tracepoint, not
> via netstats, not via ethtool_stats.pl.  Maybe all the packets are
> getting (silently) dropped in my tests...!?!
>
>
> > The xdpsock numbers are more in the ballpark of
> > what I would expect.
> >
> > Tirtha and I have found some optimizations in the i40e
> > multi-frame/multi-buffer support that we have implemented. We will
> > test those next, post the results, and share the code.
> >
> > > >
> > > > Just note that I would really like for the multi-frame support to get
> > > > in. I have lost count of how many people have asked for it to be
> > > > added to XDP and AF_XDP. So please check our implementation and
> > > > improve it so we can get the overhead down to where we want it to be.
> > >
> > > sure, I will do.
> > >
> > > Regards,
> > > Lorenzo
> > >
> > > >
> > > > Thanks: Magnus
> > > >
> > > > > I would like to reproduce these results in my test lab, in order to
> > > > > figure out where the throughput degradation comes from.
> > > > >
> > > > > > What performance do you see with the mvneta card? How much are we
> > > > > > willing to pay for this feature when it is not being used, or can we
> > > > > > in some way selectively turn it on only when needed?
> > > > >
> > > > > Well, as Daniel says, performance-wise we require close to /zero/
> > > > > additional overhead, especially as you state this happens when sending
> > > > > a single frame, which is a base case that we must not slow down.
> > > > >
> > > > > --
> > > > > Best regards,
> > > > >   Jesper Dangaard Brouer
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
>
>
> Running XDP on dev:i40e2 (ifindex:6) action:XDP_DROP options:read
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      2       33,417,775  0
> XDP-RX CPU      total   33,417,775
>
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    2:2   33,417,775  0
> rx_queue_index    2:sum 33,417,775
>
>
> Running XDP on dev:i40e2 (ifindex:6) action:XDP_TX options:swapmac
> XDP stats       CPU     pps         issue-pps
> XDP-RX CPU      2       28,278,722  0
> XDP-RX CPU      total   28,278,722
>
> RXQ stats       RXQ:CPU pps         issue-pps
> rx_queue_index    2:2   28,278,726  0
> rx_queue_index    2:sum 28,278,726
>
>
>


