Re: [PATCH v4 bpf-next 00/13] mvneta: introduce XDP multi-buffer support

On Fri, 02 Oct 2020 11:06:12 -0700
John Fastabend <john.fastabend@xxxxxxxxx> wrote:

> Lorenzo Bianconi wrote:
> > > Lorenzo Bianconi wrote:  
> > > > This series introduces XDP multi-buffer support. The mvneta driver is
> > > > the first to support these new "non-linear" xdp_{buff,frame}. Reviewers,
> > > > please focus on how these new types of xdp_{buff,frame} packets
> > > > traverse the different layers and on the layout design. The BPF-helpers
> > > > are kept simple on purpose, as we don't want to expose the internal
> > > > layout, in order to allow later changes.
> > > > 
> > > > For now, to keep the design simple and to maintain performance, the XDP
> > > > BPF-prog (still) only has access to the first buffer. It is left for
> > > > later (another patchset) to add payload access across multiple buffers.
> > > > This patchset should still allow for these future extensions. The goal
> > > > is to lift the MTU restriction that comes with XDP while maintaining
> > > > the same performance as before.
> > > > 
> > > > The main idea for the new multi-buffer layout is to reuse the same
> > > > layout used for non-linear SKBs. This relies on the "skb_shared_info"
> > > > struct at the end of the first buffer to link together subsequent
> > > > buffers. Keeping the layout compatible with SKBs is also done to ease
> > > > and speed up creating an SKB from an xdp_{buff,frame}. Converting an
> > > > xdp_frame to an SKB and delivering it to the network stack is shown in
> > > > the cpumap code (patch 13/13).
> > > 
> > > Using the end of the buffer for the skb_shared_info struct is going to
> > > become driver API, so unwinding it if it proves to be a performance issue
> > > is going to be ugly. So same question as before: for the use case where
> > > we receive a packet and do XDP_TX with it, how do we avoid the cache miss
> > > overhead? This is not just a hypothetical use case; the Facebook
> > > load balancer is doing this, as is Cilium, and allowing this with
> > > multi-buffer packets >1500B would be useful.
> > > 
> > > Can we write the skb_shared_info lazily? It should only be needed once
> > > we know the packet is going up the stack to some place that needs the
> > > info, which we could learn from the return code of the XDP program.
> > 
> > Hi John,  
> 
> Hi, I'll try to join the two threads (this one and the one on helpers) here
> so we don't get too fragmented.
> 
> > 
> > I agree, I think for the XDP_TX use-case it is not strictly necessary to fill
> > the skb_shared_info. The driver can just keep this info on the stack and use
> > it when inserting the packet back into the DMA ring.
> > For mvneta I implemented it this way to keep the code aligned with the
> > ndo_xdp_xmit path, since it is a low-end device. I guess we are not
> > introducing any API constraint for XDP_TX. A high-end device can implement
> > multi-buff for XDP_TX in a different way in order to avoid the cache miss.
> 
> Agree it would be an implementation detail for XDP_TX except the two
> helpers added in this series currently require it to be there.

That is a good point.  If you look at the details, the helpers use the
xdp_buff->mb bit to guard against accessing the "shared_info"
cache-line. Thus, for the normal single-frame case, XDP_TX should not
see a slowdown.  Do we really need to optimize the XDP_TX multi-frame
case(?)
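
To illustrate the guard, here is a minimal sketch (not the exact helper
code from the series; it assumes the proposed xdp_buff->mb bit and the
tail placement of shared_info described in the cover letter):

  #include <linux/skbuff.h>
  #include <net/xdp.h>

  /* Sketch: frags-count lookup that never touches the tail cache-line
   * for ordinary single-buffer frames.
   */
  static __always_inline int xdp_nr_frags_sketch(struct xdp_buff *xdp)
  {
          struct skb_shared_info *sinfo;

          if (!xdp->mb)   /* single-buffer: skip the shared_info cache-line */
                  return 0;

          /* shared_info sits at the end of the first buffer, in the
           * same spot build_skb() expects it.
           */
          sinfo = (struct skb_shared_info *)(xdp->data_hard_start +
                          xdp->frame_sz -
                          SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
          return sinfo->nr_frags;
  }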


> > 
> > We need to fill the skb_shared_info only when we want to pass the frame to
> > the network stack (build_skb() can directly reuse skb_shared_info->frags[])
> > or for the XDP_REDIRECT use-case.
> 
> It might be good to think about the XDP_REDIRECT case as well then. If the
> frags list fits in the metadata/xdp_frame, would we expect better
> performance?

I don't like using space in xdp_frame for this. (1) We (Ahern and I)
are planning to use the space in xdp_frame for RX-csum + RX-hash + vlan,
which will be more common (e.g. all packets will have HW RX-hash + csum).
(2) I consider XDP multi-buffer the exception case, which will not be
used in most cases, so why reserve space for it in this cache-line?

IMHO we CANNOT allow any slowdown for existing XDP use-cases, but XDP
multi-buffer use-cases are allowed to run "slower".


> Looking at skb_shared_info{}, that is a rather large structure with many

A cache-line detail about skb_shared_info: the first frags[0] member is
in the first cache-line.  This means that it is still fast to have xdp
frames with 1 extra buffer.
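
As a rough compile-time illustration of that point (a sketch, not part
of the series, assuming the usual 64-byte cache-line size):

  #include <linux/build_bug.h>
  #include <linux/cache.h>
  #include <linux/skbuff.h>

  /* Sketch: frags[] starts inside the first SMP_CACHE_BYTES of
   * skb_shared_info, so reading frags[0] stays within the cache-line
   * the driver would prefetch anyway.
   */
  static inline void skb_shared_info_layout_check(void)
  {
          BUILD_BUG_ON(offsetof(struct skb_shared_info, frags) >=
                       SMP_CACHE_BYTES);
  }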

> fields that look unnecessary for the XDP_REDIRECT case and are only needed
> when passing to the stack.

Yes, I think we can use the first cache-line of skb_shared_info more
optimally (via defining an xdp_shared_info struct). But I still want us
to use this specific cache-line.  Let me explain why below. (Avoiding
cache-line misses is all about the details, so I hope you can follow.)

Hopefully most driver developers understand/know this.  In the RX-loop
the current RX-descriptor has a status that indicates there are more
frames, usually expressed as non-EOP (End-Of-Packet).  Thus, a driver
can start a prefetchw of this shared_info cache-line, prior to
processing the RX-desc that describes the multi-buffer.
 (Remember this shared_info is constructed prior to calling XDP and any
XDP_TX action, thus the XDP prog should not see a cache-line miss when
using the BPF-helper to read the shared_info area.)
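
A minimal sketch of that pattern (the EOP flag and function name are
illustrative, not taken from mvneta):

  #include <linux/prefetch.h>
  #include <linux/skbuff.h>
  #include <net/xdp.h>

  /* Sketch: on a non-EOP descriptor, warm up the shared_info
   * cache-line at the tail of the first buffer while the remaining
   * RX-descriptors are being processed.
   */
  static inline void rx_maybe_prefetch_sinfo(struct xdp_buff *xdp,
                                             bool desc_is_eop)
  {
          if (!desc_is_eop)
                  prefetchw(xdp->data_hard_start + xdp->frame_sz -
                            SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
  }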


> Fundamentally, a frag just needs
> 
>  struct bio_vec {
>      struct page *bv_page;     // 8B
>      unsigned int bv_len;      // 4B
>      unsigned int bv_offset;   // 4B
>  } // 16B
> 
> With header split + data we only need a single frag, so we could use just
> 16B. And in the worst case, jumbo frame + header split, 3 entries would seem
> to be enough, giving 48B (header plus three 4k pages).

For a jumbo-frame 9000 MTU, 2 entries might be enough, as we also have
room in the first buffer ((9000 - (4096-256-320)) / 4096 = 1.33789).

The problem is that we need to support the TSO (TCP Segmentation
Offload) use-case, which can have more frames. Thus, 3 entries will not
be enough.
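
Spelling out the 9000 MTU estimate above (a sketch, not driver code;
the 256 is XDP_PACKET_HEADROOM, the ~320 is the aligned skb_shared_info
tailroom, and one 4K page per buffer is assumed):

  #include <linux/kernel.h>

  #define BUF_SZ          4096    /* one page per buffer */
  #define XDP_HEADROOM     256    /* XDP_PACKET_HEADROOM */
  #define SINFO_TAILROOM   320    /* ~SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) */

  static inline int jumbo_extra_frags(int pkt_len)
  {
          int first_buf = BUF_SZ - XDP_HEADROOM - SINFO_TAILROOM; /* 3520B */

          if (pkt_len <= first_buf)
                  return 0;
          /* 9000B => (9000 - 3520) / 4096 = 1.34, i.e. 2 extra frags */
          return DIV_ROUND_UP(pkt_len - first_buf, BUF_SZ);
  }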

> Could we just stick this in the metadata and make it read only? Then
> programs that care can read it and get all the info they need without
> helpers.

I don't see how that is possible. (1) The metadata area is only 32
bytes. (2) When freeing an xdp_frame, the kernel needs to know the
layout, as these pages will be freed.

> I would expect performance to be better in the XDP_TX and
> XDP_REDIRECT cases. And copying an extra worst case 48B when passing to
> the stack I guess is not measurable given all the work needed in that
> path.

I do agree that when passing to the netstack we can do a transformation
from xdp_shared_info to skb_shared_info at a fairly small cost.  (The
TSO case would require more copying.)

Notice that allocating an SKB will always clear the first 32 bytes of
skb_shared_info.  If the XDP driver code path has done the prefetch
as described above, then we should see a speedup for netstack delivery.
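
A rough sketch of that delivery step (not the cpumap code from patch
13/13; the fragment count and total fragment length are assumed to have
been read from the shared_info before the SKB allocation clears its
first 32 bytes, and truesize accounting is omitted):

  #include <linux/skbuff.h>
  #include <net/xdp.h>

  static struct sk_buff *xdp_mb_build_skb_sketch(struct xdp_frame *xdpf,
                                                 u8 nr_frags,
                                                 unsigned int frags_len)
  {
          /* The xdp_frame itself lives at the start of the headroom. */
          unsigned int headroom = sizeof(*xdpf) + xdpf->headroom;
          void *hard_start = xdpf->data - headroom;
          struct sk_buff *skb;

          skb = build_skb(hard_start, xdpf->frame_sz);
          if (!skb)
                  return NULL;

          skb_reserve(skb, headroom);
          __skb_put(skb, xdpf->len);

          /* build_skb() placed skb_shinfo(skb) on the same tail area the
           * driver filled; frags[] lie beyond the 32 cleared bytes, so
           * only nr_frags and the length accounting must be redone.
           */
          skb_shinfo(skb)->nr_frags = nr_frags;
          skb->data_len += frags_len;
          skb->len += frags_len;

          return skb;
  }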

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer



