On Fri, 02 Oct 2020 11:06:12 -0700
John Fastabend <john.fastabend@xxxxxxxxx> wrote:

> Lorenzo Bianconi wrote:
> > > Lorenzo Bianconi wrote:
> > > > This series introduces XDP multi-buffer support. The mvneta driver is
> > > > the first to support these new "non-linear" xdp_{buff,frame}. Reviewers,
> > > > please focus on how these new types of xdp_{buff,frame} packets
> > > > traverse the different layers and on the layout design. It is on purpose
> > > > that BPF-helpers are kept simple, as we don't want to expose the
> > > > internal layout to allow later changes.
> > > >
> > > > For now, to keep the design simple and to maintain performance, the XDP
> > > > BPF-prog (still) only has access to the first buffer. It is left for
> > > > later (another patchset) to add payload access across multiple buffers.
> > > > This patchset should still allow for these future extensions. The goal
> > > > is to lift the MTU restriction that comes with XDP, but maintain the
> > > > same performance as before.
> > > >
> > > > The main idea for the new multi-buffer layout is to reuse the same
> > > > layout used for non-linear SKBs. This relies on the "skb_shared_info"
> > > > struct at the end of the first buffer to link together subsequent
> > > > buffers. Keeping the layout compatible with SKBs is also done to ease
> > > > and speed up creating an SKB from an xdp_{buff,frame}. Converting an
> > > > xdp_frame to an SKB and delivering it to the network stack is shown in
> > > > the cpumap code (patch 13/13).
> > >
> > > Using the end of the buffer for the skb_shared_info struct is going to
> > > become driver API, so unwinding it if it proves to be a performance issue
> > > is going to be ugly. So same question as before: for the use case where
> > > we receive a packet and do XDP_TX with it, how do we avoid the cache-miss
> > > overhead? This is not just a hypothetical use case; the Facebook
> > > load balancer is doing this, as well as Cilium, and allowing this with
> > > multi-buffer packets >1500B would be useful.
> > >
> > > Can we write the skb_shared_info lazily? It should only be needed once
> > > we know the packet is going up the stack to some place that needs the
> > > info. Which we could learn from the return code of the XDP program.
> >
> > Hi John,
>
> Hi, I'll try to join the two threads, this one and the one on helpers, here
> so we don't get too fragmented.
>
> > I agree, I think for the XDP_TX use-case it is not strictly necessary to
> > fill the skb_shared_info. The driver can just keep this info on the stack
> > and use it when inserting the packet back into the DMA ring.
> > For mvneta I implemented it this way to keep the code aligned with the
> > ndo_xdp_xmit path, since it is a low-end device. I guess we are not
> > introducing any API constraint for XDP_TX. A high-end device can implement
> > multi-buff for XDP_TX in a different way in order to avoid the cache miss.
>
> Agree, it would be an implementation detail for XDP_TX, except that the two
> helpers added in this series currently require it to be there.

That is a good point. If you look at the details, the helpers use the
xdp_buff->mb bit to guard against accessing the "shared_info" cache-line.
Thus, for the normal single-buffer case XDP_TX should not see a slowdown.
Do we really need to optimize the XDP_TX multi-frame case(?)
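To make this concrete, the guard looks roughly like below. This is only a
sketch, not the exact helper code in the series; the function name
xdp_frags_total_size() is made up for illustration, and treat the accessor
name xdp_get_shared_info_from_buff() as the one proposed in the patches:

  /* Sketch: the mb bit is tested before the skb_shared_info cache-line
   * is ever touched, so a single-buffer xdp_buff never pays the extra
   * cache-line miss.
   */
  static unsigned int xdp_frags_total_size(struct xdp_buff *xdp)
  {
          struct skb_shared_info *sinfo;
          unsigned int len = 0;
          int i;

          if (!xdp->mb)   /* single-buffer: skip shared_info entirely */
                  return 0;

          sinfo = xdp_get_shared_info_from_buff(xdp);
          for (i = 0; i < sinfo->nr_frags; i++)
                  len += skb_frag_size(&sinfo->frags[i]);

          return len;
  }

So for XDP_TX on normal single-buffer frames the only added cost is that
one branch on xdp->mb.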
> > We need to fill the skb_shared_info only when we want to pass the frame
> > to the network stack (build_skb() can directly reuse
> > skb_shared_info->frags[]) or for the XDP_REDIRECT use-case.
>
> It might be good to think about the XDP_REDIRECT case as well then. If the
> frags list fit in the metadata/xdp_frame would we expect better
> performance?

I don't like to use space in xdp_frame for this.

(1) We (Ahern and I) are planning to use the space in xdp_frame for
RX-csum + RX-hash + vlan, which will be more common (e.g. all packets will
have HW RX+csum).

(2) I consider XDP multi-buffer the exception case, which will not be used
in most cases, so why reserve space for that in this cache-line.

IMHO we CANNOT allow any slowdown for existing XDP use-cases, but IMHO XDP
multi-buffer use-cases are allowed to run "slower".

> Looking at skb_shared_info{} that is a rather large structure with many
> fields that look unnecessary for XDP_REDIRECT case and only needed when
> passing to the stack.

A cache-line detail about skb_shared_info: the first frags[0] member is in
the first cache-line. Meaning that it is still fast to have xdp frames with
1 extra buffer.

Yes, I think we can use the first cache-line of skb_shared_info more
optimally (via defining an xdp_shared_info struct), but I still want us to
use this specific cache-line. Let me explain why below. (Avoiding
cache-line misses is all about the details, so I hope you can follow.)

Hopefully most driver developers understand/know this: in the RX-loop the
current RX-descriptor has a status that indicates there are more frames,
usually expressed as non-EOP (End-Of-Packet). Thus, a driver can start a
prefetchw of this shared_info cache-line prior to processing the RX-desc
that describes the multi-buffer. (Remember this shared_info is constructed
prior to calling XDP and any XDP_TX action, thus the XDP prog should not
see a cache-line miss when using the BPF-helper to read the shared_info
area.)

> Fundamentally, a frag just needs
>
> struct bio_vec {
>     struct page *bv_page;     // 8B
>     unsigned int bv_len;      // 4B
>     unsigned int bv_offset;   // 4B
> } // 16B
>
> With header split + data we only need a single frag so we could use just
> 16B. And worst case jumbo frame + header split seems 3 entries would be
> enough giving 48B (header plus 3 4k pages).

For a jumbo-frame 9000 MTU, 2 entries might be enough, as we also have room
in the first buffer ((9000 - (4096 - 256 - 320)) / 4096 = 1.33789).

The problem is that we need to support the TSO (TCP Segmentation Offload)
use-case, which can have more frames. Thus, 3 entries will not be enough.

> Could we just stick this in the metadata and make it read only? Then
> programs that care can read it and get all the info they need without
> helpers.

I don't see how that is possible: (1) the metadata area is only 32 bytes;
(2) when freeing an xdp_frame the kernel needs to know the layout, as these
pages will need to be freed.

> I would expect performance to be better in the XDP_TX and
> XDP_REDIRECT cases. And copying an extra worst case 48B in passing to
> the stack I guess is not measurable given all the work needed in that
> path.

I do agree that, when passing to the netstack, we can do a transformation
from xdp_shared_info to skb_shared_info at a fairly small cost. (The TSO
case would require more copying.)

Notice that allocating an SKB will always clear the first 32 bytes of
skb_shared_info. If the XDP driver-code path has done the prefetch as
described above, then we should see a speedup for netstack delivery.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer