On 10/2/20 5:25 PM, John Fastabend wrote:
Lorenzo Bianconi wrote:
This series introduces XDP multi-buffer support. The mvneta driver is
the first to support these new "non-linear" xdp_{buff,frame}. Reviewers,
please focus on how these new types of xdp_{buff,frame} packets
traverse the different layers and on the layout design. The BPF helpers
are intentionally kept simple, as we don't want to expose the
internal layout, so that it can still be changed later.
For now, to keep the design simple and to maintain performance, the XDP
BPF program (still) only has access to the first buffer. Adding payload
access across multiple buffers is left for later (another patchset); this
patchset should still allow for such future extensions. The goal is to
lift the MTU restriction that comes with XDP while maintaining the same
performance as before.
The main idea for the new multi-buffer layout is to reuse the same
layout used for non-linear SKBs. This relies on the "skb_shared_info"
struct at the end of the first buffer to link together subsequent
buffers. Keeping the layout compatible with SKBs also eases and speeds up
creating an SKB from an xdp_{buff,frame}. Converting an xdp_frame to an
SKB and delivering it to the network stack is shown in the cpumap
code (patch 13/13).
Using the end of the buffer for the skb_shared_info struct is going to
become driver API, so unwinding it if it proves to be a performance issue
is going to be ugly. So, same question as before: for the use case where
we receive a packet and do XDP_TX with it, how do we avoid the cache-miss
overhead? This is not just a hypothetical use case; the Facebook
load balancer does this, as does Cilium, and allowing this with
multi-buffer packets >1500B would be useful.
[...]
Fully agree. My other question would be whether anyone else is currently in the process
of implementing this scheme for a 40G+ NIC. My concern is that the numbers below are rather
on the lower end of the spectrum, so I would like to see a comparison of XDP as-is
today vs. XDP multi-buff on a higher-end NIC, so that we have a picture of how well the
currently designed scheme works there and which performance issues we'll run into, e.g.
under a typical XDP L4 load balancer scenario with XDP_TX. I think this would be crucial
before the driver API becomes 'sort of' set in stone, others start adapting
it, and changing the design becomes painful. Do the ena folks have an implementation ready as
well? And what about virtio_net, for example; is anyone committing there too? Typically,
for such features to land, we require at least two drivers implementing them.
Typical use cases for this series are:
- Jumbo-frames
- Packet header split (please see Google's use-case @ NetDevConf 0x14, [0])
- TSO
More info about the main idea behind this approach can be found here [1][2].
We carried out some throughput tests in a standard linear frame scenario in order
to verify that we did not introduce any performance regression by adding XDP multi-buff
support to mvneta:
Offered load is ~1000Kpps, packet size is 64B, mvneta descriptor size is one PAGE.
mvneta baseline, commit 879456bedbe5 ("net: mvneta: avoid possible cache misses in mvneta_rx_swbm"):
- xdp-pass: ~162Kpps
- xdp-drop: ~701Kpps
- xdp-tx: ~185Kpps
- xdp-redirect: ~202Kpps
mvneta xdp multi-buff:
- xdp-pass: ~163Kpps
- xdp-drop: ~739Kpps
- xdp-tx: ~182Kpps
- xdp-redirect: ~202Kpps
[...]