Re: [PATCH] net: xdp: allow user space to request a smaller packet headroom requirement

Felix Fietkau <nbd@xxxxxxxx> · Tue, 15 Mar 2022 08:25:13 +0100

On 15.03.22 07:50, Jesper Dangaard Brouer wrote:
On 14/03/2022 23.43, Felix Fietkau wrote:

On 14.03.22 23:20, Daniel Borkmann wrote:
On 3/14/22 11:16 PM, Toke Høiland-Jørgensen wrote:
Felix Fietkau <nbd@xxxxxxxx> writes:
On 14.03.22 21:39, Jesper D. Brouer wrote:
(Cc. BPF list and other XDP maintainers)
On 14/03/2022 11.22, Felix Fietkau wrote:
Most ethernet drivers allocate a packet headroom of NET_SKB_PAD. 
Since it is
rounded up to L1 cache size, it ends up being at least 64 bytes on 
the most
common platforms.
On most ethernet drivers, having a guaranteed headroom of 256 
bytes for XDP
adds an extra forced pskb_expand_head call when enabling SKB XDP, 
which can
be quite expensive.
Many XDP programs need only very little headroom, so it can be 
beneficial
to have a way to opt-out of the 256 bytes headroom requirement.

IMHO 64 bytes is too small.
We are using this area for struct xdp_frame and also for metadata
(XDP-hints).  This will limit us from growing this structures for
the sake of generic-XDP.

I'm fine with reducting this to 192 bytes, as most Intel drivers
have this headroom, and have defacto established that this is
a valid XDP headroom, even for native-XDP.

We could go a small as two cachelines 128 bytes, as if xdp_frame
and metadata grows above a cache-line (64 bytes) each, then we have
done something wrong (performance wise).
  >>>>
Here's some background on why I chose 64 bytes: I'm currently
implementing a userspace + xdp program to act as generic fastpath to
speed network bridging.

Cc. Lorenzo as you mention you wanted to accelerate Linux bridging.
We can avoid a lot of external FIB sync in userspace, if we
add some BPF-helpers to lookup in bridge FIB table.
I talked to Lorenzo a lot about my approach already. If we do the FIB 
lookup from BPF and add all the necessary checks to handle the 
forwarding path, I believe the result is going to look a lot like the 
existing bridge code, and would not make it significantly faster.

My approach involves creating a map that caches src-mac + dest-mac + 
vlan "flows", along with the target port + vlan.
User space then takes care of flushing those entries based on FDB, port 
and config changes.
I already prototyped this with an in-kernel implementation directly in 
the bridge code, and it led to a 6-10% CPU usage reduction.
Preliminary tests indicate that with my user space + XDP implementation 
and the headroom modification, I'm getting comparable performance.

Any reason this can't run in the TC ingress hook instead? Generic XDP is
a bit of an odd duck, and I'm not a huge fan of special-casing it this
way...

+1, would have been fine with generic reduction to just down to 192 bytes
(though not less than that), but 64 is a bit too little. 

+1

Also curious on why not tc ingress instead?
  >
I chose XDP because of bpf_redirect_map, which doesn't seem to be 
available to tc ingress classifier programs.

TC have a "normal" bpf_redirect which is slower than a bpf_redirect_map.
The secret to the performance boost from bpf_redirect_map is the hidden
TX bulking layer.  I have experimented with TC bulking, which showed a
30% performance boost on TC redirect to TX with fixed 32 frame bulking.

Notice that newer libbpf makes it easier to attach to the TC hook
without shell'ing out to 'tc' binary. See how to use in this example:

https://github.com/xdp-project/bpf-examples/blob/master/tc-policy/tc_txq_policy.c

When I started writing the code, I didn't know that generic XDP 
performance would be bad on pretty much any ethernet/WLAN driver that 
wasn't updated to support it.

Yes, we kind of kept generic-XDP non-optimized as the real tool for
the  job is TC-BPF, given at this stage the SKB have already been
allocate, and converting it back to a XDP representation is not
going to be faster that TC-BPF.

Maybe we should optimize generic-XDP a bit more, because I'm
buying the argument, that developers want to write one BPF program
that works across device drivers, instead of having to handle
both TC-BPF and XDP-BPF.

Notice that generic-XDP doesn't actually do the bulking step, even
though using bpf_redirect_map. (would be obvious optimization).

If we extend kernel/bpf.devmap.c with bulking for SKBs, then both
TC-BPF and generic-XDP could share that code path, and obtain
the TX bulking performance boost via SKB-list + xmit_more.
Thanks for the information. I guess I'll rewrite my code to use TC-BPF now.

- Felix