Re: [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()

Jesper Dangaard Brouer <hawk@xxxxxxxxxx> · Tue, 13 Aug 2024 11:51:45 +0200

On 13/08/2024 03.33, Jakub Kicinski wrote:
On Fri, 9 Aug 2024 14:20:25 +0200 Alexander Lobakin wrote:
But I think one solution could be:

1. We create some generic structure for cpumap, like

struct cpumap_meta {
	u32 magic;
	u32 hash;
};

2. We add such check in the cpumap code

	if (xdpf->metalen == sizeof(struct cpumap_meta) &&
	    <here we check magic>)
		skb->hash = meta->hash;

3. In XDP prog, you call Rx hints kfuncs when they're available, obtain
RSS hash and then put it in the struct cpumap_meta as XDP frame metadata.
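
The check in step 2 can be sketched in plain userspace C as below. This
is only an illustration under stated assumptions, not the cpumap code:
CPUMAP_META_MAGIC is a made-up value (the thread does not define one),
struct mock_frame and struct mock_skb are simplified stand-ins for the
kernel's frame and skb types, the field name metalen follows the
pseudocode above, and cpumap_apply_meta() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical magic value; the thread does not specify one. */
#define CPUMAP_META_MAGIC 0x43504d48u

/* Metadata layout as proposed in the thread. */
struct cpumap_meta {
	uint32_t magic;
	uint32_t hash;
};

/* The size comparison below relies on the struct having no padding. */
static_assert(sizeof(struct cpumap_meta) == 8, "unexpected padding");

/* Simplified userspace stand-in for an XDP frame (assumption). */
struct mock_frame {
	const void *meta;	/* start of the metadata area */
	uint32_t metalen;	/* number of metadata bytes */
};

/* Simplified userspace stand-in for an skb (assumption). */
struct mock_skb {
	uint32_t hash;
};

/*
 * Copy the RSS hash from the frame metadata into the skb, but only if
 * the metadata has exactly the expected size and carries the magic.
 * Returns 1 if the hash was applied, 0 if the skb was left untouched.
 */
static int cpumap_apply_meta(const struct mock_frame *f, struct mock_skb *skb)
{
	struct cpumap_meta m;

	if (f->metalen != sizeof(m))
		return 0;

	/* memcpy avoids assuming the metadata area is suitably aligned. */
	memcpy(&m, f->meta, sizeof(m));
	if (m.magic != CPUMAP_META_MAGIC)
		return 0;

	skb->hash = m.hash;
	return 1;
}
```

The size check comes first so that metadata written by an unrelated XDP
program, with a different length, is never interpreted as a struct
cpumap_meta; the magic then guards against same-sized metadata with a
different meaning.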

I wonder what the overhead of skb metadata allocation is in practice.
With Eric's "return skb to the CPU of origin" we can feed the lockless
skb cache on the right CPU, and also feed the lockless page pool
cache. I wonder if batched RFS wouldn't be faster than the XDP thing
that requires all the groundwork.

I explicitly developed CPUMAP because I was benchmarking Receive Flow
Steering (RFS) and Receive Packet Steering (RPS), which I observed were
the bottleneck. The overhead on the RX-CPU was too large, because RFS
and RPS maintain data structures to avoid Out-of-Order packets. The
Flow Dissector step was also a limiting factor.

By bottleneck I mean it didn't scale, as the RX-CPU packets-per-second
processing speed was too low compared to the remote-CPU pps.
Digging in my old notes, I can see that RPS was limited to around 4.8
Mpps (and I have a weird test with part of it disabled showing 7.5
Mpps). In [1] the remote-CPU could process (starting at) 2.7 Mpps when
dropping UDP packets in the UdpNoPorts case (with a baseline of 3.3 Mpps
if not remote), thus it only scales up to 1.78 remote-CPUs (4.8 / 2.7).
[1] shows how optimizations bring the remote-CPU to handle 3.2 Mpps
(close to the non-remote 3.3 Mpps baseline). In [2] those optimizations
bring the remote-CPU to 4 Mpps (for the UdpNoPorts case). XDP
RX-redirect in [1]+[2] was around 19 Mpps (which might be lower today
due to perf paper cuts).

 [1] https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap02-optimizations.org
 [2] https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap03-optimizations.org

Eric's "return skb to the CPU of origin" should help improve the case
for the remote-CPU, as I was seeing some bottlenecks in how we returned
the memory.

--Jesper