Re: [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()

Daniel Xu <dxu@xxxxxxxxx> · Tue, 20 Aug 2024 17:29:45 -0700

Hi Olek,

On Mon, Aug 19, 2024 at 04:50:52PM GMT, Alexander Lobakin wrote:
[..]
> > Thanks A LOT for doing this benchmarking!
> 
> I optimized the code a bit and picked my old patches for bulk NAPI skb
> cache allocation and today I got 4.7 Mpps 🎉
> IOW, the result of the series (7 patches totally, but 2 are not
> networking-related) is 2.7 -> 4.7 Mpps == 75%!
> 
> Daniel,
> 
> if you want, you can pick my tree[0], either full or just up to
> 
> "bpf: cpumap: switch to napi_skb_cache_get_bulk()"
> 
> (13 patches total: 6 for netdev_feature_t and 7 for the cpumap)
> 
> and test with your usecases. Would be nice to see some real world
> results, not my synthetic tests :D
> 
> > --Jesper
> 
> [0]
> https://github.com/alobakin/linux/compare/idpf-libie-new~52...idpf-libie-new/

So it turns out keeping the workload in place while I update and reboot
the kernel is a Hard Problem. I'll put in some more effort and see if I
can get one of the workloads to stay still, but it'll be a somewhat
noisy test even if it works. So the following are synthetic tests
(neper), but run on a real prod setup as far as container networking and
configuration are concerned.

I cherry-picked 586be610~1..ca22ac8e9de onto our 6.9-ish branch. Had to
skip some of the flag refactors b/c of conflicts - I didn't know the
code well enough to do fixups. So I had to apply this diff (FWIW not sure
the struct_size() here was right anyhow):

diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 089d19c62efe..359fbfaa43eb 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -110,7 +110,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
 	if (!cmap->cpu_map)
 		goto free_cmap;
 
-	dev = bpf_map_area_alloc(struct_size(dev, priv, 0), NUMA_NO_NODE);
+	dev = bpf_map_area_alloc(sizeof(*dev), NUMA_NO_NODE);
 	if (!dev)
 		goto free_cpu_map;
 

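For reference, here is a userspace sketch of why the two allocation sizes in the diff above should come out the same when the element count is 0. It assumes the struct ends in a flexible array member called priv. `struct fake_dev`, `flex_struct_size()`, `STRUCT_SIZE()` and `zero_count_matches_sizeof()` are hypothetical stand-ins made up for illustration, not kernel code. The in-kernel struct_size() also saturates at SIZE_MAX on overflow, which the sketch only imitates roughly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a struct that ends in a flexible array member. */
struct fake_dev {
	int ifindex;
	unsigned long flags;
	unsigned char priv[];	/* flexible array member */
};

/* Compile-time sanity check: the flexible array starts inside the struct. */
static_assert(offsetof(struct fake_dev, priv) <= sizeof(struct fake_dev),
	      "flexible array member must not start past the end of the struct");

/*
 * Rough userspace approximation of what struct_size() computes:
 * base + elem * count, saturating at SIZE_MAX instead of wrapping.
 */
static size_t flex_struct_size(size_t base, size_t elem, size_t count)
{
	if (elem != 0 && count > (SIZE_MAX - base) / elem)
		return SIZE_MAX;
	return base + elem * count;
}

/* sizeof does not evaluate its operand, so p may be an unset pointer. */
#define STRUCT_SIZE(p, member, count) \
	flex_struct_size(sizeof(*(p)), sizeof((p)->member[0]), (count))

/* Returns 1 when the count == 0 size equals plain sizeof(*dev). */
static int zero_count_matches_sizeof(void)
{
	struct fake_dev *dev = NULL;

	return STRUCT_SIZE(dev, priv, 0) == sizeof(*dev);
}
```

Under these assumptions, a count of 0 contributes nothing beyond sizeof(*dev), so swapping one expression for the other should not change how much memory gets allocated.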
=== Baseline ===
tcp_rr (transactions, latency): ./tcp_rr -c -H $SERVER -p 50,90,99 -T4 -F8 -l30
tcp_stream (throughput):        ./tcp_stream -c -H $SERVER -T8 -F16 -l30

          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)  Throughput (Mbit/s)
Run 1     2578189       0.00008831       0.00010623       0.00013439       15427.22
Run 2     2657923       0.00008575       0.00010239       0.00012927       15272.12
Run 3     2700402       0.00008447       0.00010111       0.00013183       14871.35
Run 4     2571739       0.00008575       0.00011519       0.00013823       15344.72
Run 5     2476427       0.00008703       0.00013055       0.00016895       15193.2
Average   2596936       0.000086262      0.000111094      0.000140534      15221.722

=== cpumap NAPI patches ===
          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)  Throughput (Mbit/s)
Run 1     2554598       0.00008703       0.00011263       0.00013055       17090.29
Run 2     2478905       0.00009087       0.00011391       0.00014463       16742.27
Run 3     2418599       0.00009471       0.00011007       0.00014207       17555.3
Run 4     2562463       0.00008959       0.00010367       0.00013055       17892.3
Run 5     2716551       0.00008127       0.00010879       0.00013439       17578.32
Average   2546223.2     0.000088694      0.000109814      0.000136438      17371.696
Delta     -1.95%        2.82%            -1.15%           -2.91%           14.12%


So it looks like the GRO patches work quite well out of the box. It's
curious that tcp_rr transactions go down a bit, though. I don't have any
intuition around that.
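
One rough way to judge the tcp_rr dip is to compare the change in the mean against the run-to-run spread. The following is a minimal sketch, not a statistical test: with only five runs per configuration it just checks whether the [min, max] ranges of the two sets overlap. The helper names are made up for illustration, and the arrays are copied from the tables in this mail.

```c
#include <assert.h>
#include <stddef.h>

/* Per-run results copied from the tables in this mail. */
static const double rr_base[]  = { 2578189, 2657923, 2700402, 2571739, 2476427 };
static const double rr_patch[] = { 2554598, 2478905, 2418599, 2562463, 2716551 };
static const double st_base[]  = { 15427.22, 15272.12, 14871.35, 15344.72, 15193.2 };
static const double st_patch[] = { 17090.29, 16742.27, 17555.3, 17892.3, 17578.32 };

static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

static double min_of(const double *v, size_t n)
{
	double m = v[0];

	for (size_t i = 1; i < n; i++)
		if (v[i] < m)
			m = v[i];
	return m;
}

static double max_of(const double *v, size_t n)
{
	double m = v[0];

	for (size_t i = 1; i < n; i++)
		if (v[i] > m)
			m = v[i];
	return m;
}

/* Percent change of the patched mean relative to the baseline mean. */
static double pct_delta(const double *base, const double *patch, size_t n)
{
	double b = mean(base, n);

	return (mean(patch, n) - b) / b * 100.0;
}

/* 1 if the [min, max] ranges of the two run sets overlap, 0 otherwise. */
static int ranges_overlap(const double *a, const double *b, size_t n)
{
	return min_of(a, n) <= max_of(b, n) && min_of(b, n) <= max_of(a, n);
}
```

With these numbers, ranges_overlap(rr_base, rr_patch, 5) is 1: the tcp_rr transaction counts of the two kernels overlap, and the -1.95% shift in the mean is smaller than the gap between the best and worst baseline run. ranges_overlap(st_base, st_patch, 5) is 0: every patched tcp_stream run is faster than every baseline run.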

Lemme know if you wanna change some stuff and get a rerun.

Thanks,
Daniel