On Wed, Aug 21, 2024 at 03:16:51PM GMT, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@xxxxxxxxx>
> Date: Tue, 20 Aug 2024 17:29:45 -0700
>
> > Hi Olek,
> >
> > On Mon, Aug 19, 2024 at 04:50:52PM GMT, Alexander Lobakin wrote:
> > [..]
> >>> Thanks A LOT for doing this benchmarking!
> >>
> >> I optimized the code a bit and picked my old patches for bulk NAPI skb
> >> cache allocation and today I got 4.7 Mpps 🎉
> >> IOW, the result of the series (7 patches in total, but 2 are not
> >> networking-related) is 2.7 -> 4.7 Mpps == 75%!
> >>
> >> Daniel,
> >>
> >> if you want, you can pick my tree[0], either full or just up to
> >>
> >> "bpf: cpumap: switch to napi_skb_cache_get_bulk()"
> >>
> >> (13 patches total: 6 for netdev_feature_t and 7 for the cpumap)
> >>
> >> and test with your use cases. Would be nice to see some real-world
> >> results, not my synthetic tests :D
> >>
> >>> --Jesper
> >>
> >> [0]
> >> https://github.com/alobakin/linux/compare/idpf-libie-new~52...idpf-libie-new/
> >
> > So it turns out keeping the workload in place while I update and reboot
> > the kernel is a Hard Problem. I'll put in some more effort and see if I
> > can get one of the workloads to stay still, but it'll be a somewhat
> > noisy test even if it works. So the following are synthetic tests
> > (neper) but on a real prod setup as far as container networking and
> > configuration is concerned.
> >
> > I cherry-picked 586be610~1..ca22ac8e9de onto our 6.9-ish branch. Had to
> > skip some of the flag refactors b/c of conflicts - I didn't know the
> > code well enough to do fixups. So I had to apply this diff (FWIW, not sure
> > the struct_size() here was right anyhow):
> >
> > diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> > index 089d19c62efe..359fbfaa43eb 100644
> > --- a/kernel/bpf/cpumap.c
> > +++ b/kernel/bpf/cpumap.c
> > @@ -110,7 +110,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
> >  	if (!cmap->cpu_map)
> >  		goto free_cmap;
> >
> > -	dev = bpf_map_area_alloc(struct_size(dev, priv, 0), NUMA_NO_NODE);
> > +	dev = bpf_map_area_alloc(sizeof(*dev), NUMA_NO_NODE);
>
> Hmm, it will allocate the same amount of memory. Why do you need this?
> Are you running these patches on some older kernel which doesn't have a
> proper flex array at the end of &net_device?

Ah, my mistake, you're right. I probably looked at the 6.9 source without
the flex array and confused it with net-next. But yeah, the 6.9 kernel I
tested with does not have the flex array.
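
Here's a quick userspace sketch of why the two calls end up requesting the
same number of bytes once net_device does have the trailing flex array.
Note that fake_net_device and my_struct_size() are made-up stand-ins for
the real net_device and the kernel's struct_size() (the real macro also
saturates on overflow, which this sketch skips):

#include <stdio.h>

/*
 * Made-up stand-in for net_device with a trailing flexible array member
 * for the per-device private area; not the real kernel struct.
 */
struct fake_net_device {
	unsigned long	state;
	int		ifindex;
	long		priv[];
};

/*
 * Rough open-coded equivalent of the kernel's struct_size(p, member, count);
 * the real macro additionally does saturating overflow checks.
 */
#define my_struct_size(p, member, count) \
	(sizeof(*(p)) + (count) * sizeof(*(p)->member))

int main(void)
{
	struct fake_net_device *dev = NULL;	/* sizeof() never dereferences it */

	/* With a count of 0 the flex-array part contributes nothing... */
	printf("struct_size(dev, priv, 0): %zu\n", my_struct_size(dev, priv, 0));
	/* ...so it degenerates to plain sizeof(*dev). */
	printf("sizeof(*dev):              %zu\n", sizeof(*dev));

	return 0;
}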

> >  	if (!dev)
> >  		goto free_cpu_map;
> >
> >
> > === Baseline ===
> > ./tcp_rr -c -H $SERVER -p 50,90,99 -T4 -F8 -l30                               ./tcp_stream -c -H $SERVER -T8 -F16 -l30
> >
> >          Transactions   Latency P50 (s)   Latency P90 (s)   Latency P99 (s)            Throughput (Mbit/s)
> > Run 1    2578189        0.00008831        0.00010623        0.00013439        Run 1    15427.22
> > Run 2    2657923        0.00008575        0.00010239        0.00012927        Run 2    15272.12
> > Run 3    2700402        0.00008447        0.00010111        0.00013183        Run 3    14871.35
> > Run 4    2571739        0.00008575        0.00011519        0.00013823        Run 4    15344.72
> > Run 5    2476427        0.00008703        0.00013055        0.00016895        Run 5    15193.2
> > Average  2596936        0.000086262       0.000111094       0.000140534       Average  15221.722
> >
> > === cpumap NAPI patches ===
> >          Transactions   Latency P50 (s)   Latency P90 (s)   Latency P99 (s)            Throughput (Mbit/s)
> > Run 1    2554598        0.00008703        0.00011263        0.00013055        Run 1    17090.29
> > Run 2    2478905        0.00009087        0.00011391        0.00014463        Run 2    16742.27
> > Run 3    2418599        0.00009471        0.00011007        0.00014207        Run 3    17555.3
> > Run 4    2562463        0.00008959        0.00010367        0.00013055        Run 4    17892.3
> > Run 5    2716551        0.00008127        0.00010879        0.00013439        Run 5    17578.32
> > Average  2546223.2      0.000088694       0.000109814       0.000136438       Average  17371.696
> > Delta    -1.95%         2.82%             -1.15%            -2.91%            Delta    14.12%
> >
> >
> > So it looks like the GRO patches work quite well out of the box. It's
> > curious that tcp_rr transactions go down a bit, though. I don't have any
> > intuition around that.
>
> 14% is quite nice I'd say. Is this first table taken from the cpumap as
> well or just direct Rx?

Both cpumap. The only variable I changed was adding your patches.

Thanks,
Daniel
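
P.S. The Delta row is just each patched average divided by the corresponding
baseline average, minus one; a trivial standalone check using the averages
from the tables above:

#include <stdio.h>

/* Averages copied from the two tables above. */
static const double base[]    = { 2596936.0,  0.000086262, 0.000111094, 0.000140534, 15221.722 };
static const double patched[] = { 2546223.2,  0.000088694, 0.000109814, 0.000136438, 17371.696 };
static const char *name[]     = { "Transactions", "P50", "P90", "P99", "Throughput" };

int main(void)
{
	for (int i = 0; i < 5; i++)
		printf("%-12s delta: %+.2f%%\n", name[i],
		       (patched[i] / base[i] - 1.0) * 100.0);
	return 0;
}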