Re: [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()

Alexander Lobakin <aleksander.lobakin@xxxxxxxxx> · Wed, 21 Aug 2024 15:16:51 +0200

From: Daniel Xu <dxu@xxxxxxxxx>
Date: Tue, 20 Aug 2024 17:29:45 -0700

> Hi Olek,
> 
> On Mon, Aug 19, 2024 at 04:50:52PM GMT, Alexander Lobakin wrote:
> [..]
>>> Thanks A LOT for doing this benchmarking!
>>
>> I optimized the code a bit and picked my old patches for bulk NAPI skb
>> cache allocation and today I got 4.7 Mpps 🎉
>> IOW, the result of the series (7 patches in total, but 2 are not
>> networking-related) is 2.7 -> 4.7 Mpps == 75%!
>>
>> Daniel,
>>
>> if you want, you can pick my tree[0], either full or just up to
>>
>> "bpf: cpumap: switch to napi_skb_cache_get_bulk()"
>>
>> (13 patches total: 6 for netdev_feature_t and 7 for the cpumap)
>>
>> and test with your use cases. Would be nice to see some real world
>> results, not my synthetic tests :D
>>
>>> --Jesper
>>
>> [0]
>> https://github.com/alobakin/linux/compare/idpf-libie-new~52...idpf-libie-new/
> 
> So it turns out keeping the workload in place while I update and reboot
> the kernel is a Hard Problem. I'll put in some more effort and see if I
> can get one of the workloads to stay still, but it'll be a somewhat
> noisy test even if it works. So the following are synthetic tests
> (neper) but on a real prod setup as far as container networking and
> configuration is concerned.
> 
> I cherry-picked 586be610~1..ca22ac8e9de onto our 6.9-ish branch. Had to
> skip some of the flag refactors b/c of conflicts - I didn't know the
> code well enough to do fixups. So I had to apply this diff (FWIW not sure
> the struct_size() here was right anyhow):
> 
> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> index 089d19c62efe..359fbfaa43eb 100644
> --- a/kernel/bpf/cpumap.c
> +++ b/kernel/bpf/cpumap.c
> @@ -110,7 +110,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
>  	if (!cmap->cpu_map)
>  		goto free_cmap;
>  
> -	dev = bpf_map_area_alloc(struct_size(dev, priv, 0), NUMA_NO_NODE);
> +	dev = bpf_map_area_alloc(sizeof(*dev), NUMA_NO_NODE);

Hmm, with a count of 0, struct_size() evaluates to sizeof(*dev), so both
variants allocate the same amount of memory. Why do you need this?
Are you running these patches on some older kernel which doesn't have a
proper flex array at the end of &net_device?
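
To spell out why the sizes match, here is a minimal userspace sketch. It
is not the kernel's implementation: struct demo_dev, demo_size_add_mul()
and DEMO_STRUCT_SIZE() are hypothetical names that only model the
documented behaviour of struct_size(), i.e. sizeof(*p) plus count
trailing elements, saturating at SIZE_MAX on overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a struct that ends in a flexible array. */
struct demo_dev {
	unsigned int	priv_len;
	unsigned char	priv[];
};

/* base + count * elem, saturating at SIZE_MAX instead of wrapping. */
static size_t demo_size_add_mul(size_t base, size_t elem, size_t count)
{
	if (elem && count > (SIZE_MAX - base) / elem)
		return SIZE_MAX;
	return base + count * elem;
}

/* Simplified userspace model of struct_size(p, member, count). */
#define DEMO_STRUCT_SIZE(p, member, count) \
	demo_size_add_mul(sizeof(*(p)), sizeof((p)->member[0]), (count))

/* A zero count adds nothing on top of plain sizeof(). */
static void demo_check(void)
{
	struct demo_dev *dev = NULL;	/* only used inside sizeof() */

	assert(DEMO_STRUCT_SIZE(dev, priv, 0) == sizeof(*dev));
	assert(DEMO_STRUCT_SIZE(dev, priv, 16) == sizeof(*dev) + 16);
}
```

Calling demo_check() from any main() exercises both cases. The macro
only compiles when the struct actually has the named trailing member,
which is why a tree without that member would need sizeof(*dev) instead.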

>  	if (!dev)
>  		goto free_cpu_map;
>  
> 
> === Baseline ===
>          ./tcp_rr -c -H $SERVER -p 50,90,99 -T4 -F8 -l30                               ./tcp_stream -c -H $SERVER -T8 -F16 -l30
> 
>          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)               Throughput (Mbit/s)
> Run 1    2578189       0.00008831       0.00010623       0.00013439           Run 1    15427.22
> Run 2    2657923       0.00008575       0.00010239       0.00012927           Run 2    15272.12
> Run 3    2700402       0.00008447       0.00010111       0.00013183           Run 3    14871.35
> Run 4    2571739       0.00008575       0.00011519       0.00013823           Run 4    15344.72
> Run 5    2476427       0.00008703       0.00013055       0.00016895           Run 5    15193.2
> Average  2596936       0.000086262      0.000111094      0.000140534          Average  15221.722
> 
> === cpumap NAPI patches ===
>          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)               Throughput (Mbit/s)
> Run 1    2554598       0.00008703       0.00011263       0.00013055           Run 1    17090.29
> Run 2    2478905       0.00009087       0.00011391       0.00014463           Run 2    16742.27
> Run 3    2418599       0.00009471       0.00011007       0.00014207           Run 3    17555.3
> Run 4    2562463       0.00008959       0.00010367       0.00013055           Run 4    17892.3
> Run 5    2716551       0.00008127       0.00010879       0.00013439           Run 5    17578.32
> Average  2546223.2     0.000088694      0.000109814      0.000136438          Average  17371.696
> Delta    -1.95%        2.82%            -1.15%           -2.91%               Delta    14.12%
> 
> 
> So it looks like the GRO patches work quite well out of the box. It's
> curious that tcp_rr transactions go down a bit, though. I don't have any
> intuition around that.

14% is quite nice I'd say. Is the first (baseline) table also taken with
cpumap, or just with direct Rx?
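
For reference, the Delta row can be reproduced from the per-run numbers
with a small sketch like the one below. The helper names (demo_mean(),
demo_pct_delta(), demo_pct_spread()) are made up for illustration, and
the min-to-max spread is only a rough noise indicator I'm assuming is
useful here, not a proper statistical test.

```c
#include <assert.h>
#include <stddef.h>

/* Per-run values copied from the two tables above. */
static const double base_mbps[] = { 15427.22, 15272.12, 14871.35, 15344.72, 15193.2 };
static const double napi_mbps[] = { 17090.29, 16742.27, 17555.3, 17892.3, 17578.32 };
static const double base_tx[]   = { 2578189, 2657923, 2700402, 2571739, 2476427 };
static const double napi_tx[]   = { 2554598, 2478905, 2418599, 2562463, 2716551 };

/* Arithmetic mean of n samples. */
static double demo_mean(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Relative change of val against base, in percent. */
static double demo_pct_delta(double base, double val)
{
	return (val - base) / base * 100.0;
}

/* (max - min) / mean, in percent: a crude run-to-run noise estimate. */
static double demo_pct_spread(const double *v, size_t n)
{
	double lo = v[0], hi = v[0];

	for (size_t i = 1; i < n; i++) {
		if (v[i] < lo)
			lo = v[i];
		if (v[i] > hi)
			hi = v[i];
	}
	return (hi - lo) / demo_mean(v, n) * 100.0;
}

/* Checks the throughput and transaction deltas against the table. */
static void demo_check_deltas(void)
{
	double d_mbps = demo_pct_delta(demo_mean(base_mbps, 5), demo_mean(napi_mbps, 5));
	double d_tx = demo_pct_delta(demo_mean(base_tx, 5), demo_mean(napi_tx, 5));

	assert(d_mbps > 14.11 && d_mbps < 14.13);	/* table: 14.12% */
	assert(d_tx > -1.96 && d_tx < -1.94);		/* table: -1.95% */
}
```

With these inputs, demo_pct_spread(base_tx, 5) comes out at roughly
8.6%, so the -1.95% change in tcp_rr transactions may well be within
run-to-run noise, while the throughput gain is clearly above it.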

> 
> Lemme know if you wanna change some stuff and get a rerun.
> 
> Thanks,
> Daniel

Thanks,
Olek