On Wed, Aug 21, 2024 at 03:16:51PM GMT, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@xxxxxxxxx>
> Date: Tue, 20 Aug 2024 17:29:45 -0700
>
> > Hi Olek,
> >
> > On Mon, Aug 19, 2024 at 04:50:52PM GMT, Alexander Lobakin wrote:
> > [..]
> >>> Thanks A LOT for doing this benchmarking!
> >>
> >> I optimized the code a bit and picked my old patches for bulk NAPI skb
> >> cache allocation and today I got 4.7 Mpps 🎉
> >> IOW, the result of the series (7 patches in total, but 2 are not
> >> networking-related) is 2.7 -> 4.7 Mpps == 75%!
> >>
> >> Daniel,
> >>
> >> if you want, you can pick my tree[0], either full or just up to
> >>
> >> "bpf: cpumap: switch to napi_skb_cache_get_bulk()"
> >>
> >> (13 patches total: 6 for netdev_feature_t and 7 for the cpumap)
> >>
> >> and test with your use cases. Would be nice to see some real-world
> >> results, not my synthetic tests :D
> >>
> >>> --Jesper
> >>
> >> [0]
> >> https://github.com/alobakin/linux/compare/idpf-libie-new~52...idpf-libie-new/
> >
> > So it turns out keeping the workload in place while I update and reboot
> > the kernel is a Hard Problem. I'll put in some more effort and see if I
> > can get one of the workloads to stay still, but it'll be a somewhat
> > noisy test even if it works. So the following are synthetic tests
> > (neper) but on a real prod setup as far as container networking and
> > configuration is concerned.
> >
> > I cherry-picked 586be610~1..ca22ac8e9de onto our 6.9-ish branch. Had to
> > skip some of the flag refactors b/c of conflicts - I didn't know the
> > code well enough to do fixups. So I had to apply this diff (FWIW, not sure
> > the struct_size() here was right anyhow):
> >
> > diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> > index 089d19c62efe..359fbfaa43eb 100644
> > --- a/kernel/bpf/cpumap.c
> > +++ b/kernel/bpf/cpumap.c
> > @@ -110,7 +110,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
> >  	if (!cmap->cpu_map)
> >  		goto free_cmap;
> >
> > -	dev = bpf_map_area_alloc(struct_size(dev, priv, 0), NUMA_NO_NODE);
> > +	dev = bpf_map_area_alloc(sizeof(*dev), NUMA_NO_NODE);
>
> Hmm, it will allocate the same amount of memory. Why do you need this?
> Are you running these patches on some older kernel which doesn't have a
> proper flex array at the end of &net_device?

Ah, my mistake, you're right. I probably looked at the 6.9 source without
the flex array and confused it with net-next. But yeah, the 6.9 kernel I
tested with does not have the flex array.
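
Here's a quick userspace sketch of why the two calls end up requesting the
same number of bytes once net_device does have the trailing flex array.
Note that fake_net_device and my_struct_size() are made-up stand-ins for
the real net_device and the kernel's struct_size() (the real macro also
saturates on overflow, which this sketch skips):

#include <stdio.h>

/*
 * Made-up stand-in for net_device with a trailing flexible array member
 * for the per-device private area; not the real kernel struct.
 */
struct fake_net_device {
	unsigned long	state;
	int		ifindex;
	long		priv[];
};

/*
 * Rough open-coded equivalent of the kernel's struct_size(p, member, count);
 * the real macro additionally does saturating overflow checks.
 */
#define my_struct_size(p, member, count) \
	(sizeof(*(p)) + (count) * sizeof(*(p)->member))

int main(void)
{
	struct fake_net_device *dev = NULL;	/* sizeof() never dereferences it */

	/* With a count of 0 the flex-array part contributes nothing... */
	printf("struct_size(dev, priv, 0): %zu\n", my_struct_size(dev, priv, 0));
	/* ...so it degenerates to plain sizeof(*dev). */
	printf("sizeof(*dev):              %zu\n", sizeof(*dev));

	return 0;
}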

> >  	if (!dev)
> >  		goto free_cpu_map;
> >
> >
> > === Baseline ===
> > ./tcp_rr -c -H $SERVER -p 50,90,99 -T4 -F8 -l30                               ./tcp_stream -c -H $SERVER -T8 -F16 -l30
> >
> >          Transactions   Latency P50 (s)   Latency P90 (s)   Latency P99 (s)            Throughput (Mbit/s)
> > Run 1    2578189        0.00008831        0.00010623        0.00013439        Run 1    15427.22
> > Run 2    2657923        0.00008575        0.00010239        0.00012927        Run 2    15272.12
> > Run 3    2700402        0.00008447        0.00010111        0.00013183        Run 3    14871.35
> > Run 4    2571739        0.00008575        0.00011519        0.00013823        Run 4    15344.72
> > Run 5    2476427        0.00008703        0.00013055        0.00016895        Run 5    15193.2
> > Average  2596936        0.000086262       0.000111094       0.000140534       Average  15221.722
> >
> > === cpumap NAPI patches ===
> >          Transactions   Latency P50 (s)   Latency P90 (s)   Latency P99 (s)            Throughput (Mbit/s)
> > Run 1    2554598        0.00008703        0.00011263        0.00013055        Run 1    17090.29
> > Run 2    2478905        0.00009087        0.00011391        0.00014463        Run 2    16742.27
> > Run 3    2418599        0.00009471        0.00011007        0.00014207        Run 3    17555.3
> > Run 4    2562463        0.00008959        0.00010367        0.00013055        Run 4    17892.3
> > Run 5    2716551        0.00008127        0.00010879        0.00013439        Run 5    17578.32
> > Average  2546223.2      0.000088694       0.000109814       0.000136438       Average  17371.696
> > Delta    -1.95%         2.82%             -1.15%            -2.91%            Delta    14.12%
> >
> >
> > So it looks like the GRO patches work quite well out of the box. It's
> > curious that tcp_rr transactions go down a bit, though. I don't have any
> > intuition around that.
>
> 14% is quite nice I'd say. Is this first table taken from the cpumap as
> well or just direct Rx?

Both cpumap. The only variable I changed was adding your patches.

Thanks,
Daniel
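
P.S. The Delta row is just each patched average divided by the corresponding
baseline average, minus one; a trivial standalone check using the averages
from the tables above:

#include <stdio.h>

/* Averages copied from the two tables above. */
static const double base[]    = { 2596936.0,  0.000086262, 0.000111094, 0.000140534, 15221.722 };
static const double patched[] = { 2546223.2,  0.000088694, 0.000109814, 0.000136438, 17371.696 };
static const char *name[]     = { "Transactions", "P50", "P90", "P99", "Throughput" };

int main(void)
{
	for (int i = 0; i < 5; i++)
		printf("%-12s delta: %+.2f%%\n", name[i],
		       (patched[i] / base[i] - 1.0) * 100.0);
	return 0;
}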