> Hi Jesper,
> 
> On Mon, Nov 25, 2024 at 07:50:41PM GMT, Jesper Dangaard Brouer wrote:
> > 
> > 
> > On 25/11/2024 16.12, Alexander Lobakin wrote:
> > > From: Daniel Xu <dxu@xxxxxxxxx>
> > > Date: Fri, 22 Nov 2024 17:10:06 -0700
> > > 
> > > > Hi Olek,
> > > > 
> > > > Here are the results.
> > > > 
> > > > On Wed, Nov 13, 2024 at 03:39:13PM GMT, Daniel Xu wrote:
> > > > > 
> > > > > 
> > > > > On Tue, Nov 12, 2024, at 9:43 AM, Alexander Lobakin wrote:
> > > 
> > > [...]
> > > 
> > > > Baseline (again)
> > > > 
> > > >          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)  Throughput (Mbit/s)
> > > > Run 1    3169917       0.00007295       0.00007871       0.00009343       21749.43
> > > > Run 2    3228290       0.00007103       0.00007679       0.00009215       21897.17
> > > > Run 3    3226746       0.00007231       0.00007871       0.00009087       21906.82
> > > > Run 4    3191258       0.00007231       0.00007743       0.00009087       21155.15
> > > > Run 5    3235653       0.00007231       0.00007743       0.00008703       21397.06
> > > > Average  3210372.8     0.000072182      0.000077814      0.00009087       21621.126
> > 
> > We need to talk about what we are measuring, and how to control the
> > experiment setup to get reproducible results.
> > Especially controlling what CPU cores our code paths are executing on.
> > 
> > In the "baseline" case above, we have two processes/tasks executing:
> >  (1) RX-napi softirq/thread (until napi_gro_receive delivers to socket)
> >  (2) Userspace netserver process TCP receiving data from socket.
> 
> "baseline" in this case is still cpumap, just without these GRO patches.
> 
> > My experience is that you will see two noticeably different
> > throughput performance results depending on whether (1) and (2) are
> > executing on the *same* CPU (multi-tasking context-switching),
> > or executing in parallel (e.g. pinned) on two different CPU cores.
> > 
> > The netperf command has an option
> > 
> >   -T lcpu,remcpu
> >      Request that netperf be bound to local CPU lcpu and/or netserver be
> >      bound to remote CPU rcpu.
> > 
> > Verify the setting by listing the pinning like this:
> > 
> >   for PID in $(pidof netserver); do taskset -pc $PID ; done
> > 
> > You can also set the pinning at runtime like this:
> > 
> >   export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU $PID; done
> > 
> > For troubleshooting, I like to use the periodic 1 sec (netperf -D1)
> > output and adjust the pinning at runtime to observe the effect quickly.
> > 
> > My experience is unfortunately that TCP results have a lot of variation
> > (thanks for including 5 runs in your benchmarks), as they depend on task
> > timing, which can be affected by CPU sleep states. The system's CPU
> > latency setting can be seen in /dev/cpu_dma_latency, which can be read
> > like this:
> > 
> >   sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency
> > 
> > For playing with changing /dev/cpu_dma_latency I choose to use tuned-adm,
> > as it requires holding the file open. E.g. I play with these profiles:
> > 
> >   sudo tuned-adm profile throughput-performance
> >   sudo tuned-adm profile latency-performance
> >   sudo tuned-adm profile network-latency
> 
> Appreciate the tips - I should keep this saved somewhere.
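
For reference, the pinning and latency-control commands above combine into a
run along these lines; the CPU numbers, the 192.0.2.1 address, and the TCP_RR
test type are placeholders, not the configuration behind the results below:

  # Receiver side: start netserver and keep cpu_dma_latency low while testing.
  netserver
  sudo tuned-adm profile network-latency
  sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency

  # Sender side: bind netperf to local CPU 4 and netserver to remote CPU 2,
  # with periodic 1 sec output (-D1) so repinning effects show up quickly.
  netperf -H 192.0.2.1 -T 4,2 -D1 -t TCP_RR

  # Receiver side: verify (or adjust) the netserver pinning at runtime.
  for PID in $(pidof netserver); do taskset -pc $PID; done
  export CPU=2; for PID in $(pidof netserver); do sudo taskset -pc $CPU $PID; done
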
> 
> > > > cpumap v2 Olek
> > > > 
> > > >          Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)  Throughput (Mbit/s)
> > > > Run 1    3253651       0.00007167       0.00007807       0.00009343       13497.57
> > > > Run 2    3221492       0.00007231       0.00007743       0.00009087       12115.53
> > > > Run 3    3296453       0.00007039       0.00007807       0.00009087       12323.38
> > > > Run 4    3254460       0.00007167       0.00007807       0.00009087       12901.88
> > > > Run 5    3173327       0.00007295       0.00007871       0.00009215       12593.22
> > > > Average  3239876.6     0.000071798      0.00007807       0.000091638      12686.316
> > > > Delta    0.92%         -0.53%           0.33%            0.85%            -41.32%
> > 
> > We now have three processes/tasks executing:
> >  (1) RX-napi softirq/thread (doing XDP_REDIRECT into cpumap)
> >  (2) CPUmap kthread (until gro_receive_skb/gro_flush delivers to socket)
> >  (3) Userspace netserver process TCP receiving data from socket.
> > 
> > Again, the performance is going to depend on which CPU cores the
> > processes/tasks are running on and whether some are sharing the
> > same CPU. (There are both wakeup timing and cache-line effects.)
> > 
> > There are now more combinations to test...
> > 
> > CPUmap is a CPU scaling facility, and you will likely also see different
> > CPU utilization on the different cores once you start to pin these to
> > control the scenarios.
> 
> > > > It's very interesting that we see -40% tput w/ the patches. I went back
> > 
> > Sad that we see -40% throughput... but do we know what CPU cores the
> > now three different tasks/processes run on(?)
> 
> Roughly, yes. For context, my primary use case for cpumap is to provide
> some degree of isolation between colocated containers on a single host.
> In particular, colocation occurs on AMD Bergamo, and containers are
> CPU-pinned to their own CCX (roughly). My RX steering program ensures
> RX packets destined to a specific container are cpumap-redirected to any
> of the container's pinned CPUs. It not only provides a good measure of
> isolation but ensures resources are properly accounted for.
> 
> So to answer your question of which CPUs the 3 things run on: the cpumap
> kthread and the application run on the same set of cores. More than that,
> they share the same L3 cache by design. irq/softirq is effectively
> random given the default RSS config and IRQ affinities.
> 
> > > Oh no, I messed up something =\
> > > 
> > > Could you please also test not the whole series, but patches 1-3 (up to
> > > "bpf:cpumap: switch to GRO...") and 1-4 (up to "bpf: cpumap: reuse skb
> > > array...")? Would be great to see whether this implementation works
> > > worse right from the start or I just broke something later on.
> > > 
> > > > and double checked and it seems the numbers are right. Here's
> > > > some output from some profiles I took with:
> > > > 
> > > >   perf record -e cycles:k -a -- sleep 10
> > > >   perf --no-pager diff perf.data.baseline perf.data.withpatches
> ...
> > > > 
> > > > # Event 'cycles:k'
> > > > # Baseline  Delta Abs  Shared Object      Symbol
> > > >      6.13%     -3.60%  [kernel.kallsyms]  [k] _copy_to_iter
> > 
> > I really appreciate that you provide perf data and perf diff, but as
> > described above, we need data and information on what CPU cores are
> > running which workload.
> > 
> > Fortunately perf diff (and perf report) supports doing it like this:
> > 
> >   perf diff --sort=cpu,symbol
> > 
> > But then you also need to control the CPUs used in the experiment for the
> > diff to work.
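
A sketch of how the two recordings and the per-CPU diff could fit together,
assuming the cpumap kthread and netserver have been pinned to CPUs 2 and 4
(placeholder CPUs; record the same CPUs for both kernels so the diff lines up):

  # Record kernel cycles only on the pinned CPUs, once per kernel under test.
  perf record -C 2,4 -e cycles:k -o perf.data.baseline    -- sleep 10
  perf record -C 2,4 -e cycles:k -o perf.data.withpatches -- sleep 10

  # Compare the two profiles, broken down per CPU and symbol.
  perf --no-pager diff --sort=cpu,symbol perf.data.baseline perf.data.withpatches
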
> > 
> > I hope I made sense, as these kinds of CPU scaling benchmarks are tricky.
> 
> Indeed, sounds quite tricky.
> 
> My understanding of GRO is that it's a powerful general-purpose
> optimization. Enough that it should rise above the usual noise on a
> reasonably configured system (which mine is).
> 
> Maybe we can consider decoupling the cpumap GRO enablement from the
> later optimizations?

I agree. First, we need to identify the best approach to enable GRO on
cpumap (between Olek's approach and what I have suggested) and then we
can evaluate subsequent optimizations.

@Olek: do you agree?

Regards,
Lorenzo

> 
> So in Olek's above series, patches 1-3 seem like they would still
> benefit from a simpler testbed. But the more targeted optimizations in
> patch 4+ would probably justify a de-noised setup. Possibly a single host
> with xdp-trafficgen or something.
> 
> Procedurally speaking, maybe it would save some wasted effort if
> everyone agreed on the general approach before investing more time into
> finer optimizations built on top of the basic GRO support?
> 
> Thanks,
> Daniel