From: Daniel Xu <dxu@xxxxxxxxx>
Date: Thu, 5 Dec 2024 17:41:27 -0700

> On Thu, Dec 05, 2024 at 12:06:29PM GMT, Alexander Lobakin wrote:
>> From: Alexander Lobakin <aleksander.lobakin@xxxxxxxxx>
>> Date: Thu, 5 Dec 2024 11:38:11 +0100
>>
>>> From: Daniel Xu <dxu@xxxxxxxxx>
>>> Date: Wed, 04 Dec 2024 13:51:08 -0800
>>>
>>>> On Wed, Dec 4, 2024, at 8:42 AM, Alexander Lobakin wrote:
>>>>> From: Jakub Kicinski <kuba@xxxxxxxxxx>
>>>>> Date: Tue, 3 Dec 2024 16:51:57 -0800
>>>>>
>>>>>> On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
>>>>>>>>> @ Jakub,
>>>>>>>>
>>>>>>>> Context? What doesn't work and why?
>>>>>>>
>>>>>>> My tests show the same perf as on Lorenzo's series, but I test with UDP
>>>>>>> trafficgen. Daniel tests TCP and the results are much worse than with
>>>>>>> Lorenzo's implementation.
>>>>>>> I suspect this is related to how NAPI performs flushes / decides
>>>>>>> whether to repoll again or exit, vs how the kthread does that (even
>>>>>>> though I also try to flush only every 64 frames or when the ring is
>>>>>>> empty). Or maybe to the fact that part of the kthread work happens in
>>>>>>> process context outside any softirq, while with NAPI the whole loop
>>>>>>> runs inside the RX softirq.
>>>>>>>
>>>>>>> Jesper said that he'd like to see cpumap still using its own kthread,
>>>>>>> so that its priority can be boosted separately from the backlog.
>>>>>>> That's why we asked you whether it would be fine to have cpumap as
>>>>>>> threaded NAPI in regards to all this :D
>>>>>>
>>>>>> Certainly not without a clear understanding of what the problem with
>>>>>> a kthread is.
>>>>>
>>>>> Yes, sure thing.
>>>>>
>>>>> The bad thing is that I can't reproduce Daniel's problem >_< Previously,
>>>>> I was testing with the UDP trafficgen and got up to 80% improvement over
>>>>> the baseline. Now I tested TCP and got up to 70% improvement, no
>>>>> regressions whatsoever =\
>>>>>
>>>>> I don't know where this regression on Daniel's setup comes from. Is it
>>>>> a multi-thread or single-thread test?
>>>>
>>>> 8 threads with 16 flows over them (-T8 -F16)
>>>>
>>>>> What app do you use: iperf, netperf,
>>>>> neper, Microsoft's app (forgot the name)?
>>>>
>>>> neper, tcp_stream.
>>>
>>> Let me recheck with neper -T8 -F16, I'll post my results soon.
>>
>> kernel   direct T1   direct T8F16   cpumap   cpumap T8F16   (Gbps)
>> clean    28          51             13        9
>> GRO      28          51             26       18
>>
>> 100% gain, no regressions =\
>>
>> My XDP prog is simple (upstream xdp-tools repo with no changes):
>>
>> numactl -N 0 xdp-tools/xdp-bench/xdp-bench redirect-cpu -c 23 -s -p no-touch ens802f0np0
>>
>> IOW it simply redirects everything to CPU 23 (same NUMA node) from any
>> Rx queue without looking into headers or packet contents.
>> Do you test with a more sophisticated XDP prog?
>
> Great reminder... my prog is a bit more sophisticated. I forgot we were
> doing latency tracking by inserting a timestamp into the frame metadata,
> but not clearing it after it was read on the remote CPU, which disables
> GRO. So the previous test was paying the fixed GRO overhead without
> getting any packet merges.
>
> Once I fixed up the prog to reset the metadata pointer, I could see the
> wins: throughput went from 21621.126 Mbps to 25546.47 Mbps, a ~18% gain.
> No latency changes.
>
> Sorry about the churn.

No problem, crap happens sometimes :)

Let me send my implementation on Monday-Wednesday. I'll include my UDP
and TCP test results, as well as yours (+18%).

BTW, it would be great if you could give me a Tested-by tag, as I assume
the tests were fine and it works for you?

Thanks,
Olek
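
P.S. For anyone following along, the metadata pattern Daniel describes
above looks roughly like the sketch below. This is only an illustration
of the bpf_xdp_adjust_meta() dance, not Daniel's actual program: the map
name, the struct meta layout, the fixed CPU index and the latency
bookkeeping are all made up here. The point is the second
bpf_xdp_adjust_meta() call on the cpumap side, which releases the
metadata again so the resulting skbs aren't rejected by GRO's
skb_metadata_differs() check (per-packet timestamps always differ).

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical per-frame metadata carrying an Rx timestamp. */
struct meta {
	__u64 rx_tstamp;
};

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

SEC("xdp")
int xdp_rx_stamp(struct xdp_md *ctx)
{
	struct meta *m;

	/* Reserve room in front of the packet for the metadata. */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*m)))
		return XDP_PASS;

	m = (void *)(long)ctx->data_meta;
	if ((void *)(m + 1) > (void *)(long)ctx->data)
		return XDP_PASS;

	m->rx_tstamp = bpf_ktime_get_ns();

	/* Redirect to the remote CPU, CPU 23 as in the xdp-bench run. */
	return bpf_redirect_map(&cpu_map, 23, 0);
}

SEC("xdp/cpumap")
int xdp_cpumap_consume(struct xdp_md *ctx)
{
	struct meta *m = (void *)(long)ctx->data_meta;
	__u64 latency;

	if ((void *)(m + 1) <= (void *)(long)ctx->data) {
		latency = bpf_ktime_get_ns() - m->rx_tstamp;
		(void)latency;	/* ...record the latency somewhere... */

		/* The fix: shrink the metadata area back to zero so the
		 * skb is built without metadata and GRO can coalesce.
		 */
		bpf_xdp_adjust_meta(ctx, sizeof(*m));
	}

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";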