On Fri, Dec 6, 2024, at 7:06 AM, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@xxxxxxxxx>
> Date: Thu, 5 Dec 2024 17:41:27 -0700
>
>> On Thu, Dec 05, 2024 at 12:06:29PM GMT, Alexander Lobakin wrote:
>>> From: Alexander Lobakin <aleksander.lobakin@xxxxxxxxx>
>>> Date: Thu, 5 Dec 2024 11:38:11 +0100
>>>
>>>> From: Daniel Xu <dxu@xxxxxxxxx>
>>>> Date: Wed, 04 Dec 2024 13:51:08 -0800
>>>>
>>>>> On Wed, Dec 4, 2024, at 8:42 AM, Alexander Lobakin wrote:
>>>>>> From: Jakub Kicinski <kuba@xxxxxxxxxx>
>>>>>> Date: Tue, 3 Dec 2024 16:51:57 -0800
>>>>>>
>>>>>>> On Tue, 3 Dec 2024 12:01:16 +0100 Alexander Lobakin wrote:
>>>>>>>>>> @ Jakub,
>>>>>>>>>
>>>>>>>>> Context? What doesn't work and why?
>>>>>>>>
>>>>>>>> My tests show the same perf as on Lorenzo's series, but I test with a UDP
>>>>>>>> trafficgen. Daniel tests TCP, and his results are much worse than with
>>>>>>>> Lorenzo's implementation.
>>>>>>>> I suspect this is related to how NAPI performs flushes / decides whether
>>>>>>>> to repoll again or exit vs. how the kthread does it (even though I also
>>>>>>>> try to flush only every 64 frames or when the ring is empty). Or maybe to
>>>>>>>> the fact that part of the kthread's work happens in process context
>>>>>>>> outside any softirq, while with NAPI the whole loop runs inside the RX
>>>>>>>> softirq.
>>>>>>>>
>>>>>>>> Jesper said that he'd like to see cpumap keep using its own kthread, so
>>>>>>>> that its priority can be boosted separately from the backlog. That's why
>>>>>>>> we asked you whether it would be fine to have cpumap as threaded NAPI in
>>>>>>>> regards to all this :D
>>>>>>>
>>>>>>> Certainly not without a clear understanding of what the problem with
>>>>>>> a kthread is.
>>>>>>
>>>>>> Yes, sure thing.
>>>>>>
>>>>>> The bad thing is that I can't reproduce Daniel's problem >_< Previously, I
>>>>>> was testing with the UDP trafficgen and got up to 80% improvement over
>>>>>> the baseline. Now I tested TCP and got up to 70% improvement, no
>>>>>> regressions whatsoever =\
>>>>>>
>>>>>> I don't know where this regression on Daniel's setup comes from. Is it
>>>>>> a multi-thread or single-thread test?
>>>>>
>>>>> 8 threads with 16 flows over them (-T8 -F16)
>>>>>
>>>>>> What app do you use: iperf, netperf,
>>>>>> neper, Microsoft's app (forgot the name)?
>>>>>
>>>>> neper, tcp_stream.
>>>>
>>>> Let me recheck with neper -T8 -F16, I'll post my results soon.
>>>
>>> kernel   direct T1   direct T8F16   cpumap   cpumap T8F16
>>> clean    28          51             13       9              Gbps
>>> GRO      28          51             26       18             Gbps
>>>
>>> 100% gain, no regressions =\
>>>
>>> My XDP prog is simple (upstream xdp-tools repo with no changes):
>>>
>>> numactl -N 0 xdp-tools/xdp-bench/xdp-bench redirect-cpu -c 23 -s -p
>>> no-touch ens802f0np0
>>>
>>> IOW it simply redirects everything to CPU 23 (same NUMA node) from any
>>> Rx queue without looking into the headers or packet contents.
>>> Do you test with a more sophisticated XDP prog?
>>
>> Great reminder... my prog is a bit more sophisticated. I forgot we were
>> doing latency tracking by inserting a timestamp into the frame metadata,
>> but not clearing it after it was read on the remote CPU, which disables
>> GRO. So the previous test was paying the penalty of fixed GRO overhead
>> without getting any packet merges.
>>
>> Once I fixed up the prog to reset the metadata pointer, I could see the
>> wins. Went from 21621.126 Mbps -> 25546.47 Mbps for a ~18% win in tput.
>> No latency changes.
>>
>> Sorry about the churn.
>
> No problem, crap happens sometimes :)
>
> Let me send my implementation on Monday-Wednesday. I'll include my UDP
> and TCP test results, as well as yours (+18%).
>
> BTW, it would be great if you could give me a Tested-by tag, as I assume
> the tests were fine and it works for you?

Yep, worked great for me.

Tested-by: Daniel Xu <dxu@xxxxxxxxx>
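For anyone following along, below is a minimal sketch of the metadata pattern
discussed above. This is not the actual prog from my setup: the map layout,
the fixed CPU index, section names, and the latency-recording placeholder are
all assumptions for illustration. The idea is to stamp each frame with a
timestamp in XDP metadata before redirecting to a cpumap, then shrink the
metadata back to zero in the cpumap program once it has been read, so the
per-frame (and therefore differing) metadata no longer prevents GRO from
coalescing the resulting skbs.

/* SPDX-License-Identifier: GPL-2.0 */
/* Sketch only: stamp frames with a timestamp in XDP metadata, redirect to a
 * cpumap, then clear the metadata on the remote CPU after consuming it.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");	/* hypothetical map; qsize/prog fd set by the loader */

SEC("xdp")
int xdp_stamp_and_redirect(struct xdp_md *ctx)
{
	__u64 *ts;

	/* Reserve 8 bytes of metadata in front of the packet data. */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*ts)))
		return XDP_PASS;

	ts = (void *)(long)ctx->data_meta;
	if ((void *)(ts + 1) > (void *)(long)ctx->data)
		return XDP_PASS;

	*ts = bpf_ktime_get_ns();

	/* Index 0 here; the real setup would pick the target CPU. */
	return bpf_redirect_map(&cpu_map, 0, 0);
}

SEC("xdp/cpumap")
int xdp_cpumap_consume(struct xdp_md *ctx)
{
	__u64 *ts = (void *)(long)ctx->data_meta;

	if ((void *)(ts + 1) <= (void *)(long)ctx->data) {
		/* ... record latency based on *ts here ... */

		/* The fix from this thread: shrink the metadata back to zero
		 * length once it has been consumed.
		 */
		bpf_xdp_adjust_meta(ctx, (int)sizeof(*ts));
	}

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";

The reason the reset matters is that GRO compares skb metadata between packets
(skb_metadata_differs() in gro_list_prepare()), and per-packet timestamps never
match, so nothing gets merged unless the metadata is cleared first.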