On 10/9/24 11:12 AM, Jens Axboe wrote:
> On 10/9/24 10:53 AM, Jens Axboe wrote:
>> On 10/9/24 10:50 AM, Jens Axboe wrote:
>>> On 10/9/24 10:35 AM, David Ahern wrote:
>>>> On 10/9/24 9:43 AM, Jens Axboe wrote:
>>>>> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
>>>>> as the sender, but then you're capped on the non-zc sender being too
>>>>> slow. The intel box does better, but it's still basically maxing out the
>>>>> sender at this point. So yeah, with a faster (or more efficient sender),
>>>>
>>>> I am surprised by this comment. You should not see a Tx limited test
>>>> (including CPU bound sender). Tx with ZC has been the easy option for a
>>>> while now.
>>>
>>> I just set this up to test yesterday and just used default! I'm sure
>>> there is a zc option, just not the default and hence it wasn't used.
>>> I'll give it a spin, will be useful for 200G testing.
>>
>> I think we're talking past each other. Yes send with zerocopy is
>> available for a while now, both with io_uring and just sendmsg(), but
>> I'm using kperf for testing and it does not look like it supports it.
>> Might have to add it... We'll see how far I can get without it.
>
> Stanislav pointed me at:
>
> https://github.com/facebookexperimental/kperf/pull/2
>
> which adds zc send. I ran a quick test, and it does reduce cpu
> utilization on the sender from 100% to 95%. I'll keep poking...

Update on this - did more testing and the 100 -> 95 was a bit of a
fluke, it's still maxed. So I added io_uring send and sendzc support to
kperf, and I still saw the sendzc being maxed out sending at 100G rates
with 100% cpu usage.

Poked a bit, and the reason is that it's all memcpy() off
skb_orphan_frags_rx() -> skb_copy_ubufs(). At this point I asked Pavel
as that made no sense to me, and turns out the kernel thinks there's a
tap on the device. Maybe there is, haven't looked at that yet, but I
just killed the orphaning and tested again.

This looks better, now I can get 100G line rate from a single thread
using io_uring sendzc using only 30% of the single cpu/thread (including
irq time). That is good news, as it unlocks being able to test > 100G as
the sender is no longer the bottleneck.

Tap side still a mystery, but it unblocked testing. I'll figure that
part out separately.

-- 
Jens Axboe