Re: [PATCH v1 00/15] io_uring zero copy rx

Jens Axboe <axboe@xxxxxxxxx> · Thu, 10 Oct 2024 08:21:15 -0600

On 10/9/24 11:12 AM, Jens Axboe wrote:
> On 10/9/24 10:53 AM, Jens Axboe wrote:
>> On 10/9/24 10:50 AM, Jens Axboe wrote:
>>> On 10/9/24 10:35 AM, David Ahern wrote:
>>>> On 10/9/24 9:43 AM, Jens Axboe wrote:
>>>>> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
>>>>> as the sender, but then you're capped on the non-zc sender being too
>>>>> slow. The intel box does better, but it's still basically maxing out the
>>>>> sender at this point. So yeah, with a faster (or more efficient) sender,
>>>>
>>>> I am surprised by this comment. You should not see a Tx limited test
>>>> (including CPU bound sender). Tx with ZC has been the easy option for a
>>>> while now.
>>>
>>> I just set this up to test yesterday and just used default! I'm sure
>>> there is a zc option, just not the default and hence it wasn't used.
>>> I'll give it a spin, will be useful for 200G testing.
>>
>> I think we're talking past each other. Yes send with zerocopy is
>> available for a while now, both with io_uring and just sendmsg(), but
>> I'm using kperf for testing and it does not look like it supports it.
>> Might have to add it... We'll see how far I can get without it.
> 
> Stanislav pointed me at:
> 
> https://github.com/facebookexperimental/kperf/pull/2
> 
> which adds zc send. I ran a quick test, and it does reduce cpu
> utilization on the sender from 100% to 95%. I'll keep poking...

Update on this - did more testing and the 100 -> 95 was a bit of a
fluke, it's still maxed. So I added io_uring send and sendzc support to
kperf, and even with sendzc the sender was still maxed out at 100% cpu
usage when sending at 100G rates.
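
For reference, the plain-socket zerocopy send path mentioned earlier in
the thread (MSG_ZEROCOPY with send()/sendmsg()) can be sketched roughly
as below. This is only an illustration, not the kperf change and not the
io_uring sendzc path. The helper names zc_enable(), zc_send() and
zc_reap() are made up for the example, and the fallback #defines assume
the asm-generic UAPI values.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/errqueue.h>

/* Fallbacks for older headers; values assumed from the asm-generic UAPI. */
#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000
#endif
#ifndef SO_EE_ORIGIN_ZEROCOPY
#define SO_EE_ORIGIN_ZEROCOPY 5
#endif
#ifndef SO_EE_CODE_ZEROCOPY_COPIED
#define SO_EE_CODE_ZEROCOPY_COPIED 1
#endif

/* Hypothetical helper: opt the socket in to zerocopy.
 * Returns 1 if enabled, 0 if this socket type/kernel refuses it. */
int zc_enable(int fd)
{
	int one = 1;

	return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) == 0;
}

/* Hypothetical helper: send one buffer, with MSG_ZEROCOPY if zc != 0.
 * With zerocopy, buf must stay untouched until the completion is reaped. */
ssize_t zc_send(int fd, const void *buf, size_t len, int zc)
{
	assert(buf != NULL);
	return send(fd, buf, len, zc ? MSG_ZEROCOPY : 0);
}

/* Hypothetical helper: reap one completion from the socket error queue.
 * Returns 0 if the data really went out zerocopy, 1 if the kernel
 * reports that it fell back to copying, -1 if nothing usable is queued. */
int zc_reap(int fd)
{
	union {
		char buf[128];
		struct cmsghdr align;
	} control;
	struct msghdr msg = {
		.msg_control = control.buf,
		.msg_controllen = sizeof(control.buf),
	};
	struct cmsghdr *cm;

	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		return -1;

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		struct sock_extended_err serr;

		if (!((cm->cmsg_level == SOL_IP && cm->cmsg_type == IP_RECVERR) ||
		      (cm->cmsg_level == SOL_IPV6 && cm->cmsg_type == IPV6_RECVERR)))
			continue;
		memcpy(&serr, CMSG_DATA(cm), sizeof(serr));
		if (serr.ee_errno == 0 && serr.ee_origin == SO_EE_ORIGIN_ZEROCOPY)
			return (serr.ee_code & SO_EE_CODE_ZEROCOPY_COPIED) ? 1 : 0;
	}
	return -1;
}
```

A caller would run zc_enable() once per socket, pass its result as the
zc argument of zc_send(), and poll zc_reap() before reusing a buffer.
The "copied" return from zc_reap() is the case where the kernel fell
back to copying the data instead of sending it zerocopy.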

Poked a bit, and the reason is that the time is all memcpy() off
skb_orphan_frags_rx() -> skb_copy_ubufs(). That made no sense to me, so
I asked Pavel, and it turns out the kernel thinks there's a tap on the
device. Maybe there is, haven't looked at that yet, but I just killed
the orphaning and tested again.

This looks better, now I can get 100G line rate from a single thread
with io_uring sendzc, using only 30% of a single cpu/thread (including
irq time). That is good news, as it unlocks being able to test > 100G
now that the sender is no longer the bottleneck.
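
As a rough back-of-the-envelope for the headroom this gives, here is a
small sketch that projects how far one sender thread could go. It
assumes CPU cost scales linearly with throughput, which is only an
approximation (the NIC, memory bandwidth or irq handling may well give
out first), and projected_max_gbps() is a name made up for the example.

```c
#include <assert.h>

/* Hypothetical helper: projected rate one sender thread could drive,
 * ASSUMING cpu usage grows linearly with throughput.
 * measured_gbps: rate observed; cpu_fraction: share of one cpu used (0..1]. */
double projected_max_gbps(double measured_gbps, double cpu_fraction)
{
	assert(cpu_fraction > 0.0 && cpu_fraction <= 1.0);
	return measured_gbps / cpu_fraction;
}
```

With the numbers above, projected_max_gbps(100.0, 0.30) comes out at
roughly 333, so on paper a single sender thread has room for well over
100G before the cpu becomes the limit again.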

Tap side still a mystery, but it unblocked testing. I'll figure that
part out separately.

-- 
Jens Axboe