On 10/10/24 8:21 AM, Jens Axboe wrote:
> On 10/9/24 11:12 AM, Jens Axboe wrote:
>> On 10/9/24 10:53 AM, Jens Axboe wrote:
>>> On 10/9/24 10:50 AM, Jens Axboe wrote:
>>>> On 10/9/24 10:35 AM, David Ahern wrote:
>>>>> On 10/9/24 9:43 AM, Jens Axboe wrote:
>>>>>> Yep, basically line rate, I get 97-98Gbps. I originally used a slower
>>>>>> box as the sender, but then you're capped on the non-zc sender being
>>>>>> too slow. The intel box does better, but it's still basically maxing
>>>>>> out the sender at this point. So yeah, with a faster (or more
>>>>>> efficient) sender,
>>>>>
>>>>> I am surprised by this comment. You should not see a Tx limited test
>>>>> (including a CPU bound sender). Tx with ZC has been the easy option
>>>>> for a while now.
>>>>
>>>> I just set this up to test yesterday and just used the default! I'm
>>>> sure there is a zc option, it's just not the default and hence it
>>>> wasn't used. I'll give it a spin, it will be useful for 200G testing.
>>>
>>> I think we're talking past each other. Yes, send with zerocopy has been
>>> available for a while now, both with io_uring and plain sendmsg(), but
>>> I'm using kperf for testing and it does not look like it supports it.
>>> Might have to add it... We'll see how far I can get without it.
>>
>> Stanislav pointed me at:
>>
>> https://github.com/facebookexperimental/kperf/pull/2
>>
>> which adds zc send. I ran a quick test, and it does reduce cpu
>> utilization on the sender from 100% to 95%. I'll keep poking...
>
> Update on this - did more testing and the 100 -> 95 was a bit of a
> fluke, it's still maxed. So I added io_uring send and sendzc support to
> kperf, and I still saw sendzc maxed out sending at 100G rates with 100%
> cpu usage.
>
> Poked a bit, and the reason is that it's all memcpy() off
> skb_orphan_frags_rx() -> skb_copy_ubufs(). At this point I asked Pavel,
> as that made no sense to me, and it turns out the kernel thinks there's
> a tap on the device. Maybe there is, haven't looked at that yet, but I
> just killed the orphaning and tested again.
>
> This looks better, now I can get 100G line rate from a single thread
> with io_uring sendzc, using only 30% of a single cpu/thread (including
> irq time). That is good news, as it unlocks testing at > 100G, since
> the sender is no longer the bottleneck.
>
> Tap side still a mystery, but it unblocked testing. I'll figure that
> part out separately.

Further update - the above mystery was dhclient, thanks a lot to David
for figuring that out very quickly.

But the more interesting update - I got both links up on the receiving
side, providing 200G of bandwidth. I re-ran the test, with proper zero
copy running on the sending side, and io_uring zcrx on the receiver.

The receiver is two threads, BUT they target the same queue on the two
nics. Both receiver threads are bound to the same core (453 in this
case). In other words, a single cpu thread is running both rx threads
in their entirety, napi included. Basic thread usage from top:

10816 root      20   0  396640 393224      0 R  49.0   0.0   0:01.77 server
10818 root      20   0  396640 389128      0 R  49.0   0.0   0:01.76 server

and I get 98.4Gbps and 98.6Gbps on the receiver side, which is basically
the combined link bw again. So 200G is not enough to saturate a single
cpu thread.

--
Jens Axboe
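
For anyone wanting to reproduce the sendzc side, a minimal sketch of an
io_uring zero-copy send via liburing's io_uring_prep_send_zc() looks
roughly like the below. This is illustrative only, not the kperf code
referenced above (the helper name is made up); it assumes liburing 2.3+,
a 6.0+ kernel with IORING_OP_SEND_ZC, and a connected TCP socket set up
elsewhere:

#include <errno.h>
#include <liburing.h>
#include <stdio.h>
#include <string.h>

/* Send one buffer with zero-copy and reap both completions for it. */
static int send_zc_once(struct io_uring *ring, int sockfd,
			const void *buf, size_t len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int ret, more;

	if (!sqe)
		return -EBUSY;

	io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
	io_uring_submit(ring);

	/*
	 * A zero-copy send posts two CQEs: the send result first (with
	 * IORING_CQE_F_MORE set), then a notification (IORING_CQE_F_NOTIF)
	 * once the kernel no longer references the buffer and it may be
	 * reused or freed.
	 */
	for (;;) {
		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret < 0)
			return ret;

		if (cqe->flags & IORING_CQE_F_NOTIF) {
			/* Buffer is no longer referenced by the kernel. */
			io_uring_cqe_seen(ring, cqe);
			return 0;
		}

		/* Send result; F_MORE means a notification CQE follows. */
		ret = cqe->res;
		more = cqe->flags & IORING_CQE_F_MORE;
		io_uring_cqe_seen(ring, cqe);
		if (ret < 0)
			fprintf(stderr, "send_zc: %s\n", strerror(-ret));
		if (!more)
			return ret < 0 ? ret : 0;
	}
}

The ring itself would be set up with io_uring_queue_init(); a real
sender would keep many sends in flight rather than waiting on each one
as this sketch does.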
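
On the receiver setup, pinning both rx threads to core 453 is ordinary
affinity pinning, e.g. something along these lines (a generic sketch,
not kperf's code; the helper name is made up):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/*
 * Pin a thread to a single CPU, e.g. pin_thread(t1, 453) and
 * pin_thread(t2, 453) to run both rx threads on core 453.
 * Returns 0 on success, an errno value on failure.
 */
static int pin_thread(pthread_t thread, int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return pthread_setaffinity_np(thread, sizeof(set), &set);
}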