Re: [PATCH RFC 0/7] add socket to netdev page frag recycling support

David Ahern <dsahern@xxxxxxxxx> · Mon, 23 Aug 2021 20:34:11 -0700

On 8/22/21 9:32 PM, Yunsheng Lin wrote:
> 
> I assumed the "either Rx or Tx is cpu bound" meant either Rx or Tx is the
> bottleneck?

yes.

> 
> It seems iperf3 support the Tx ZC, I retested using the iperf3, Rx settings
> is not changed when testing, MTU is 1500:

-Z == sendfile API. That works fine to a point and that point is well
below 100G.

I mean TCP with MSG_ZEROCOPY and SO_ZEROCOPY.

> 
> IOMMU in strict mode:
> 1. Tx ZC case:
>    22Gbit with Tx being bottleneck(cpu bound)
> 2. Tx non-ZC case with pfrag pool enabled:
>    40Git with Rx being bottleneck(cpu bound)
> 3. Tx non-ZC case with pfrag pool disabled:
>    30Git, the bottleneck seems not to be cpu bound, as the Rx and Tx does
>    not have a single CPU reaching about 100% usage.
> 
>>
>> At 1500 MTU lowering CPU usage on the Tx side does not accomplish much
>> on throughput since the Rx is 100% cpu.
> 
> As above performance data, enabling ZC does not seems to help when IOMMU
> is involved, which has about 30% performance degrade when pfrag pool is
> disabled and 50% performance degrade when pfrag pool is enabled.

In a past response you should numbers for Tx ZC API with a custom
program. That program showed the dramatic reduction in CPU cycles for Tx
with the ZC API.

> 
>>
>> At 3300 MTU you have ~47% the pps for the same throughput. Lower pps
>> reduces Rx processing and lower CPU to process the incoming stream. Then
>> using the Tx ZC API you lower the Tx overehad allowing a single stream
>> to faster - sending more data which in the end results in much higher
>> pps and throughput. At the limit you are CPU bound (both ends in my
>> testing as Rx side approaches the max pps, and Tx side as it continually
>> tries to send data).
>>
>> Lowering CPU usage on Tx the side is a win regardless of whether there
>> is a big increase on the throughput at 1500 MTU since that configuration
>> is an Rx CPU bound problem. Hence, my point that we have a good start
>> point for lowering CPU usage on the Tx side; we should improve it rather
>> than add per-socket page pools.
> 
> Acctually it is not a per-socket page pools, the page pool is still per
> NAPI, this patchset adds multi allocation context to the page pool, so that
> the tx can reuse the same page pool with rx, which is quite usefully if the
> ARFS is enabled.
> 
>>
>> You can stress the Tx side and emphasize its overhead by modifying the
>> receiver to drop the data on Rx rather than copy to userspace which is a
>> huge bottleneck (e.g., MSG_TRUNC on recv). This allows the single flow
> 
> As the frag page is supported in page pool for Rx, the Rx probably is not
> a bottleneck any more, at least not for IOMMU in strict mode.
> 
> It seems iperf3 does not support MSG_TRUNC yet, any testing tool supporting
> MSG_TRUNC? Or do I have to hack the kernel or iperf3 tool to do that?

https://github.com/dsahern/iperf, mods branch

--zc_api is the Tx ZC API; --rx_drop adds MSG_TRUNC to recv.

> 
>> stream to go faster and emphasize Tx bottlenecks as the pps at 3300
>> approaches the top pps at 1500. e.g., doing this with iperf3 shows the
>> spinlock overhead with tcp_sendmsg, overhead related to 'select' and
>> then gup_pgd_range.
> 
> When IOMMU is in strict mode, the overhead with IOMMU seems to be much
> bigger than spinlock(23% to 10%).
> 
> Anyway, I still think ZC mostly benefit to packet which is bigger than a
> specific size and IOMMU disabling case.
> 
> 
>> .
>>