On 8/22/21 9:32 PM, Yunsheng Lin wrote:
>
> I assumed the "either Rx or Tx is cpu bound" meant either Rx or Tx is
> the bottleneck?

yes.

>
> It seems iperf3 supports Tx ZC. I retested using iperf3; the Rx settings
> were not changed when testing, MTU is 1500:

-Z == sendfile API. That works fine to a point, and that point is well
below 100G. I mean TCP with MSG_ZEROCOPY and SO_ZEROCOPY (a minimal
sketch of that send path is appended at the end of this mail).

>
> IOMMU in strict mode:
> 1. Tx ZC case:
>    22Gbit with Tx being the bottleneck (cpu bound)
> 2. Tx non-ZC case with pfrag pool enabled:
>    40Gbit with Rx being the bottleneck (cpu bound)
> 3. Tx non-ZC case with pfrag pool disabled:
>    30Gbit; the bottleneck does not seem to be cpu bound, as neither Rx
>    nor Tx has a single CPU reaching ~100% usage.
>
>>
>> At 1500 MTU lowering CPU usage on the Tx side does not accomplish much
>> on throughput since the Rx is 100% cpu.
>
> As the performance data above shows, enabling ZC does not seem to help
> when the IOMMU is involved: there is about a 30% performance
> degradation when the pfrag pool is disabled and a 50% degradation when
> the pfrag pool is enabled.

In a past response you showed numbers for the Tx ZC API with a custom
program. That program showed the dramatic reduction in CPU cycles for Tx
with the ZC API.

>
>>
>> At 3300 MTU you have ~47% of the pps for the same throughput. Lower
>> pps reduces Rx processing and lowers the CPU needed to process the
>> incoming stream. Then, using the Tx ZC API, you lower the Tx overhead,
>> allowing a single stream to go faster - sending more data, which in
>> the end results in much higher pps and throughput. At the limit you
>> are CPU bound (both ends in my testing, as the Rx side approaches the
>> max pps and the Tx side continually tries to send data).
>>
>> Lowering CPU usage on the Tx side is a win regardless of whether there
>> is a big increase in throughput at 1500 MTU, since that configuration
>> is an Rx CPU bound problem. Hence my point that we have a good
>> starting point for lowering CPU usage on the Tx side; we should
>> improve it rather than add per-socket page pools.
>
> Actually it is not a per-socket page pool; the page pool is still per
> NAPI. This patchset adds a multi allocation context to the page pool,
> so that Tx can reuse the same page pool as Rx, which is quite useful if
> ARFS is enabled.
>
>>
>> You can stress the Tx side and emphasize its overhead by modifying the
>> receiver to drop the data on Rx rather than copy it to userspace,
>> which is a huge bottleneck (e.g., MSG_TRUNC on recv). This allows the
>> single flow
>
> As frag pages are now supported in the page pool for Rx, Rx is probably
> not the bottleneck any more, at least not with the IOMMU in strict
> mode.
>
> It seems iperf3 does not support MSG_TRUNC yet; is there any testing
> tool supporting MSG_TRUNC, or do I have to hack the kernel or the
> iperf3 tool to do that?

https://github.com/dsahern/iperf, mods branch

--zc_api is the Tx ZC API; --rx_drop adds MSG_TRUNC to recv (a sketch of
that receive loop is appended below as well).

>
>> stream to go faster and emphasize Tx bottlenecks as the pps at 3300
>> approaches the top pps at 1500. e.g., doing this with iperf3 shows the
>> spinlock overhead with tcp_sendmsg, overhead related to 'select' and
>> then gup_pgd_range.
>
> When the IOMMU is in strict mode, the IOMMU overhead seems to be much
> bigger than the spinlock overhead (23% vs 10%).
>
> Anyway, I still think ZC mostly benefits packets bigger than a certain
> size, and the case where the IOMMU is disabled.
>
>
>> .
>>
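
For reference, this is roughly what I mean by the TCP Tx ZC path - a
minimal, untested sketch, not the code from my iperf branch. Error
handling and the batching a real tool does are elided, and the fallback
defines are only needed with older userspace headers:

#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY	60
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY	0x4000000
#endif

/* send one buffer with MSG_ZEROCOPY and wait for its completion */
static int send_zc(int fd, const void *buf, size_t len)
{
	int one = 1;

	if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
		return -1;

	if (send(fd, buf, len, MSG_ZEROCOPY) < 0)
		return -1;

	/* pages stay pinned until the kernel posts a completion on the
	 * socket error queue; a real tool would poll() rather than spin
	 */
	for (;;) {
		char control[128];
		struct msghdr msg = {
			.msg_control = control,
			.msg_controllen = sizeof(control),
		};
		struct sock_extended_err *serr;
		struct cmsghdr *cm;

		if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0) {
			if (errno == EAGAIN)
				continue;
			return -1;
		}

		cm = CMSG_FIRSTHDR(&msg);
		if (!cm)
			continue;

		serr = (struct sock_extended_err *)CMSG_DATA(cm);
		if (serr->ee_errno == 0 &&
		    serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
			return 0;	/* range [ee_info, ee_data] done */
	}
}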
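
And the receive-side change --rx_drop makes is small; this sketch shows
the idea (it is not the iperf code itself): on Linux TCP, passing
MSG_TRUNC to recv() discards the queued bytes instead of copying them to
userspace, which takes the copy cost out of the Rx path:

#include <sys/types.h>
#include <sys/socket.h>

/* drain the socket without copying payload to userspace: with
 * MSG_TRUNC, TCP discards the bytes and only returns how many were
 * dropped, so the scratch buffer is never written to
 */
static long drain_and_drop(int fd)
{
	static char scratch[1 << 16];
	long total = 0;
	ssize_t n;

	while ((n = recv(fd, scratch, sizeof(scratch), MSG_TRUNC)) > 0)
		total += n;

	return n < 0 ? -1 : total;	/* 0 means the peer closed */
}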