On 1/17/25 3:42 PM, Pavel Begunkov wrote:
> On 1/17/25 14:28, Paolo Abeni wrote:
>> On 1/17/25 12:16 AM, David Wei wrote:
>>> This patchset adds support for zero copy rx into userspace pages using
>>> io_uring, eliminating a kernel-to-user copy.
>>>
>>> We configure a page pool that a driver uses to fill a hw rx queue to
>>> hand out user pages instead of kernel pages. Any data that ends up
>>> hitting this hw rx queue will thus be dma'd into userspace memory
>>> directly, without needing to be bounced through kernel memory. 'Reading'
>>> data out of a socket instead becomes a _notification_ mechanism, where
>>> the kernel tells userspace where the data is. The overall approach is
>>> similar to the devmem TCP proposal.
>>>
>>> This relies on hw header/data split, flow steering and RSS to ensure
>>> packet headers remain in kernel memory and only desired flows hit a hw
>>> rx queue configured for zero copy. Configuring this is outside of the
>>> scope of this patchset.
>>>
>>> We share netdev core infra with devmem TCP. The main difference is that
>>> io_uring is used for the uAPI and the lifetime of all objects is bound
>>> to an io_uring instance. Data is 'read' using a new io_uring request
>>> type. When done, data is returned via a new shared refill queue. A zero
>>> copy page pool refills a hw rx queue from this refill queue directly. Of
>>> course, the lifetime of these data buffers is managed by io_uring
>>> rather than the networking stack, with different refcounting rules.
>>>
>>> This patchset is the first step adding basic zero copy support. We will
>>> extend this iteratively with new features e.g. dynamically allocated
>>> zero copy areas, THP support, dmabuf support, improved copy fallback,
>>> general optimisations and more.
>>>
>>> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
>>> aren't included since Taehee Yoo has already sent a more comprehensive
>>> patchset adding support in [1]. Google gve should already support this,
>>> and Mellanox mlx5 support is WIP pending driver changes.
>>>
>>> ===========
>>> Performance
>>> ===========
>>>
>>> Note: Comparison with epoll + TCP_ZEROCOPY_RECEIVE isn't done yet.
>>>
>>> Test setup:
>>> * AMD EPYC 9454
>>> * Broadcom BCM957508 200G
>>> * Kernel v6.11 base [2]
>>> * liburing fork [3]
>>> * kperf fork [4]
>>> * 4K MTU
>>> * Single TCP flow
>>>
>>> With application thread + net rx softirq pinned to _different_ cores:
>>>
>>> +-------------------------------+
>>> | epoll     | io_uring          |
>>> |-----------|-------------------|
>>> | 82.2 Gbps | 116.2 Gbps (+41%) |
>>> +-------------------------------+
>>>
>>> Pinned to _same_ core:
>>>
>>> +-------------------------------+
>>> | epoll     | io_uring          |
>>> |-----------|-------------------|
>>> | 62.6 Gbps | 80.9 Gbps (+29%)  |
>>> +-------------------------------+
>>>
>>> =====
>>> Links
>>> =====
>>>
>>> Broadcom bnxt support:
>>> [1]: https://lore.kernel.org/netdev/20241003160620.1521626-8-ap420073@xxxxxxxxx/
>>>
>>> Linux kernel branch:
>>> [2]: https://github.com/spikeh/linux.git zcrx/v9
>>>
>>> liburing for testing:
>>> [3]: https://github.com/isilence/liburing.git zcrx/next
>>>
>>> kperf for testing:
>>> [4]: https://git.kernel.dk/kperf.git
>>
>> We are getting very close to the merge window.
>> In order to get this series merged before that deadline, the point
>> raised by Jakub on this version must be resolved, the next iteration
>> should land on the ML before the end of the current working day, and
>> the series must apply cleanly to net-next, so that it can be processed
>> by our CI.
>
> Sounds good, thanks Paolo.
>
> Since the merging is not trivial, I'll send a PR for the net/
> patches instead of reposting the entire thing, if that sounds right
> to you. The rest will be handled on the io_uring side.

I agree it is the more straightforward path.

@Jakub: do you see any problem with the above?

/P
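
For readers new to the series, the "read as notification" flow described in the quoted cover letter looks roughly like the sketch below on the userspace side. This is only an illustrative sketch, assuming the patched kernel headers from [2] and the liburing fork from [3]; the zcrx identifiers used here (IORING_OP_RECV_ZC, io_uring_zcrx_cqe, io_uring_zcrx_rqe, IORING_ZCRX_AREA_MASK) and the refill-queue handling are paraphrased from the posted patchset and may not match the final uAPI, and consume_payload() is a hypothetical application callback.

/*
 * Illustrative sketch only: assumes the zcrx uAPI definitions from the
 * patched headers in [2]/[3]. Names and field layouts are paraphrased
 * from the posted series and may differ from the final interface.
 */
#include <liburing.h>
#include <stdint.h>
#include <stddef.h>

/* Application-provided consumer of the received payload bytes. */
void consume_payload(const void *buf, size_t len);

struct zcrx_ctx {
	struct io_uring *ring;
	unsigned char *area;            /* user memory area registered for rx */
	struct io_uring_zcrx_rqe *rq;   /* shared refill queue, mmap'd from the kernel */
	unsigned rq_mask;
	unsigned rq_tail;               /* userspace-owned refill tail */
};

static int zcrx_recv_once(struct zcrx_ctx *ctx, int sockfd)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ctx->ring);
	struct io_uring_cqe *cqe;
	struct io_uring_zcrx_cqe *zcqe;
	struct io_uring_zcrx_rqe *rqe;
	uint64_t off;

	/*
	 * Queue a zero copy receive on the flow-steered socket. The
	 * registered interface queue is also selected via a dedicated sqe
	 * field in the patched headers; a single-ifq setup is assumed and
	 * that detail is omitted here.
	 */
	io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sockfd, NULL, 0, 0);
	io_uring_submit(ctx->ring);

	/*
	 * The completion carries no data; it tells us where in the
	 * registered area the NIC DMA'd the payload. The zcrx trailer sits
	 * in the big CQE's extra space, and cqe->res is the byte count.
	 */
	if (io_uring_wait_cqe(ctx->ring, &cqe))
		return -1;
	if (cqe->res < 0)
		return cqe->res;
	zcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
	off = zcqe->off & ~IORING_ZCRX_AREA_MASK;

	consume_payload(ctx->area + off, (size_t)cqe->res);

	/*
	 * Return the buffer through the shared refill queue so the zero
	 * copy page pool can hand it back to the hw rx queue.
	 */
	rqe = &ctx->rq[ctx->rq_tail & ctx->rq_mask];
	rqe->off = zcqe->off;
	rqe->len = cqe->res;
	ctx->rq_tail++;
	/* A real consumer must publish rq_tail to the shared ring with a
	 * release store before the kernel can reuse the entry. */

	io_uring_cqe_seen(ctx->ring, cqe);
	return 0;
}

The area registration, refill ring mmap and ifq setup that this sketch takes for granted are exercised end to end by the liburing fork in [3].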