On 2024-10-09 09:55, Mina Almasry wrote:
> On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@xxxxxxxxxxx> wrote:
>>
>> This patchset adds support for zero copy rx into userspace pages using
>> io_uring, eliminating a kernel to user copy.
>>
>> We configure a page pool that a driver uses to fill a hw rx queue to
>> hand out user pages instead of kernel pages. Any data that ends up
>> hitting this hw rx queue will thus be dma'd into userspace memory
>> directly, without needing to be bounced through kernel memory. 'Reading'
>> data out of a socket instead becomes a _notification_ mechanism, where
>> the kernel tells userspace where the data is. The overall approach is
>> similar to the devmem TCP proposal.
>>
>> This relies on hw header/data split, flow steering and RSS to ensure
>> packet headers remain in kernel memory and only desired flows hit a hw
>> rx queue configured for zero copy. Configuring this is outside of the
>> scope of this patchset.
>>
>> We share netdev core infra with devmem TCP. The main difference is that
>> io_uring is used for the uAPI and the lifetime of all objects are bound
>> to an io_uring instance.
>
> I've been thinking about this a bit, and I hope this feedback isn't
> too late, but I think your work may be useful for users not using
> io_uring. I.e. zero copy to host memory that is not dependent on page
> aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.
>
> If we refactor things around a bit we should be able to have the
> memory tied to the RX queue similar to what AF_XDP does, and then we
> should be able to zero copy to the memory via regular sockets and via
> io_uring. This will be useful for us and other applications that would
> like to ZC similar to what you're doing here but not necessarily
> through io_uring.

Using io_uring and trying to move away from a socket based interface is
an explicit longer term goal. I see your proposal of adding a
traditional socket based API as orthogonal to what we're trying to do.
If someone is motivated enough to see this exist then they can build it
themselves.

>
>> Data is 'read' using a new io_uring request
>> type. When done, data is returned via a new shared refill queue. A zero
>> copy page pool refills a hw rx queue from this refill queue directly. Of
>> course, the lifetime of these data buffers are managed by io_uring
>> rather than the networking stack, with different refcounting rules.
>>
>> This patchset is the first step adding basic zero copy support. We will
>> extend this iteratively with new features e.g. dynamically allocated
>> zero copy areas, THP support, dmabuf support, improved copy fallback,
>> general optimisations and more.
>>
>> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
>> aren't included since Taehee Yoo has already sent a more comprehensive
>> patchset adding support in [1]. Google gve should already support this,
>
> This is an aside, but GVE supports this via the out-of-tree patches
> I've been carrying on github. Upstream we're working on adding the
> prerequisite page_pool support.
>
>> and Mellanox mlx5 support is WIP pending driver changes.
>>
>> ===========
>> Performance
>> ===========
>>
>> Test setup:
>> * AMD EPYC 9454
>> * Broadcom BCM957508 200G
>> * Kernel v6.11 base [2]
>> * liburing fork [3]
>> * kperf fork [4]
>> * 4K MTU
>> * Single TCP flow
>>
>> With application thread + net rx softirq pinned to _different_ cores:
>>
>> epoll
>> 82.2 Gbps
>>
>> io_uring
>> 116.2 Gbps (+41%)
>>
>> Pinned to _same_ core:
>>
>> epoll
>> 62.6 Gbps
>>
>> io_uring
>> 80.9 Gbps (+29%)
>>
>
> Is the 'epoll' results here and the 'io_uring' using TCP RX zerocopy
> [1] and io_uring zerocopy respectively?
>
> If not, I would like to see a comparison between TCP RX zerocopy and
> this new io-uring zerocopy. For Google for example we use the TCP RX
> zerocopy, I would like to see perf numbers possibly motivating us to
> move to this new thing.

No, it is comparing epoll without zero copy vs io_uring zero copy.

Yes, that's a fair request. I will add epoll with TCP_ZEROCOPY_RECEIVE
to kperf and compare (a rough sketch of the receive path I have in mind
is appended at the end of this mail).

>
> [1] https://lwn.net/Articles/752046/
>
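To make the 'notification + refill queue' model described in the cover
letter quoted above a little more concrete, here is a purely conceptual
sketch of the userspace side of the buffer life cycle. Every name and
structure in it is invented for illustration; it is not the patchset's
actual io_uring uAPI (which is still in flux). The only point is the
flow: a completion tells the application where the payload landed inside
a registered area, and the finished buffer is handed back through a
shared refill ring that the zero copy page pool consumes to refill the
hw rx queue.

/*
 * Conceptual sketch only. All names below are invented for
 * illustration and are NOT the patchset's actual uAPI; they only
 * model the buffer life cycle described in the cover letter.
 */
#include <stddef.h>
#include <stdint.h>

/* Registered chunk of user memory that rx payload is DMA'd into. */
struct zc_area {
	void   *base;		/* mmap'd and registered by the application */
	size_t	size;
};

/* What a completion conceptually carries: where the payload landed. */
struct zc_completion {
	uint64_t off;		/* payload offset within the area */
	uint32_t len;		/* bytes received */
	uint32_t buf_id;	/* token used to return the buffer */
};

/* Shared refill ring: userspace is the producer, the kernel's zero
 * copy page pool is the consumer. */
struct zc_refill_ring {
	uint32_t *tail;		/* written by userspace */
	uint32_t  mask;
	uint32_t *entries;	/* buffer tokens handed back to the kernel */
};

void consume(const void *payload, size_t len);	/* application-defined */

/* 'Reading' is just looking at memory the NIC already wrote into;
 * returning the buffer is what lets the page pool reuse it. */
static void process(struct zc_area *area, const struct zc_completion *cqe,
		    struct zc_refill_ring *ring)
{
	const uint8_t *payload = (const uint8_t *)area->base + cqe->off;

	consume(payload, cqe->len);

	uint32_t tail = *ring->tail;

	ring->entries[tail & ring->mask] = cqe->buf_id;
	/* A real implementation needs a release barrier here so the
	 * kernel observes the entry before the tail update. */
	*ring->tail = tail + 1;
}

This is also where the different refcounting rules mentioned in the
cover letter live: a buffer is owned by the application from the moment
the completion is posted until it is put back on the refill ring, rather
than having its lifetime managed by the networking stack.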
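For the comparison above, this is roughly the epoll + TCP_ZEROCOPY_RECEIVE
receive path I have in mind, following the pattern of the kernel's
tcp_mmap selftest: mmap() a window over the socket, then map received
pages into it with getsockopt(TCP_ZEROCOPY_RECEIVE) instead of copying.
It is only a sketch of the mechanism, not the actual kperf patch; error
handling and the recv() copy fallback for bytes that cannot be mapped
are omitted.

/* Rough sketch of an epoll + TCP_ZEROCOPY_RECEIVE loop. Not the kperf
 * patch; error handling and the copy fallback are trimmed.
 */
#include <linux/tcp.h>		/* struct tcp_zerocopy_receive */
#include <netinet/in.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/mman.h>
#include <sys/socket.h>

#define ZC_CHUNK (1 << 20)	/* bytes we are willing to map per call */

/* zc_addr is a PROT_READ, MAP_SHARED mmap() of the TCP socket fd. */
static void zc_recv_once(int fd, void *zc_addr)
{
	struct tcp_zerocopy_receive zc;
	socklen_t zc_len = sizeof(zc);

	memset(&zc, 0, sizeof(zc));
	zc.address = (__u64)(unsigned long)zc_addr;
	zc.length = ZC_CHUNK;

	/* Map received pages into zc_addr instead of copying them out. */
	if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len))
		return;

	/* ... consume zc.length bytes at zc_addr ... */

	/* Hand the pages back before the next mapping. */
	if (zc.length)
		madvise(zc_addr, zc.length, MADV_DONTNEED);

	/* zc.recv_skip_hint bytes could not be mapped (e.g. not page
	 * aligned) and must be drained with a normal recv() copy. */
}

static void rx_loop(int epfd, void *zc_addr)
{
	struct epoll_event ev[64];

	for (;;) {
		int n = epoll_wait(epfd, ev, 64, -1);

		for (int i = 0; i < n; i++)
			zc_recv_once(ev[i].data.fd, zc_addr);
	}
}

The setup not shown is creating zc_addr by mmap()ing the socket and
adding the socket to the epoll set; numbers from the actual kperf run
will follow once that lands.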