On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@xxxxxxxxxxx> wrote:
>
> This patchset adds support for zero copy rx into userspace pages using
> io_uring, eliminating a kernel to user copy.
>
> We configure a page pool that a driver uses to fill a hw rx queue to
> hand out user pages instead of kernel pages. Any data that ends up
> hitting this hw rx queue will thus be dma'd into userspace memory
> directly, without needing to be bounced through kernel memory. 'Reading'
> data out of a socket instead becomes a _notification_ mechanism, where
> the kernel tells userspace where the data is. The overall approach is
> similar to the devmem TCP proposal.
>
> This relies on hw header/data split, flow steering and RSS to ensure
> packet headers remain in kernel memory and only desired flows hit a hw
> rx queue configured for zero copy. Configuring this is outside of the
> scope of this patchset.
>
> We share netdev core infra with devmem TCP. The main difference is that
> io_uring is used for the uAPI and the lifetimes of all objects are bound
> to an io_uring instance.

I've been thinking about this a bit, and I hope this feedback isn't too
late, but I think your work may be useful for users not using io_uring,
i.e. zero copy to host memory that is not dependent on page-aligned MSS
sizing; in effect AF_XDP zero copy, but using the TCP stack.

If we refactor things a bit, we should be able to tie the memory to the
RX queue, similar to what AF_XDP does, and then zero copy into that
memory via regular sockets as well as via io_uring. This would be
useful for us and for other applications that would like to do zero
copy the way you're doing it here, but not necessarily through
io_uring.

> Data is 'read' using a new io_uring request
> type. When done, data is returned via a new shared refill queue. A zero
> copy page pool refills a hw rx queue from this refill queue directly. Of
> course, the lifetimes of these data buffers are managed by io_uring
> rather than the networking stack, with different refcounting rules.
>
> This patchset is the first step adding basic zero copy support. We will
> extend this iteratively with new features e.g. dynamically allocated
> zero copy areas, THP support, dmabuf support, improved copy fallback,
> general optimisations and more.
>
> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
> aren't included since Taehee Yoo has already sent a more comprehensive
> patchset adding support in [1]. Google gve should already support this,

This is an aside, but GVE supports this via the out-of-tree patches
I've been carrying on GitHub. Upstream we're working on adding the
prerequisite page_pool support.

> and Mellanox mlx5 support is WIP pending driver changes.
>
> ===========
> Performance
> ===========
>
> Test setup:
> * AMD EPYC 9454
> * Broadcom BCM957508 200G
> * Kernel v6.11 base [2]
> * liburing fork [3]
> * kperf fork [4]
> * 4K MTU
> * Single TCP flow
>
> With application thread + net rx softirq pinned to _different_ cores:
>
> epoll
> 82.2 Gbps
>
> io_uring
> 116.2 Gbps (+41%)
>
> Pinned to _same_ core:
>
> epoll
> 62.6 Gbps
>
> io_uring
> 80.9 Gbps (+29%)
>

Are the 'epoll' results here using TCP RX zerocopy [1], and the
'io_uring' results using io_uring zerocopy, respectively?

If not, I would like to see a comparison between TCP RX zerocopy and
this new io_uring zerocopy. At Google, for example, we use TCP RX
zerocopy; I would like to see perf numbers that might motivate us to
move to this new approach.

[1] https://lwn.net/Articles/752046/

--
Thanks,
Mina
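P.S. For concreteness, the "TCP RX zerocopy" I'm referring to above is
the TCP_ZEROCOPY_RECEIVE getsockopt() interface described in [1]. Below
is a minimal sketch of how a receiver uses it; this is only
illustrative (helper names are mine, error handling and draining of the
non-page-aligned remainder via recv() are omitted), not the exact code
we run:

    #include <string.h>
    #include <sys/mman.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <netinet/in.h>
    #include <linux/tcp.h>   /* struct tcp_zerocopy_receive */

    /* Create the receive mapping over the socket fd itself, once, up front. */
    static void *tcp_zc_map(int fd, size_t map_len)
    {
        return mmap(NULL, map_len, PROT_READ, MAP_SHARED, fd, 0);
    }

    /*
     * Map up to 'len' bytes of fd's receive queue at 'addr' without
     * copying. 'addr' and 'len' must be page aligned; any bytes that
     * cannot be remapped are reported back in recv_skip_hint and must
     * be drained with a regular recv() before the next call.
     */
    static ssize_t tcp_zc_receive(int fd, void *addr, size_t len)
    {
        struct tcp_zerocopy_receive zc;
        socklen_t zc_len = sizeof(zc);

        memset(&zc, 0, sizeof(zc));
        zc.address = (__u64)(unsigned long)addr;
        zc.length  = len;

        if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE,
                       &zc, &zc_len))
            return -1;

        /* zc.length bytes of payload are now mapped read-only at
         * 'addr'; zc.recv_skip_hint bytes still need a copying recv(). */
        return zc.length;
    }

Note the page-alignment requirement: payload only gets remapped when it
lands page aligned, which is exactly the "page aligned MSS sizing"
dependency I mentioned that a refill-queue approach like yours avoids.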