On Thu, Oct 10, 2024 at 5:29 PM David Wei <dw@xxxxxxxxxxx> wrote:
>
> On 2024-10-09 09:55, Mina Almasry wrote:
> > On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@xxxxxxxxxxx> wrote:
> >>
> >> This patchset adds support for zero copy rx into userspace pages using
> >> io_uring, eliminating a kernel to user copy.
> >>
> >> We configure a page pool that a driver uses to fill a hw rx queue to
> >> hand out user pages instead of kernel pages. Any data that ends up
> >> hitting this hw rx queue will thus be dma'd into userspace memory
> >> directly, without needing to be bounced through kernel memory. 'Reading'
> >> data out of a socket instead becomes a _notification_ mechanism, where
> >> the kernel tells userspace where the data is. The overall approach is
> >> similar to the devmem TCP proposal.
> >>
> >> This relies on hw header/data split, flow steering and RSS to ensure
> >> packet headers remain in kernel memory and only desired flows hit a hw
> >> rx queue configured for zero copy. Configuring this is outside of the
> >> scope of this patchset.
> >>
> >> We share netdev core infra with devmem TCP. The main difference is that
> >> io_uring is used for the uAPI and the lifetimes of all objects are bound
> >> to an io_uring instance.
> >
> > I've been thinking about this a bit, and I hope this feedback isn't
> > too late, but I think your work may be useful for users not using
> > io_uring, i.e. zero copy to host memory that is not dependent on
> > page-aligned MSS sizing; in other words, AF_XDP zerocopy but using the
> > TCP stack.
> >
> > If we refactor things around a bit we should be able to have the
> > memory tied to the RX queue similar to what AF_XDP does, and then we
> > should be able to zero copy to the memory via regular sockets and via
> > io_uring. This will be useful for us and other applications that would
> > like to ZC similar to what you're doing here but not necessarily
> > through io_uring.
>
> Using io_uring and trying to move away from a socket-based interface is
> an explicit longer-term goal. I see your proposal of adding a
> traditional socket-based API as orthogonal to what we're trying to do.
> If someone is motivated enough to see this exist then they can build it
> themselves.
>

Yes, that was what I was suggesting. I (or whoever is interested) would
build it ourselves. I am just calling out that your bits to bind umem to
an rx-queue and/or the memory provider could be reused if they are
re-usable (or can be made re-usable). From a quick look it seems fine;
nothing is requested here from this series. Sorry I made it seem like I
was asking you to implement a sockets extension :-)

> >
> >> Data is 'read' using a new io_uring request
> >> type. When done, data is returned via a new shared refill queue. A zero
> >> copy page pool refills a hw rx queue from this refill queue directly. Of
> >> course, the lifetimes of these data buffers are managed by io_uring
> >> rather than the networking stack, with different refcounting rules.
> >>
> >> This patchset is the first step adding basic zero copy support. We will
> >> extend this iteratively with new features e.g. dynamically allocated
> >> zero copy areas, THP support, dmabuf support, improved copy fallback,
> >> general optimisations and more.
> >>
> >> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
> >> aren't included since Taehee Yoo has already sent a more comprehensive
> >> patchset adding support in [1]. Google gve should already support this,
> >
> > This is an aside, but GVE supports this via the out-of-tree patches
> > I've been carrying on github. Upstream we're working on adding the
> > prerequisite page_pool support.
> >
> >> and Mellanox mlx5 support is WIP pending driver changes.
> >>
> >> ===========
> >> Performance
> >> ===========
> >>
> >> Test setup:
> >> * AMD EPYC 9454
> >> * Broadcom BCM957508 200G
> >> * Kernel v6.11 base [2]
> >> * liburing fork [3]
> >> * kperf fork [4]
> >> * 4K MTU
> >> * Single TCP flow
> >>
> >> With application thread + net rx softirq pinned to _different_ cores:
> >>
> >> epoll
> >> 82.2 Gbps
> >>
> >> io_uring
> >> 116.2 Gbps (+41%)
> >>
> >> Pinned to _same_ core:
> >>
> >> epoll
> >> 62.6 Gbps
> >>
> >> io_uring
> >> 80.9 Gbps (+29%)
> >>
> >
> > Are the 'epoll' results here and the 'io_uring' results using TCP RX
> > zerocopy [1] and io_uring zerocopy, respectively?
> >
> > If not, I would like to see a comparison between TCP RX zerocopy and
> > this new io_uring zerocopy. At Google, for example, we use TCP RX
> > zerocopy, and I would like to see perf numbers possibly motivating us
> > to move to this new thing.
>
> No, it is comparing epoll without zero copy vs io_uring zero copy. Yes,
> that's a fair request. I will add epoll with TCP_ZEROCOPY_RECEIVE to
> kperf and compare.
>

Awesome to hear. For us, we do use TCP_ZEROCOPY_RECEIVE (with sockets),
so I'm unsure how much benefit we'll see if we use this. Comparing
against TCP_ZEROCOPY_RECEIVE will be more of an apples-to-apples
comparison and will also motivate folks using the old thing to switch
to the new thing.

--
Thanks,
Mina
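
For reference, the epoll + TCP_ZEROCOPY_RECEIVE path being proposed as the
comparison baseline can be sketched roughly as below, loosely following the
kernel's tools/testing/selftests/net/tcp_mmap.c selftest. zc_recv_loop(),
consume() and CHUNK are illustrative names only and are not part of this
patchset or of kperf.

/*
 * Minimal sketch of an epoll + TCP_ZEROCOPY_RECEIVE receive loop.
 * Real code needs full error handling and should drain recv_skip_hint
 * completely before relying on the next zerocopy call.
 */
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>          /* TCP_ZEROCOPY_RECEIVE */

#define CHUNK (1 << 20)         /* 1 MiB receive window, page aligned (illustrative) */

void consume(const void *data, size_t len);     /* application-defined sink */

static int zc_recv_loop(int conn_fd)
{
    /* Map a VMA onto the TCP socket; the kernel installs payload pages
     * here on each TCP_ZEROCOPY_RECEIVE call instead of copying them. */
    void *addr = mmap(NULL, CHUNK, PROT_READ, MAP_SHARED, conn_fd, 0);
    if (addr == MAP_FAILED)
        return -1;

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = conn_fd };
    if (ep < 0 || epoll_ctl(ep, EPOLL_CTL_ADD, conn_fd, &ev)) {
        munmap(addr, CHUNK);
        return -1;
    }

    for (;;) {
        struct tcp_zerocopy_receive zc;
        socklen_t zc_len = sizeof(zc);

        if (epoll_wait(ep, &ev, 1, -1) <= 0)
            break;

        memset(&zc, 0, sizeof(zc));
        zc.address = (__u64)(unsigned long)addr;
        zc.length = CHUNK;

        if (getsockopt(conn_fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE,
                       &zc, &zc_len))
            break;

        /* zc.length bytes of payload are now mapped at addr. */
        if (zc.length)
            consume(addr, zc.length);

        /* Bytes the kernel could not map (e.g. not page aligned) have to
         * be drained with an ordinary copying recv(). */
        if (zc.recv_skip_hint) {
            char buf[4096];
            size_t want = zc.recv_skip_hint < sizeof(buf) ?
                          zc.recv_skip_hint : sizeof(buf);
            ssize_t n = recv(conn_fd, buf, want, 0);

            if (n <= 0)
                break;
            consume(buf, (size_t)n);
        }

        if (!zc.length && !zc.recv_skip_hint)
            break;                      /* nothing left: peer closed */
    }

    close(ep);
    munmap(addr, CHUNK);
    return 0;
}

The main contrast with the io_uring approach above is that payload pages are
remapped into the process per getsockopt() call, which is why page-aligned
MSS sizing matters for how much data avoids the copy fallback.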