On Thu, Oct 10, 2024 at 5:29 PM David Wei <dw@xxxxxxxxxxx> wrote:
>
> On 2024-10-09 09:55, Mina Almasry wrote:
> > On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@xxxxxxxxxxx> wrote:
> >>
> >> This patchset adds support for zero copy rx into userspace pages using
> >> io_uring, eliminating a kernel to user copy.
> >>
> >> We configure a page pool that a driver uses to fill a hw rx queue to
> >> hand out user pages instead of kernel pages. Any data that ends up
> >> hitting this hw rx queue will thus be dma'd into userspace memory
> >> directly, without needing to be bounced through kernel memory. 'Reading'
> >> data out of a socket instead becomes a _notification_ mechanism, where
> >> the kernel tells userspace where the data is. The overall approach is
> >> similar to the devmem TCP proposal.
> >>
> >> This relies on hw header/data split, flow steering and RSS to ensure
> >> packet headers remain in kernel memory and only desired flows hit a hw
> >> rx queue configured for zero copy. Configuring this is outside of the
> >> scope of this patchset.
> >>
> >> We share netdev core infra with devmem TCP. The main difference is that
> >> io_uring is used for the uAPI and the lifetimes of all objects are bound
> >> to an io_uring instance.
> >
> > I've been thinking about this a bit, and I hope this feedback isn't
> > too late, but I think your work may be useful for users not using
> > io_uring, i.e. zero copy to host memory that is not dependent on
> > page-aligned MSS sizing; in other words, AF_XDP zerocopy but using the
> > TCP stack.
> >
> > If we refactor things around a bit we should be able to have the
> > memory tied to the RX queue similar to what AF_XDP does, and then we
> > should be able to zero copy to the memory via regular sockets and via
> > io_uring. This will be useful for us and other applications that would
> > like to ZC similar to what you're doing here but not necessarily
> > through io_uring.
>
> Using io_uring and trying to move away from a socket-based interface is
> an explicit longer-term goal. I see your proposal of adding a
> traditional socket-based API as orthogonal to what we're trying to do.
> If someone is motivated enough to see this exist then they can build it
> themselves.
>

Yes, that was what I was suggesting. I (or whoever is interested) would
build it ourselves. I am just calling out that your bits to bind umem to
an rx-queue and/or the memory provider could be reused if they are
re-usable (or can be made re-usable). From a quick look it seems fine;
nothing is requested here from this series. Sorry I made it seem like I
was asking you to implement a sockets extension :-)

> >
> >> Data is 'read' using a new io_uring request
> >> type. When done, data is returned via a new shared refill queue. A zero
> >> copy page pool refills a hw rx queue from this refill queue directly. Of
> >> course, the lifetimes of these data buffers are managed by io_uring
> >> rather than the networking stack, with different refcounting rules.
> >>
> >> This patchset is the first step adding basic zero copy support. We will
> >> extend this iteratively with new features e.g. dynamically allocated
> >> zero copy areas, THP support, dmabuf support, improved copy fallback,
> >> general optimisations and more.
> >>
> >> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
> >> aren't included since Taehee Yoo has already sent a more comprehensive
> >> patchset adding support in [1]. Google gve should already support this,
> >
> > This is an aside, but GVE supports this via the out-of-tree patches
> > I've been carrying on github. Upstream we're working on adding the
> > prerequisite page_pool support.
> >
> >> and Mellanox mlx5 support is WIP pending driver changes.
> >>
> >> ===========
> >> Performance
> >> ===========
> >>
> >> Test setup:
> >> * AMD EPYC 9454
> >> * Broadcom BCM957508 200G
> >> * Kernel v6.11 base [2]
> >> * liburing fork [3]
> >> * kperf fork [4]
> >> * 4K MTU
> >> * Single TCP flow
> >>
> >> With application thread + net rx softirq pinned to _different_ cores:
> >>
> >> epoll
> >> 82.2 Gbps
> >>
> >> io_uring
> >> 116.2 Gbps (+41%)
> >>
> >> Pinned to _same_ core:
> >>
> >> epoll
> >> 62.6 Gbps
> >>
> >> io_uring
> >> 80.9 Gbps (+29%)
> >>
> >
> > Are the 'epoll' results here and the 'io_uring' results using TCP RX
> > zerocopy [1] and io_uring zerocopy, respectively?
> >
> > If not, I would like to see a comparison between TCP RX zerocopy and
> > this new io_uring zerocopy. At Google, for example, we use TCP RX
> > zerocopy, and I would like to see perf numbers possibly motivating us
> > to move to this new thing.
>
> No, it is comparing epoll without zero copy vs io_uring zero copy. Yes,
> that's a fair request. I will add epoll with TCP_ZEROCOPY_RECEIVE to
> kperf and compare.
>

Awesome to hear. For us, we do use TCP_ZEROCOPY_RECEIVE (with sockets),
so I'm unsure how much benefit we'll see if we use this. Comparing
against TCP_ZEROCOPY_RECEIVE will be more of an apples-to-apples
comparison and will also motivate folks using the old thing to switch
to the new thing.

--
Thanks,
Mina
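
For reference, the epoll + TCP_ZEROCOPY_RECEIVE path being proposed as the
comparison baseline can be sketched roughly as below, loosely following the
kernel's tools/testing/selftests/net/tcp_mmap.c selftest. zc_recv_loop(),
consume() and CHUNK are illustrative names only and are not part of this
patchset or of kperf.

/*
 * Minimal sketch of an epoll + TCP_ZEROCOPY_RECEIVE receive loop.
 * Real code needs full error handling and should drain recv_skip_hint
 * completely before relying on the next zerocopy call.
 */
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>          /* TCP_ZEROCOPY_RECEIVE */

#define CHUNK (1 << 20)         /* 1 MiB receive window, page aligned (illustrative) */

void consume(const void *data, size_t len);     /* application-defined sink */

static int zc_recv_loop(int conn_fd)
{
    /* Map a VMA onto the TCP socket; the kernel installs payload pages
     * here on each TCP_ZEROCOPY_RECEIVE call instead of copying them. */
    void *addr = mmap(NULL, CHUNK, PROT_READ, MAP_SHARED, conn_fd, 0);
    if (addr == MAP_FAILED)
        return -1;

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = conn_fd };
    if (ep < 0 || epoll_ctl(ep, EPOLL_CTL_ADD, conn_fd, &ev)) {
        munmap(addr, CHUNK);
        return -1;
    }

    for (;;) {
        struct tcp_zerocopy_receive zc;
        socklen_t zc_len = sizeof(zc);

        if (epoll_wait(ep, &ev, 1, -1) <= 0)
            break;

        memset(&zc, 0, sizeof(zc));
        zc.address = (__u64)(unsigned long)addr;
        zc.length = CHUNK;

        if (getsockopt(conn_fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE,
                       &zc, &zc_len))
            break;

        /* zc.length bytes of payload are now mapped at addr. */
        if (zc.length)
            consume(addr, zc.length);

        /* Bytes the kernel could not map (e.g. not page aligned) have to
         * be drained with an ordinary copying recv(). */
        if (zc.recv_skip_hint) {
            char buf[4096];
            size_t want = zc.recv_skip_hint < sizeof(buf) ?
                          zc.recv_skip_hint : sizeof(buf);
            ssize_t n = recv(conn_fd, buf, want, 0);

            if (n <= 0)
                break;
            consume(buf, (size_t)n);
        }

        if (!zc.length && !zc.recv_skip_hint)
            break;                      /* nothing left: peer closed */
    }

    close(ep);
    munmap(addr, CHUNK);
    return 0;
}

The main contrast with the io_uring approach above is that payload pages are
remapped into the process per getsockopt() call, which is why page-aligned
MSS sizing matters for how much data avoids the copy fallback.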