On 10/9/24 20:32, Mina Almasry wrote:
On Wed, Oct 9, 2024 at 9:57 AM Jens Axboe <axboe@xxxxxxxxx> wrote:
On 10/9/24 10:55 AM, Mina Almasry wrote:
On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@xxxxxxxxxxx> wrote:
This patchset adds support for zero copy rx into userspace pages using
io_uring, eliminating a kernel-to-user copy.
We configure a page pool that a driver uses to fill a hw rx queue to
hand out user pages instead of kernel pages. Any data that ends up
hitting this hw rx queue will thus be dma'd into userspace memory
directly, without needing to be bounced through kernel memory. 'Reading'
data out of a socket instead becomes a _notification_ mechanism, where
the kernel tells userspace where the data is. The overall approach is
similar to the devmem TCP proposal.
This relies on hw header/data split, flow steering and RSS to ensure
packet headers remain in kernel memory and only desired flows hit a hw
rx queue configured for zero copy. Configuring this is outside of the
scope of this patchset.
We share netdev core infra with devmem TCP. The main difference is that
io_uring is used for the uAPI and the lifetime of all objects is bound
to an io_uring instance.
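
The notification model described in the cover letter above can be
illustrated with a toy userspace simulation. This is only a sketch under
stated assumptions, not the io_uring uAPI or the patchset's code: the
names ZeroCopyRxModel, Completion, hw_receive, reap and recycle are all
hypothetical, the "NIC" is a plain method call, and a bytearray stands in
for the registered user pages. It shows only the idea that a "read"
returns an (offset, length) notification and the payload is viewed in
place rather than copied.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Completion:
    """Hypothetical notification: where in the user area the payload landed."""
    offset: int
    length: int


class ZeroCopyRxModel:
    """Toy model (not the real uAPI): a user-owned area plus a completion queue."""

    def __init__(self, area_size: int, chunk_size: int) -> None:
        if area_size % chunk_size != 0:
            raise ValueError("area_size must be a multiple of chunk_size")
        self.area = bytearray(area_size)  # stands in for registered user pages
        self.chunk_size = chunk_size
        # Offsets of chunks the simulated hardware may still fill.
        self.free_chunks = deque(range(0, area_size, chunk_size))
        self.completions = deque()

    def hw_receive(self, payload: bytes) -> None:
        """Simulate the NIC placing a payload directly into the user area."""
        if len(payload) > self.chunk_size:
            raise ValueError("payload larger than one chunk")
        if not self.free_chunks:
            raise BufferError("no free user buffers left")
        offset = self.free_chunks.popleft()
        # Same-length slice assignment: writes in place, never resizes the area.
        self.area[offset:offset + len(payload)] = payload
        self.completions.append(Completion(offset, len(payload)))

    def reap(self):
        """'Read': pop one notification and return a no-copy view of the data."""
        if not self.completions:
            return None
        c = self.completions.popleft()
        return c, memoryview(self.area)[c.offset:c.offset + c.length]

    def recycle(self, c: Completion) -> None:
        """Return the chunk so the simulated hardware can fill it again."""
        self.free_chunks.append(c.offset)


if __name__ == "__main__":
    rx = ZeroCopyRxModel(area_size=8192, chunk_size=4096)
    rx.hw_receive(b"payload bytes")
    completion, view = rx.reap()
    print(completion, bytes(view))
    rx.recycle(completion)
```

The memoryview slice refers to the same underlying buffer the simulated
hardware wrote into, which is the point of the exercise: the consumer is
told where the data is instead of receiving a copy of it.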
I've been thinking about this a bit, and I hope this feedback isn't
too late, but I think your work may be useful for users not using
io_uring, i.e. zero copy to host memory that does not depend on
page-aligned MSS sizing. In other words, AF_XDP-style zerocopy, but
using the TCP stack.
Not David, but come on, let's please get this moving forward. It's been
stuck behind dependencies for seemingly forever, which are finally
resolved.
Part of the reason this has been stuck behind dependencies for so long
is that the dependency took the time to implement things very
generically (memory providers, net_iovs) and provided you with the
primitives that enable your work. It also dealt with nacks in this area
that you now don't have to deal with.
And that's well appreciated, but I completely share Jens' sentiment.
Is there anything, such as uAPI concerns, that prevents it from being
implemented afterwards / separately? I'd say that for io_uring users
it's nice to have the API done the io_uring way regardless of the
socket API option, so at the very least it would fork on the completion
format, and the socket variant would need to have a different ring/etc.
I don't think this is a reasonable ask at all for this
patchset. If you want to work on that after the fact, then that's
certainly an option.
I think this work is extensible to sockets and the implementation need
not be heavily tied to io_uring. At the very least, leaving things open
so that a socket extension is easier to do in the future would be good,
IMO.
And as far as I can tell there is already a socket API allowing
all that, called devmem TCP :) It might need a slight improvement on
the registration side, unless dmabuf-wrapped user pages are good
enough.
I'll look at the series more closely to see if I actually have any
concrete feedback along these lines. I hope you're open to some of it
:-)
--
Pavel Begunkov