Re: [RFC PATCH v3 10/12] tcp: RX path for devmem TCP

Mina Almasry <almasrymina@xxxxxxxxxx> · Wed, 8 Nov 2023 18:39:11 -0800

On Tue, Nov 7, 2023 at 4:01 PM David Ahern <dsahern@xxxxxxxxxx> wrote:
>
> On 11/7/23 4:55 PM, Mina Almasry wrote:
> > On Mon, Nov 6, 2023 at 4:03 PM Willem de Bruijn
> > <willemdebruijn.kernel@xxxxxxxxx> wrote:
> >>
> >> On Mon, Nov 6, 2023 at 3:55 PM David Ahern <dsahern@xxxxxxxxxx> wrote:
> >>>
> >>> On 11/6/23 4:32 PM, Stanislav Fomichev wrote:
> >>>>> The concise notification API returns tokens as a range for
> >>>>> compression, encoding as two 32-bit unsigned integers start + length.
> >>>>> It allows for even further batching by returning multiple such ranges
> >>>>> in a single call.
> >>>>
> >>>> Tangential: should tokens be u64? Otherwise we can't have more than
> >>>> 4gb unacknowledged. Or that's a reasonable constraint?
> >>>>
> >>>
> >>> Was thinking the same and with bits reserved for a dmabuf id to allow
> >>> multiple dmabufs in a single rx queue (future extension, but build the
> >>> capability in now). e.g., something like a 37b offset (128GB dmabuf
> >>> size), 19b length (large GRO), 8b dmabuf id (lots of dmabufs to a queue).
> >>
> >> Agreed. Converting to 64b now sounds like a good forward looking revision.
> >
> > The concept of IDing a dma-buf came up in a couple of different
> > contexts. First, in the context of us giving the dma-buf ID to the
> > user on recvmsg() to tell the user the data is in this specific
> > dma-buf. The second context is here, to bind dma-bufs with multiple
> > user-visible IDs to an rx queue.
> >
> > My issue here is that I don't see anything in the struct dma_buf that
> > can practically serve as an ID:
> >
> > https://elixir.bootlin.com/linux/v6.6-rc7/source/include/linux/dma-buf.h#L302
> >
> > Actually, from the userspace, only the name of the dma-buf seems
> > queryable. That's only unique if the user sets it as such. The dmabuf
> > FD can't serve as an ID. For our use case we need to support 1 process
> > doing the dma-buf bind via netlink, sharing the dma-buf FD to another
> > process, and that process receives the data.  In this case the FDs
> > shown by the 2 processes may be different. Converting to 64b is a
> > trivial change I can make now, but I'm not sure how to ID these
> > dma-bufs. Suggestions welcome. I'm not sure the dma-buf guys will
> > allow adding a new ID + APIs to query said dma-buf ID.
> >
>
> The API can be unique to this usage: e.g., add a dmabuf id to the
> netlink API. Userspace manages the ids (tells the kernel what value to
> use with an instance), the kernel validates no 2 dmabufs have the same
> id and then returns the value here.
>
>

Seems reasonable, will do.

On Wed, Nov 8, 2023 at 7:36 AM Edward Cree <ecree.xilinx@xxxxxxxxx> wrote:
>
> On 06/11/2023 21:17, Stanislav Fomichev wrote:
> > I guess I'm just wondering whether other people have any suggestions
> > here. Not sure Jonathan's way was better, but we fundamentally
> > have two queues between the kernel and the userspace:
> > - userspace receiving tokens (recvmsg + magical flag)
> > - userspace refilling tokens (setsockopt + magical flag)
> >
> > So having some kind of shared memory producer-consumer queue feels natural.
> > And using 'classic' socket api here feels like a stretch, idk.
>
> Do 'refilled tokens' (returned memory areas) get used for anything other
>  than subsequent RX?

Hi Ed!

Not really, it's only the subsequent RX.

>  If not then surely the way to return a memory area
>  in an io_uring idiom is just to post a new read sqe ('RX descriptor')
>  pointing into it, rather than explicitly returning it with setsockopt.

We're interested in using this with regular TCP sockets, not
necessarily io_uring. The io_uring interface to devmem TCP may very
well use what you suggest and can drop the setsockopt.

> (Being async means you can post lots of these, unlike recvmsg(), so you
>  don't need any kernel management to keep the RX queue filled; it can
>  just be all handled by the userland thus simplifying APIs overall.)
> Or I'm misunderstanding something?
>
> -e

--
Thanks,
Mina