On 11/06, Mina Almasry wrote: > On Mon, Nov 6, 2023 at 10:44 AM Stanislav Fomichev <sdf@xxxxxxxxxx> wrote: > > > > On 11/05, Mina Almasry wrote: > > > In tcp_recvmsg_locked(), detect if the skb being received by the user > > > is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM > > > flag - pass it to tcp_recvmsg_devmem() for custom handling. > > > > > > tcp_recvmsg_devmem() copies any data in the skb header to the linear > > > buffer, and returns a cmsg to the user indicating the number of bytes > > > returned in the linear buffer. > > > > > > tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags, > > > and returns to the user a cmsg_devmem indicating the location of the > > > data in the dmabuf device memory. cmsg_devmem contains this information: > > > > > > 1. the offset into the dmabuf where the payload starts. 'frag_offset'. > > > 2. the size of the frag. 'frag_size'. > > > 3. an opaque token 'frag_token' to return to the kernel when the buffer > > > is to be released. > > > > > > The pages awaiting freeing are stored in the newly added > > > sk->sk_user_pages, and each page passed to userspace is get_page()'d. > > > This reference is dropped once the userspace indicates that it is > > > done reading this page. All pages are released when the socket is > > > destroyed. > > > > > > Signed-off-by: Willem de Bruijn <willemb@xxxxxxxxxx> > > > Signed-off-by: Kaiyuan Zhang <kaiyuanz@xxxxxxxxxx> > > > Signed-off-by: Mina Almasry <almasrymina@xxxxxxxxxx> > > > > > > --- > > > > > > RFC v3: > > > - Fixed issue with put_cmsg() failing silently. > > > > > > --- > > > include/linux/socket.h | 1 + > > > include/net/page_pool/helpers.h | 9 ++ > > > include/net/sock.h | 2 + > > > include/uapi/asm-generic/socket.h | 5 + > > > include/uapi/linux/uio.h | 6 + > > > net/ipv4/tcp.c | 189 +++++++++++++++++++++++++++++- > > > net/ipv4/tcp_ipv4.c | 7 ++ > > > 7 files changed, 214 insertions(+), 5 deletions(-) > > > > > > diff --git a/include/linux/socket.h b/include/linux/socket.h > > > index cfcb7e2c3813..fe2b9e2081bb 100644 > > > --- a/include/linux/socket.h > > > +++ b/include/linux/socket.h > > > @@ -326,6 +326,7 @@ struct ucred { > > > * plain text and require encryption > > > */ > > > > > > +#define MSG_SOCK_DEVMEM 0x2000000 /* Receive devmem skbs as cmsg */ > > > > Sharing the feedback that I've been providing internally on the public list: > > > > There may have been a miscommunication. I don't recall hearing this > specific feedback from you, at least in the last few months. Sorry if > it seemed like I'm ignoring feedback :) No worries, there was a thread long time ago about this whole token interface and whether it should support out-of-order refills, etc. > > IMHO, we need a better UAPI to receive the tokens and give them back to > > the kernel. CMSG + setsockopt(SO_DEVMEM_DONTNEED) get the job done, > > but look dated and hacky :-( > > > > We should either do some kind of user/kernel shared memory queue to > > receive/return the tokens (similar to what Jonathan was doing in his > > proposal?) > > I'll take a look at Jonathan's proposal, sorry, I'm not immediately > familiar but I wanted to respond :-) But is the suggestion here to > build a new kernel-user communication channel primitive for the > purpose of passing the information in the devmem cmsg? IMHO that seems > like an overkill. Why add 100-200 lines of code to the kernel to add > something that can already be done with existing primitives? I don't > see anything concretely wrong with cmsg & setsockopt approach, and if > we switch to something I'd prefer to switch to an existing primitive > for simplicity? > > The only other existing primitive to pass data outside of the linear > buffer is the MSG_ERRQUEUE that is used for zerocopy. Is that > preferred? Any other suggestions or existing primitives I'm not aware > of? I guess I'm just wondering whether other people have any suggestions here. Not sure Jonathan's way was better, but we fundamentally have two queues between the kernel and the userspace: - userspace receiving tokens (recvmsg + magical flag) - userspace refilling tokens (setsockopt + magical flag) So having some kind of shared memory producer-consumer queue feels natural. And using 'classic' socket api here feels like a stretch, idk. But maybe I'm overthinking and overcomplicating :-) > > or bite the bullet and switch to io_uring. > > > > IMO io_uring & socket support are orthogonal, and one doesn't preclude > the other. As you know we like to use sockets and I believe there are > issues with io_uring adoption at Google that I'm not familiar with > (and could be wrong). I'm interested in exploring io_uring support as > a follow up but I think David Wei will be interested in io_uring > support as well anyway. Ack, might be one more reason on our side to adopt iouring :-p > > I was also suggesting to do it via netlink initially, but it's probably > > a bit slow for these purpose, idk. > > Yeah, I hear netlink is reserved for control paths and is > inappropriate for data path, but I'll let folks correct me if wrong. > > -- > Thanks, > Mina