On Mon, Nov 6, 2023 at 11:34 AM David Ahern <dsahern@xxxxxxxxxx> wrote:
>
> On 11/6/23 11:47 AM, Stanislav Fomichev wrote:
> > On 11/05, Mina Almasry wrote:
> >> For device memory TCP, we expect the skb headers to be available in host
> >> memory for access, and we expect the skb frags to be in device memory
> >> and unaccessible to the host. We expect there to be no mixing and
> >> matching of device memory frags (unaccessible) with host memory frags
> >> (accessible) in the same skb.
> >>
> >> Add a skb->devmem flag which indicates whether the frags in this skb
> >> are device memory frags or not.
> >>
> >> __skb_fill_page_desc() now checks frags added to skbs for page_pool_iovs,
> >> and marks the skb as skb->devmem accordingly.
> >>
> >> Add checks through the network stack to avoid accessing the frags of
> >> devmem skbs and avoid coalescing devmem skbs with non devmem skbs.
> >>
> >> Signed-off-by: Willem de Bruijn <willemb@xxxxxxxxxx>
> >> Signed-off-by: Kaiyuan Zhang <kaiyuanz@xxxxxxxxxx>
> >> Signed-off-by: Mina Almasry <almasrymina@xxxxxxxxxx>
> >>
> >> ---
> >>  include/linux/skbuff.h | 14 +++++++-
> >>  include/net/tcp.h      |  5 +--
> >>  net/core/datagram.c    |  6 ++++
> >>  net/core/gro.c         |  5 ++-
> >>  net/core/skbuff.c      | 77 ++++++++++++++++++++++++++++++++++++------
> >>  net/ipv4/tcp.c         |  6 ++++
> >>  net/ipv4/tcp_input.c   | 13 +++++--
> >>  net/ipv4/tcp_output.c  |  5 ++-
> >>  net/packet/af_packet.c |  4 +--
> >>  9 files changed, 115 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >> index 1fae276c1353..8fb468ff8115 100644
> >> --- a/include/linux/skbuff.h
> >> +++ b/include/linux/skbuff.h
> >> @@ -805,6 +805,8 @@ typedef unsigned char *sk_buff_data_t;
> >>   * @csum_level: indicates the number of consecutive checksums found in
> >>   *              the packet minus one that have been verified as
> >>   *              CHECKSUM_UNNECESSARY (max 3)
> >> + * @devmem: indicates that all the fragments in this skb are backed by
> >> + *          device memory.
> >>   * @dst_pending_confirm: need to confirm neighbour
> >>   * @decrypted: Decrypted SKB
> >>   * @slow_gro: state present at GRO time, slower prepare step required
> >> @@ -991,7 +993,7 @@ struct sk_buff {
> >>  #if IS_ENABLED(CONFIG_IP_SCTP)
> >>          __u8 csum_not_inet:1;
> >>  #endif
> >> -
> >> +        __u8 devmem:1;
> >>  #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
> >>          __u16 tc_index; /* traffic control index */
> >>  #endif
> >> @@ -1766,6 +1768,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb)
> >>          __skb_zcopy_downgrade_managed(skb);
> >>  }
> >>
> >> +/* Return true if frags in this skb are not readable by the host. */
> >> +static inline bool skb_frags_not_readable(const struct sk_buff *skb)
> >> +{
> >> +        return skb->devmem;
> >
> > bikeshedding: should we also rename 'devmem' sk_buff flag to 'not_readable'?
> > It better communicates the fact that the stack shouldn't dereference the
> > frags (because it has 'devmem' fragments or for some other potential
> > future reason).
>
> +1.
>
> Also, the flag on the skb is an optimization - a high level signal that
> one or more frags is in unreadable memory. There is no requirement that
> all of the frags are in the same memory type.

The flag indicates that the skb contains all devmem dma-buf memory
specifically, not generic 'not_readable' frags as the comment says:

+ * @devmem: indicates that all the fragments in this skb are backed by
+ *          device memory.

The reason it's not a generic 'not_readable' flag is because handing
off a generic not_readable skb to the userspace is semantically not
what we're doing. recvmsg() is augmented in this patch series to
return a devmem skb to the user via a cmsg_devmem struct which refers
specifically to the memory in the dma-buf. recvmsg() in this patch
series is not augmented to give any 'not_readable' skb to the
userspace.

IMHO skb->devmem + an skb_frags_not_readable() as implemented is correct.

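To make the intended semantics concrete, here is a rough userspace
sketch of the flag plus the helper. It is not the kernel code: struct
mock_skb and the mock_* functions are hypothetical stand-ins, and
rejecting a mixed frag in mock_add_frag() is an assumption of the
sketch that models the "no mixing and matching" expectation from the
commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct sk_buff; only the fields
 * needed for this illustration are modelled.
 */
struct mock_skb {
	unsigned int nr_frags;
	unsigned char devmem:1;	/* all frags are backed by device memory */
};

/* Mirrors the shape of skb_frags_not_readable() quoted above. */
static inline bool mock_frags_not_readable(const struct mock_skb *skb)
{
	assert(skb != NULL);
	return skb->devmem;
}

/* Add one frag and mark the skb accordingly. Assumption of this sketch:
 * a frag whose memory type differs from the frags already present is
 * rejected, so the flag always describes every frag in the skb.
 */
static inline bool mock_add_frag(struct mock_skb *skb, bool frag_is_devmem)
{
	assert(skb != NULL);
	if (skb->nr_frags > 0 && (bool)skb->devmem != frag_is_devmem)
		return false;
	skb->devmem = frag_is_devmem;
	skb->nr_frags++;
	return true;
}

/* Coalescing is only allowed between skbs of the same readability. */
static inline bool mock_can_coalesce(const struct mock_skb *to,
				     const struct mock_skb *from)
{
	return mock_frags_not_readable(to) == mock_frags_not_readable(from);
}
```

With this model, a zero-initialised mock_skb that receives only devmem
frags reports mock_frags_not_readable() == true, and it cannot be
coalesced with an skb that holds host memory frags.
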
If a new type of unreadable skb is introduced to the stack, I imagine
the stack would implement:

1. A new header flag: skb->newmem

2. static inline bool skb_frags_not_readable(const struct sk_buff *skb)
   {
           return skb->devmem || skb->newmem;
   }

3. tcp_recvmsg_devmem() would handle skb->devmem skbs as in this patch
   series, but tcp_recvmsg_newmem() would handle skb->newmem skbs.

--
Thanks,
Mina