Hi Willem, sorry for the late reply, some holiday vacations and other priorities pulled me away from this a bit. I'm getting ready to post RFC V2, so answering some questions ahead of when I do that. On Thu, Dec 26, 2024 at 1:52 PM Willem de Bruijn <willemdebruijn.kernel@xxxxxxxxx> wrote: ... > > +static int zerocopy_fill_skb_from_devmem(struct sk_buff *skb, > > + struct msghdr *msg, > > + struct iov_iter *from, int length) > > +{ > > + int i = skb_shinfo(skb)->nr_frags; > > + int orig_length = length; > > + netmem_ref netmem; > > + size_t size; > > + > > + while (length && iov_iter_count(from)) { > > + if (i == MAX_SKB_FRAGS) > > + return -EMSGSIZE; > > + > > + size = min_t(size_t, iter_iov_len(from), length); > > + if (!size) > > + return -EFAULT; > > On error, should caller skb_zerocopy_iter_stream rewind from, rather > than (or as well as) msg->msg_iter? Ah, so this was confusing, because there were 2 iterators to keep track off, (a) binding->tx_iter, which is `from` and (b) msg->msg_iter. Stan suggested removing binding->tx_iter entirely, so that we're back to using only 1 iterator, which is msg->msg_iter. That does simplify the code greatly, and I think addresses this comment as well, because there will be no need to make sure from is advanced/reverted with msg->msg_iter. > > + > > + netmem = net_iov_to_netmem(iter_iov(from)->iov_base); > > + get_netmem(netmem); > > + skb_add_rx_frag_netmem(skb, i, netmem, from->iov_offset, size, > > + PAGE_SIZE); > > + > > + iov_iter_advance(from, size); > > + length -= size; > > + i++; > > + } > > + > > + iov_iter_advance(&msg->msg_iter, orig_length); > > What does this do if sendmsg is called with NULL as buffer? So even if iov_base == NULL, the iterator is created anyhow. The iterator will be from addresses 0 -> iov_len. In the next iteration, I've applied Stan's suggestion to use iov_base as the offset into the dma-buf to send from. I think it ends up being a much cleaner UAPI, but let me know what you think. ... > > struct net_iov * > > net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding) > > @@ -109,6 +112,13 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) > > unsigned long xa_idx; > > unsigned int rxq_idx; > > > > + xa_erase(&net_devmem_dmabuf_bindings, binding->id); > > + > > + /* Ensure no tx net_devmem_lookup_dmabuf() are in flight after the > > + * erase. > > + */ > > + synchronize_net(); > > + > > What precisely does this protect? > > synchronize_net() ensures no packet is in flight inside an rcu > readside section. But a packet can still be in flight, such as posted > to the device or queued in a qdisc. > The TX data path does a net_devmem_lookup_dmabuf() to lookup the dmabuf_id provided by the user. But that dmabuf_id may be unbind'd via net_devmem_unbind_dmabuf () by the user at any time, so some synchronization is needed to make sure we don't do a send from a dmabuf that is being freed in another thread. The synchronization in this patch is such that the lookup path obtains a reference under rcu lock, and the unbind control path makes sure to wait a full RCU grace period before dropping reference via net_devmem_dmabuf_binding_put(). net_devmem_dmabuf_binding_put() will trigger freeing the binding if the refcount hits zero. ... > > +struct net_devmem_dmabuf_binding * > > +net_devmem_get_sockc_binding(struct sock *sk, struct sockcm_cookie *sockc) > > +{ > > + struct net_devmem_dmabuf_binding *binding; > > + int err = 0; > > + > > + binding = net_devmem_lookup_dmabuf(sockc->dmabuf_id); > > This lookup is from global xarray net_devmem_dmabuf_bindings. > > Is there a check that the socket is sending out through the device > to which this dmabuf was bound with netlink? Should there be? > (e.g., SO_BINDTODEVICE). > Yes, I think it may be an issue if the user triggers a send from a different netdevice, because indeed when we bind a dmabuf we bind it to a specific netdevice. One option is as you say to require TX sockets to be bound and to check that we're bound to the correct netdev. I also wonder if I can make this work without SO_BINDTODEVICE, by querying the netdev the sock is currently trying to send out on and doing a check in the tcp_sendmsg. I'm not sure if this is possible but I'll give it a look. -- Thanks, Mina