Re: [PATCH net-next v8 4/4] vsock/virtio: MSG_ZEROCOPY flag support

On Mon, Sep 11, 2023 at 11:22:34PM +0300, Arseniy Krasnov wrote:
This adds handling of the MSG_ZEROCOPY flag on the transmission path:

1) If this flag is set and zerocopy transmission is possible (enabled
  in socket options and allowed by the transport), then a non-linear
  skb is created and filled with the pages of the user's buffer.
  These pages are locked in memory by 'get_user_pages()'.
2) Changes how the skb owner is set: 'skb_set_owner_w()' is called
  instead of 'skb_set_owner_sk_safe()'. The reason is that
  '__zerocopy_sg_from_iter()' increments 'sk_wmem_alloc' of the socket,
  so to decrement this field correctly the proper skb destructor,
  'sock_wfree()', is needed. This destructor is set by 'skb_set_owner_w()'.
3) Adds a new callback to 'struct virtio_transport': 'can_msgzerocopy'.
  If this callback is set, the transport needs an extra check before it
  can send the provided number of buffers in zerocopy mode. Currently,
  the only transport that needs this callback is virtio, because it
  adds new buffers to the virtio queue and we need to check that the
  number of these buffers does not exceed the size of the queue (this
  is required by the virtio spec). The vhost and loopback transports
  don't need this check.

Signed-off-by: Arseniy Krasnov <avkrasnov@xxxxxxxxxxxxxxxxx>
---
Changelog:
v5(big patchset) -> v1:
 * Refactorings of 'if' conditions.
 * Remove extra blank line.
 * Remove 'frag_off' field unneeded init.
 * Add function 'virtio_transport_fill_skb()' which fills both linear
   and non-linear skb with provided data.
v1 -> v2:
 * Use original order of last four arguments in 'virtio_transport_alloc_skb()'.
v2 -> v3:
 * Add new transport callback: 'msgzerocopy_check_iov'. It checks that
   provided 'iov_iter' with data could be sent in a zerocopy mode.
   If this callback is not set in transport - transport allows to send
   any 'iov_iter' in zerocopy mode. Otherwise - if callback returns 'true'
   then zerocopy is allowed. Reason of this callback is that in case of
   G2H transmission we insert whole skb to the tx virtio queue and such
   skb must fit to the size of the virtio queue to be sent in a single
   iteration (may be tx logic in 'virtio_transport.c' could be reworked
   as in vhost to support partial send of current skb). This callback
   will be enabled only for G2H path. For details pls see comment
   'Check that tx queue...' below.
v3 -> v4:
 * 'msgzerocopy_check_iov' moved from 'struct vsock_transport' to
   'struct virtio_transport' as it is virtio specific callback and
   never needed in other transports.
v4 -> v5:
 * 'msgzerocopy_check_iov' renamed to 'can_msgzerocopy' and now it
   uses number of buffers to send as input argument. I think there is
   no need to pass iov to this callback (at least today, it is used only
   by guest side of virtio transport), because the only thing that this
   callback does is comparison of number of buffers to be inserted to
   the tx queue and size of this queue.
 * Remove any checks for type of current 'iov_iter' with payload (is it
   'iovec' or 'ubuf'). These checks left from the earlier versions where I
   didn't use already implemented kernel API which handles every type of
   'iov_iter'.
v5 -> v6:
 * Refactor 'virtio_transport_fill_skb()'.
 * Add 'WARN_ON_ONCE()' and comment on invalid combination of destination
   socket and payload in 'virtio_transport_alloc_skb()'.
v7 -> v8:
 * Move '+1' addition from 'can_msgzerocopy' callback body to the caller.
   This addition means packet header.
 * In 'virtio_transport_can_zcopy()' rename 'max_to_send' argument to
   'pkt_len'.
 * Update commit message by adding details about new 'can_msgzerocopy'
   callback.
 * In 'virtio_transport_init_hdr()' move 'len' argument directly after
   'info'.
 * Add comment about processing last skb in tx loop.
 * Update comment for 'can_msgzerocopy' callback for more details.

include/linux/virtio_vsock.h            |   9 +
net/vmw_vsock/virtio_transport.c        |  32 +++
net/vmw_vsock/virtio_transport_common.c | 256 ++++++++++++++++++------
3 files changed, 239 insertions(+), 58 deletions(-)

diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index a91fbdf233e4..ebb3ce63d64d 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -160,6 +160,15 @@ struct virtio_transport {

	/* Takes ownership of the packet */
	int (*send_pkt)(struct sk_buff *skb);
+
+	/* Used in MSG_ZEROCOPY mode. Checks, that provided data
+	 * (number of buffers) could be transmitted with zerocopy
+	 * mode. If this callback is not implemented for the current
+	 * transport - this means that this transport doesn't need
+	 * extra checks and can perform zerocopy transmission by
+	 * default.
+	 */
+	bool (*can_msgzerocopy)(int bufs_num);
};

ssize_t
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 73d730156349..09ba3128e759 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -455,6 +455,37 @@ static void virtio_vsock_rx_done(struct virtqueue *vq)
	queue_work(virtio_vsock_workqueue, &vsock->rx_work);
}

+static bool virtio_transport_can_msgzerocopy(int bufs_num)
+{
+	struct virtio_vsock *vsock;
+	bool res = false;
+
+	rcu_read_lock();
+
+	vsock = rcu_dereference(the_virtio_vsock);
+	if (vsock) {
+		struct virtqueue *vq = vsock->vqs[VSOCK_VQ_TX];
+
+		/* Check that tx queue is large enough to keep whole
+		 * data to send. This is needed, because when there is
+		 * not enough free space in the queue, current skb to
+		 * send will be reinserted to the head of tx list of
+		 * the socket to retry transmission later, so if skb
+		 * is bigger than whole queue, it will be reinserted
+		 * again and again, thus blocking other skbs to be sent.
+		 * Each page of the user provided buffer will be added
+		 * as a single buffer to the tx virtqueue, so compare
+		 * number of pages against maximum capacity of the queue.
+		 */
+		if (bufs_num <= vq->num_max)
+			res = true;
+	}
+
+	rcu_read_unlock();
+
+	return res;
+}
+
static bool virtio_transport_seqpacket_allow(u32 remote_cid);

static struct virtio_transport virtio_transport = {
@@ -504,6 +535,7 @@ static struct virtio_transport virtio_transport = {
	},

	.send_pkt = virtio_transport_send_pkt,
+	.can_msgzerocopy = virtio_transport_can_msgzerocopy,
};

static bool virtio_transport_seqpacket_allow(u32 remote_cid)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 3a48e48a99ac..e358f118b07e 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -37,73 +37,110 @@ virtio_transport_get_ops(struct vsock_sock *vsk)
	return container_of(t, struct virtio_transport, transport);
}

-/* Returns a new packet on success, otherwise returns NULL.
- *
- * If NULL is returned, errp is set to a negative errno.
- */

Why are we removing this comment?

-static struct sk_buff *
-virtio_transport_alloc_skb(struct virtio_vsock_pkt_info *info,
-			   size_t len,
-			   u32 src_cid,
-			   u32 src_port,
-			   u32 dst_cid,
-			   u32 dst_port)
-{
-	const size_t skb_len = VIRTIO_VSOCK_SKB_HEADROOM + len;
-	struct virtio_vsock_hdr *hdr;
-	struct sk_buff *skb;
-	void *payload;
-	int err;
+static bool virtio_transport_can_zcopy(struct virtio_vsock_pkt_info *info,
+				       size_t pkt_len)
+{
+	const struct virtio_transport *t_ops;
+	struct iov_iter *iov_iter;

-	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
-	if (!skb)
-		return NULL;
+	if (!info->msg)
+		return false;

-	hdr = virtio_vsock_hdr(skb);
-	hdr->type	= cpu_to_le16(info->type);
-	hdr->op		= cpu_to_le16(info->op);
-	hdr->src_cid	= cpu_to_le64(src_cid);
-	hdr->dst_cid	= cpu_to_le64(dst_cid);
-	hdr->src_port	= cpu_to_le32(src_port);
-	hdr->dst_port	= cpu_to_le32(dst_port);
-	hdr->flags	= cpu_to_le32(info->flags);
-	hdr->len	= cpu_to_le32(len);
+	iov_iter = &info->msg->msg_iter;

-	if (info->msg && len > 0) {
-		payload = skb_put(skb, len);
-		err = memcpy_from_msg(payload, info->msg, len);
-		if (err)
-			goto out;
+	if (iov_iter->iov_offset)
+		return false;

-		if (msg_data_left(info->msg) == 0 &&
-		    info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {
-			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
+	/* We can't send whole iov. */
+	if (iov_iter->count > pkt_len)
+		return false;

-			if (info->msg->msg_flags & MSG_EOR)
-				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
-		}
+	/* Check that transport can send data in zerocopy mode. */
+	t_ops = virtio_transport_get_ops(info->vsk);

While reviewing, I was wondering why we don't check here whether `t_ops`
is NULL.

Then I realized that the only caller of this function
(virtio_transport_send_pkt_info()) already gets the vsk ops by calling
virtio_transport_get_ops() and also checks whether they are NULL.

So what about passing the ops as a function parameter, to avoid calling
virtio_transport_get_ops() again?
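
A rough userspace sketch of the idea, using stand-in types rather than the real kernel structs. `struct transport_ops`, `can_zcopy()` and `queue_of_four()` are hypothetical names for illustration only, not code from the patch. The caller looks up and NULL-checks the ops once, then hands them to the helper:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for 'struct virtio_transport'; only the optional callback. */
struct transport_ops {
	bool (*can_msgzerocopy)(int bufs_num);
};

/* The helper receives the ops from its caller, which is assumed to
 * have already checked them for NULL, so no second lookup is needed.
 */
static bool can_zcopy(const struct transport_ops *ops, int pages_to_send)
{
	assert(ops != NULL);

	/* No callback set: the transport needs no extra check. */
	if (!ops->can_msgzerocopy)
		return true;

	/* +1 is for packet header. */
	return ops->can_msgzerocopy(pages_to_send + 1);
}

/* Example callback: pretend the tx queue holds at most 4 buffers. */
static bool queue_of_four(int bufs_num)
{
	return bufs_num <= 4;
}
```

In the real function the other checks (`info->msg`, `iov_offset`, `count`) would stay as they are; only the ops lookup would move to the caller.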

+
+	if (t_ops->can_msgzerocopy) {
+		int pages_in_iov = iov_iter_npages(iov_iter, MAX_SKB_FRAGS);
+		int pages_to_send = min(pages_in_iov, MAX_SKB_FRAGS);
+
+		/* +1 is for packet header. */
+		return t_ops->can_msgzerocopy(pages_to_send + 1);
	}

-	if (info->reply)
-		virtio_vsock_skb_set_reply(skb);
+	return true;
+}

-	trace_virtio_transport_alloc_pkt(src_cid, src_port,
-					 dst_cid, dst_port,
-					 len,
-					 info->type,
-					 info->op,
-					 info->flags);
+static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,
+					   struct sk_buff *skb,
+					   struct msghdr *msg,
+					   bool zerocopy)
+{
+	struct ubuf_info *uarg;

-	if (info->vsk && !skb_set_owner_sk_safe(skb, sk_vsock(info->vsk))) {
-		WARN_ONCE(1, "failed to allocate skb on vsock socket with sk_refcnt == 0\n");
-		goto out;
+	if (msg->msg_ubuf) {
+		uarg = msg->msg_ubuf;
+		net_zcopy_get(uarg);
+	} else {
+		struct iov_iter *iter = &msg->msg_iter;
+		struct ubuf_info_msgzc *uarg_zc;
+
+		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
+					    iter->count,
+					    NULL);
+		if (!uarg)
+			return -1;
+
+		uarg_zc = uarg_to_msgzc(uarg);
+		uarg_zc->zerocopy = zerocopy ? 1 : 0;
	}

-	return skb;
+	skb_zcopy_init(skb, uarg);

-out:
-	kfree_skb(skb);
-	return NULL;
+	return 0;
+}
+
+static int virtio_transport_fill_skb(struct sk_buff *skb,
+				     struct virtio_vsock_pkt_info *info,
+				     size_t len,
+				     bool zcopy)
+{
+	void *payload;
+	int err;
+
+	if (zcopy)
+		return __zerocopy_sg_from_iter(info->msg, NULL, skb,
+					       &info->msg->msg_iter,
+					       len);
+
+	payload = skb_put(skb, len);
+	err = memcpy_from_msg(payload, info->msg, len);
+	if (err)
+		return -1;
+
+	if (msg_data_left(info->msg))
+		return 0;

We return 0 in either case, so what is the purpose of this check?

+
+	return 0;
+}
+
+static void virtio_transport_init_hdr(struct sk_buff *skb,
+				      struct virtio_vsock_pkt_info *info,
+				      size_t payload_len,
+				      u32 src_cid,
+				      u32 src_port,
+				      u32 dst_cid,
+				      u32 dst_port)
+{
+	struct virtio_vsock_hdr *hdr;
+
+	hdr = virtio_vsock_hdr(skb);
+	hdr->type	= cpu_to_le16(info->type);
+	hdr->op		= cpu_to_le16(info->op);
+	hdr->src_cid	= cpu_to_le64(src_cid);
+	hdr->dst_cid	= cpu_to_le64(dst_cid);
+	hdr->src_port	= cpu_to_le32(src_port);
+	hdr->dst_port	= cpu_to_le32(dst_port);
+	hdr->flags	= cpu_to_le32(info->flags);
+	hdr->len	= cpu_to_le32(payload_len);
}

static void virtio_transport_copy_nonlinear_skb(const struct sk_buff *skb,
@@ -214,6 +251,77 @@ static u16 virtio_transport_get_type(struct sock *sk)
		return VIRTIO_VSOCK_TYPE_SEQPACKET;
}

+static struct sk_buff *virtio_transport_alloc_skb(struct vsock_sock *vsk,

Before this patch we used `info->vsk` in virtio_transport_alloc_skb().
Is it now really necessary to add vsk as a parameter? If so, why?

+						  struct virtio_vsock_pkt_info *info,
+						  size_t payload_len,
+						  bool zcopy,
+						  u32 src_cid,
+						  u32 src_port,
+						  u32 dst_cid,
+						  u32 dst_port)
+{
+	struct sk_buff *skb;
+	size_t skb_len;
+
+	skb_len = VIRTIO_VSOCK_SKB_HEADROOM;
+
+	if (!zcopy)
+		skb_len += payload_len;
+
+	skb = virtio_vsock_alloc_skb(skb_len, GFP_KERNEL);
+	if (!skb)
+		return NULL;
+
+	virtio_transport_init_hdr(skb, info, payload_len, src_cid, src_port,
+				  dst_cid, dst_port);
+
+	/* If 'vsk' != NULL then payload is always present, so we
+	 * will never call '__zerocopy_sg_from_iter()' below without
+	 * setting skb owner in 'skb_set_owner_w()'. The only case
+	 * when 'vsk' == NULL is VIRTIO_VSOCK_OP_RST control message
+	 * without payload.
+	 */
+	WARN_ON_ONCE(!(vsk && (info->msg && payload_len)) && zcopy);
+
+	/* Set owner here, because '__zerocopy_sg_from_iter()' uses
+	 * owner of skb without check to update 'sk_wmem_alloc'.
+	 */
+	if (vsk)
+		skb_set_owner_w(skb, sk_vsock(vsk));
+
+	if (info->msg && payload_len > 0) {
+		int err;
+
+		err = virtio_transport_fill_skb(skb, info, payload_len, zcopy);
+		if (err)
+			goto out;
+
+		if (info->type == VIRTIO_VSOCK_TYPE_SEQPACKET) {

Before this patch, we did these steps only if
`msg_data_left(info->msg) == 0`. Why do we now do them unconditionally?
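
For reference, a small userspace model of the previous behaviour, where EOM/EOR were added only when the skb carried the last bytes of the message. `seq_flags()` and the `SEQ_*` values are made up for illustration; the real constants are the `VIRTIO_VSOCK_SEQ_*` ones used in the hunk above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative values only, not taken from the kernel headers. */
#define SEQ_EOM (1u << 0)
#define SEQ_EOR (1u << 1)

/* Returns the flags to OR into the header of a SEQPACKET skb.
 * 'data_left' models msg_data_left() after the payload was consumed,
 * 'msg_eor' models (msg_flags & MSG_EOR).
 */
static uint32_t seq_flags(size_t data_left, bool msg_eor)
{
	uint32_t flags = 0;

	/* Only the skb that finishes the message marks its end. */
	if (data_left == 0) {
		flags |= SEQ_EOM;
		if (msg_eor)
			flags |= SEQ_EOR;
	}

	return flags;
}
```

With this guard, an skb that leaves data in the message gets no EOM/EOR, which matches the code removed by this patch.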

+			struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
+
+			hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOM);
+
+			if (info->msg->msg_flags & MSG_EOR)
+				hdr->flags |= cpu_to_le32(VIRTIO_VSOCK_SEQ_EOR);
+		}
+	}
+
+	if (info->reply)
+		virtio_vsock_skb_set_reply(skb);
+
+	trace_virtio_transport_alloc_pkt(src_cid, src_port,
+					 dst_cid, dst_port,
+					 payload_len,
+					 info->type,
+					 info->op,
+					 info->flags);

Maybe we should now also trace `zcopy`.

+
+	return skb;
+out:
+	kfree_skb(skb);
+	return NULL;
+}
+
/* This function can only be used on connecting/connected sockets,
 * since a socket assigned to a transport is required.
 *
@@ -222,10 +330,12 @@ static u16 virtio_transport_get_type(struct sock *sk)
static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
					  struct virtio_vsock_pkt_info *info)
{
+	u32 max_skb_len = VIRTIO_VSOCK_MAX_PKT_BUF_SIZE;
	u32 src_cid, src_port, dst_cid, dst_port;
	const struct virtio_transport *t_ops;
	struct virtio_vsock_sock *vvs;
	u32 pkt_len = info->pkt_len;
+	bool can_zcopy = false;
	u32 rest_len;
	int ret;

@@ -254,15 +364,30 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
		return pkt_len;

+	if (info->msg) {
+		/* If zerocopy is not enabled by 'setsockopt()', we behave as
+		 * there is no MSG_ZEROCOPY flag set.
+		 */
+		if (!sock_flag(sk_vsock(vsk), SOCK_ZEROCOPY))
+			info->msg->msg_flags &= ~MSG_ZEROCOPY;
+
+		if (info->msg->msg_flags & MSG_ZEROCOPY)
+			can_zcopy = virtio_transport_can_zcopy(info, pkt_len);
+
+		if (can_zcopy)
+			max_skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE,
+					    (MAX_SKB_FRAGS * PAGE_SIZE));
+	}
+
	rest_len = pkt_len;

	do {
		struct sk_buff *skb;
		size_t skb_len;

-		skb_len = min_t(u32, VIRTIO_VSOCK_MAX_PKT_BUF_SIZE, rest_len);
+		skb_len = min(max_skb_len, rest_len);

-		skb = virtio_transport_alloc_skb(info, skb_len,
+		skb = virtio_transport_alloc_skb(vsk, info, skb_len, can_zcopy,
						 src_cid, src_port,
						 dst_cid, dst_port);
		if (!skb) {
@@ -270,6 +395,21 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
			break;
		}

+		/* We process buffer part by part, allocating skb on
+		 * each iteration. If this is last skb for this buffer
+		 * and MSG_ZEROCOPY mode is in use - we must allocate
+		 * completion for the current syscall.
+		 */
+		if (info->msg && info->msg->msg_flags & MSG_ZEROCOPY &&
+		    skb_len == rest_len && info->op == VIRTIO_VSOCK_OP_RW) {
+			if (virtio_transport_init_zcopy_skb(vsk, skb,
+							    info->msg,
+							    can_zcopy)) {
+				ret = -ENOMEM;
+				break;
+			}
+		}
+
		virtio_transport_inc_tx_pkt(vvs, skb);

		ret = t_ops->send_pkt(skb);
@@ -985,7 +1125,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
	if (!t)
		return -ENOTCONN;

-	reply = virtio_transport_alloc_skb(&info, 0,
+	reply = virtio_transport_alloc_skb(NULL, &info, 0, false,
					   le64_to_cpu(hdr->dst_cid),
					   le32_to_cpu(hdr->dst_port),
					   le64_to_cpu(hdr->src_cid),
--
2.25.1




