Re: [PATCH net-next v3 2/2] vsock/virtio: avoid queuing packets when work queue is empty

Stefano Garzarella <sgarzare@xxxxxxxxxx> · Fri, 12 Jul 2024 16:58:20 +0200

On Thu, Jul 11, 2024 at 04:58:47PM GMT, Luigi Leonardi via B4 Relay wrote:
From: Luigi Leonardi <luigi.leonardi@xxxxxxxxxxx>

Introduce an optimization in virtio_transport_send_pkt:
when the work queue (send_pkt_queue) is empty the packet is

Note: send_pkt_queue is just a queue of sk_buff, is not really a work 
queue.

put directly in the virtqueue increasing the throughput.

Why?

I'd write something like this, but feel free to change it:

When the driver needs to send new packets to the device, it always
queues the new sk_buffs into an intermediate queue (send_pkt_queue)
and schedules a worker (send_pkt_work) to then queue them into the
virtqueue exposed to the device.

This increases the chance of batching, but also introduces a lot of
latency into the communication. So we can optimize this path by
adding a fast path to be taken when there is no element in the
intermediate queue, there is space available in the virtqueue,
and no other process that is sending packets (tx_lock held).



In the following benchmark (pingpong mode) the host sends

"fio benchmark"

a payload to the guest and waits for the same payload back.

All vCPUs pinned individually to pCPUs.
vhost process pinned to a pCPU
fio process pinned both inside the host and the guest system.

Host CPU: Intel i7-10700KF CPU @ 3.80GHz
Tool: Fio version 3.37-56
Env: Phys host + L1 Guest
Runtime-per-test: 50s
Mode: pingpong (h-g-h)
Test runs: 50
Type: SOCK_STREAM

Before: Linux 6.9.7

Payload 512B:

	1st perc.	overall		99th perc.
Before	370		810.15		8656		ns
After	374		780.29		8741		ns

Payload 4K:

	1st perc.	overall		99th perc.
Before	460		1720.23		42752		ns
After	460		1520.84		36096		ns

The performance improvement is related to this optimization,
I used ebpf to check that each packet was sent directly to the
virtqueue.

Throughput: iperf-vsock

I would reorganize the description for a moment because it's a little 
confusing. For example like this:

The following benchmarks were run to check improvements in latency and 
throughput. The test bed is a host with Intel i7-10700KF CPU @ 3.80GHz 
and L1 guest running on QEMU/KVM.

- Latency
  Tool: ...

- Throughput
  Tool: ...

The size represents the buffer length (-l) to read/write
P represents the number parallel streams

P=1
	4K	64K	128K
Before	6.87	29.3	29.5 Gb/s
After	10.5	39.4	39.9 Gb/s

P=2
	4K	64K	128K
Before	10.5	32.8	33.2 Gb/s
After	17.8	47.7	48.5 Gb/s

P=4
	4K	64K	128K
Before	12.7	33.6	34.2 Gb/s
After	16.9	48.1	50.5 Gb/s

Wow, great! I'm a little surprised that the latency is not much 
affected, but the throughput benefits so much with that kind of 
optimization.

Maybe we can check the latency with smaller payloads like 64 bytes or 
even smaller.


Co-developed-by: Marco Pinna <marco.pinn95@xxxxxxxxx>
Signed-off-by: Marco Pinna <marco.pinn95@xxxxxxxxx>
Signed-off-by: Luigi Leonardi <luigi.leonardi@xxxxxxxxxxx>
---
net/vmw_vsock/virtio_transport.c | 38 ++++++++++++++++++++++++++++++++++----
1 file changed, 34 insertions(+), 4 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index c4205c22f40b..d75727fdc35f 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -208,6 +208,29 @@ virtio_transport_send_pkt_work(struct work_struct *work)
		queue_work(virtio_vsock_workqueue, &vsock->rx_work);
}

+/* Caller need to hold RCU for vsock.
+ * Returns 0 if the packet is successfully put on the vq.
+ */
+static int virtio_transport_send_skb_fast_path(struct virtio_vsock *vsock, struct sk_buff *skb)
+{
+	struct virtqueue *vq = vsock->vqs[VSOCK_VQ_TX];
+	int ret;
+
+	/* Inside RCU, can't sleep! */
+	ret = mutex_trylock(&vsock->tx_lock);
+	if (unlikely(ret == 0))
+		return -EBUSY;
+
+	ret = virtio_transport_send_skb(skb, vq, vsock);
+
+	mutex_unlock(&vsock->tx_lock);
+
+	/* Kick if virtio_transport_send_skb succeeded */

Superfluous comment, we can remove it.

+	if (ret == 0)
+		virtqueue_kick(vq);

nit: I'd add a blank line here after the if block to highlight that the 
return is out.

+	return ret;
+}
+
static int
virtio_transport_send_pkt(struct sk_buff *skb)
{
@@ -231,11 +254,18 @@ virtio_transport_send_pkt(struct sk_buff *skb)
		goto out_rcu;
	}

-	if (virtio_vsock_skb_reply(skb))
-		atomic_inc(&vsock->queued_replies);
+	/* If the workqueue (send_pkt_queue) is empty there is no need to enqueue the packet.

Again, send_pkt_queue is not a workqueue.

Here I would explain more why there is no need, the fact that we are not 
doing this is clear.

+	 * Just put it on the virtqueue using virtio_transport_send_skb_fast_path.
+	 */


nit: here I would instead remove the blank line to make it clear that 
the comment is about the code below.

-	virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
-	queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+	if (!skb_queue_empty_lockless(&vsock->send_pkt_queue) ||
+	    virtio_transport_send_skb_fast_path(vsock, skb)) {
+		/* Packet must be queued */

Please, include it in the comment before the if where you can explain 
the whole logic of the optimization.

+		if (virtio_vsock_skb_reply(skb))
+			atomic_inc(&vsock->queued_replies);

nit: blank line, how it was before this patch:

	if (virtio_vsock_skb_reply(skb))
		atomic_inc(&vsock->queued_replies);

	virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
	queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);


+		virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
+		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+	}

out_rcu:
	rcu_read_unlock();

--
2.45.2



I tested the patch and everything seems to be fine, all my comments are 
minor and style, the code should be fine!

Thanks,
Stefano