On Mon, Jul 01, 2024 at 04:28:03PM GMT, Luigi Leonardi via B4 Relay wrote:
From: Marco Pinna <marco.pinn95@xxxxxxxxx>

Introduce an optimization in virtio_transport_send_pkt: when the
work queue (send_pkt_queue) is empty, the packet is put directly
in the virtqueue, reducing latency.

In the following benchmark (pingpong mode) the host sends a payload
to the guest and waits for the same payload back.

All vCPUs pinned individually to pCPUs.
vhost process pinned to a pCPU.
fio process pinned both on the host and inside the guest.

Host CPU: Intel i7-10700KF CPU @ 3.80GHz
Tool: Fio version 3.37-56
Env: Phys host + L1 Guest
Payload: 512 bytes
Runtime-per-test: 50s
Mode: pingpong (h-g-h)
Test runs: 50
Type: SOCK_STREAM

Before (Linux 6.8.11)
------
mean(1st percentile): 380.56 ns
mean(overall): 780.83 ns
mean(99th percentile): 8300.24 ns

After
------
mean(1st percentile): 370.59 ns
mean(overall): 720.66 ns
mean(99th percentile): 7600.27 ns

Same setup, using 4K payload:

Before (Linux 6.8.11)
------
mean(1st percentile): 458.84 ns
mean(overall): 1650.17 ns
mean(99th percentile): 42240.68 ns

After
------
mean(1st percentile): 450.12 ns
mean(overall): 1460.84 ns
mean(99th percentile): 37632.45 ns

Throughput (iperf-vsock):

Before (Linux 6.8.11)
------
G2H: 28.7 Gb/s

After
------
G2H: 40.8 Gb/s
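(For reference, the h-g-h ping-pong above is easy to approximate
without fio. Below is a minimal sketch using raw AF_VSOCK sockets,
run on the host against an echo server in the guest; GUEST_CID,
PORT, and the echo server are assumptions of the sketch, not
details taken from the setup above. Vary PAYLOAD to test other
sizes such as 4K.)

/* Minimal vsock ping-pong sketch: send PAYLOAD bytes to an echo
 * server in the guest and time the round trip. GUEST_CID and PORT
 * are placeholders; error handling is kept minimal on purpose.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

#define GUEST_CID	3	/* assumption: CID of the test guest */
#define PORT		1234	/* assumption: port of the echo server */
#define PAYLOAD		512	/* bytes per round trip */

int main(void)
{
	struct sockaddr_vm addr = {
		.svm_family = AF_VSOCK,
		.svm_cid = GUEST_CID,
		.svm_port = PORT,
	};
	char buf[PAYLOAD] = { 0 };
	struct timespec t0, t1;
	int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

	if (fd < 0 || connect(fd, (struct sockaddr *)&addr, sizeof(addr)))
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	/* A real test should loop until the full payload is read back. */
	if (write(fd, buf, sizeof(buf)) != sizeof(buf) ||
	    read(fd, buf, sizeof(buf)) < 0)
		return 1;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("rtt: %ld ns\n",
	       (t1.tv_sec - t0.tv_sec) * 1000000000L +
	       (t1.tv_nsec - t0.tv_nsec));
	close(fd);
	return 0;
}

Averaging the printed RTT over many iterations gives numbers
comparable in kind to the fio results quoted above.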
Cool! I'd suggest adding the buffer length used (-l param), and also checking more lengths, e.g. at least 4k, 64k, and 128k.
The performance improvement is related to this optimization: I checked that each packet was put directly on the vq, avoiding the work queue.
How?
Co-developed-by: Luigi Leonardi <luigi.leonardi@xxxxxxxxxxx>
Signed-off-by: Luigi Leonardi <luigi.leonardi@xxxxxxxxxxx>
Signed-off-by: Marco Pinna <marco.pinn95@xxxxxxxxx>
I think you might want to change the author of this patch, since it has changed a lot from Marco's original one. Obviously, only if you both agree on this.
Thanks,
Stefano
---
 net/vmw_vsock/virtio_transport.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index a74083d28120..3815aa8d956b 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -213,6 +213,7 @@ virtio_transport_send_pkt(struct sk_buff *skb)
 {
 	struct virtio_vsock_hdr *hdr;
 	struct virtio_vsock *vsock;
+	bool use_worker = true;
 	int len = skb->len;
 
 	hdr = virtio_vsock_hdr(skb);
@@ -234,8 +235,41 @@ virtio_transport_send_pkt(struct sk_buff *skb)
 	if (virtio_vsock_skb_reply(skb))
 		atomic_inc(&vsock->queued_replies);
 
-	virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
-	queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+	/* If the workqueue (send_pkt_queue) is empty there is no need to enqueue the packet.
+	 * Just put it on the virtqueue using virtio_transport_send_skb.
+	 */
+	if (skb_queue_empty_lockless(&vsock->send_pkt_queue)) {
+		bool restart_rx = false;
+		struct virtqueue *vq;
+		int ret;
+
+		/* Inside RCU, can't sleep! */
+		ret = mutex_trylock(&vsock->tx_lock);
+		if (unlikely(ret == 0))
+			goto out_worker;
+
+		/* Driver is being removed, no need to enqueue the packet */
+		if (!vsock->tx_run)
+			goto out_rcu;
+
+		vq = vsock->vqs[VSOCK_VQ_TX];
+
+		if (!virtio_transport_send_skb(skb, vq, vsock, &restart_rx)) {
+			use_worker = false;
+			virtqueue_kick(vq);
+		}
+
+		mutex_unlock(&vsock->tx_lock);
+
+		if (restart_rx)
+			queue_work(virtio_vsock_workqueue, &vsock->rx_work);
+	}
+
+out_worker:
+	if (use_worker) {
+		virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
+		queue_work(virtio_vsock_workqueue, &vsock->send_pkt_work);
+	}
 
 out_rcu:
 	rcu_read_unlock();
--
2.45.2