Re: [PATCH] virtio_ring: Fix the stale index in available ring

Gavin Shan <gshan@xxxxxxxxxx> · Thu, 14 Mar 2024 22:50:15 +1000

On 3/14/24 21:50, Michael S. Tsirkin wrote:
On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote:
On 3/14/24 18:05, Michael S. Tsirkin wrote:
On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote:
The issue is reported by Yihuang Yu who have 'netperf' test on
NVidia's grace-grace and grace-hopper machines. The 'netperf'
client is started in the VM hosted by grace-hopper machine,
while the 'netperf' server is running on grace-grace machine.

The VM is started with virtio-net and vhost has been enabled.
We observe a error message spew from VM and then soft-lockup
report. The error message indicates the data associated with
the descriptor (index: 135) has been released, and the queue
is marked as broken. It eventually leads to the endless effort
to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit()
and soft-lockup. The stale index 135 is fetched from the available
ring and published to the used ring by vhost, meaning we have
disordred write to the available ring element and available index.

    /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64              \
    -accel kvm -machine virt,gic-version=host                            \
       :                                                                 \
    -netdev tap,id=vnet0,vhost=on                                        \
    -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \

    [   19.993158] virtio_net virtio1: output.0:id 135 is not a head!

Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger
virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on
ARM64. It should work for other architectures, but performance loss is
expected.

Cc: stable@xxxxxxxxxxxxxxx
Reported-by: Yihuang Yu <yihyu@xxxxxxxxxx>
Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx>
---
   drivers/virtio/virtio_ring.c | 12 +++++++++---
   1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 49299b1f9ec7..7d852811c912 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
   	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
   	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
-	/* Descriptors and available array need to be set before we expose the
-	 * new available array entries. */
-	virtio_wmb(vq->weak_barriers);
+	/*
+	 * Descriptors and available array need to be set before we expose
+	 * the new available array entries. virtio_wmb() should be enough
+	 * to ensuere the order theoretically. However, a stronger barrier
+	 * is needed by ARM64. Otherwise, the stale data can be observed
+	 * by the host (vhost). A stronger barrier should work for other
+	 * architectures, but performance loss is expected.
+	 */
+	virtio_mb(false);


I don't get what is going on here. Any explanation why virtio_wmb is not
enough besides "it does not work"?


The change is replacing instruction "dmb" with "dsb". "dsb" is stronger barrier
than "dmb" because "dsb" ensures that all memory accesses raised before this
instruction is completed when the 'dsb' instruction completes. However, "dmb"
doesn't guarantee the order of completion of the memory accesses.

So 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, vq->split.avail_idx_shadow)'
can be completed before 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.

Completed as observed by which CPU?
We have 2 writes that we want observed by another CPU in order.
So if CPU observes a new value of idx we want it to see
new value in ring.
This is standard use of smp_wmb()
How are these 2 writes different?

What DMB does, is that is seems to ensure that effects
of 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, vq->split.avail_idx_shadow)'
are observed after effects of
'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'.



Completed as observed by the CPU where vhost worker is running. I don't think DMB
does the work here. If I'm understanding correctly, DMB ensures the order of these
two writes from the local CPU's standpoint. The written data can be stored in local
CPU's cache, not flushed to DRAM and propogated to the cache of the far CPU where
vhost worker is running. So DMB isn't ensuring the write data is observed from the
far CPU.

DSB ensures that the written data is observable from the far CPU immediately.
 


The stronger barrier 'dsb' ensures the completion order as we expected.

     virtio_wmb(true)         virt_mb(false)
       virt_wmb                 mb
         __smp_wmb               __mb
           dmb(ishst)              dsb(sy)

First, why would you want a non smp barrier when you are
synchronizing with another CPU? This is what smp_ means ...


Extraced from ARMv9 specificaton
================================
The DMB instruction is a memory barrier instruction that ensures the relative
order of memory accesses before the barrier with memory accesses after the
barrier. The DMB instruction _does not_ ensure the completion of any of the
memory accesses for which it ensures relative order.

Isn't this exactly what we need?


I don't think so. DMB gurantees the order of EXECUTION, but DSB gurantees the
order of COMPLETION. After the data store is executed, the written data can
be stored in local CPU's cache, far from completion. Only the instruction is
completed, the written data is synchronized to the far CPU where vhost worker
is running.

Here is another paper talking about the difference between DMB and DSB. It's
explaining the difference between DMB/DSB better than ARMv9 specification.
However, it's hard for us to understand DMB/DSB work from the hardware level
based on the specification and paper.

https://ipads.se.sjtu.edu.cn/_media/publications/liuppopp20.pdf
(section 2.2  Order-preserving Options)
(I'm extracting the relevant part as below)

Data Memory Barrier (DMB) prevents reordering of memory accesses across the barrier.
Instead of waiting for previous accesses to become observable in the specified domain,
DMB only maintains the relative order between memory accesses. Meanwhile, DMB does
not block any non-memory access operations.

Data Synchronization Barrier (DSB) prevents reordering of any instructions across
the barrier. DSB will make sure that all masters in the specified domain can observe
the previous operations before issuing any subsequent instructions.

A DSB instruction is a memory barrier that ensures that memory accesses that
occur before the DSB instruction have __completed__ before the completion of
the DSB instruction.

This seems appropriate for when you want to order things more
strongly. I do not get why it's necessary here.


Please refer to above explanations.

In doing this, it acts as a stronger barrier than a DMB
and all ordering that is created by a DMB with specific options is also generated
by a DSB with the same options.

   	vq->split.avail_idx_shadow++;
   	vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev,
   						vq->split.avail_idx_shadow);
--
2.44.0


Thanks,
Gavin