On Thu, Mar 14, 2024 at 08:15:22PM +1000, Gavin Shan wrote: > On 3/14/24 18:05, Michael S. Tsirkin wrote: > > On Thu, Mar 14, 2024 at 05:49:23PM +1000, Gavin Shan wrote: > > > The issue is reported by Yihuang Yu who have 'netperf' test on > > > NVidia's grace-grace and grace-hopper machines. The 'netperf' > > > client is started in the VM hosted by grace-hopper machine, > > > while the 'netperf' server is running on grace-grace machine. > > > > > > The VM is started with virtio-net and vhost has been enabled. > > > We observe a error message spew from VM and then soft-lockup > > > report. The error message indicates the data associated with > > > the descriptor (index: 135) has been released, and the queue > > > is marked as broken. It eventually leads to the endless effort > > > to fetch free buffer (skb) in drivers/net/virtio_net.c::start_xmit() > > > and soft-lockup. The stale index 135 is fetched from the available > > > ring and published to the used ring by vhost, meaning we have > > > disordred write to the available ring element and available index. > > > > > > /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64 \ > > > -accel kvm -machine virt,gic-version=host \ > > > : \ > > > -netdev tap,id=vnet0,vhost=on \ > > > -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0 \ > > > > > > [ 19.993158] virtio_net virtio1: output.0:id 135 is not a head! > > > > > > Fix the issue by replacing virtio_wmb(vq->weak_barriers) with stronger > > > virtio_mb(false), equivalent to replaced 'dmb' by 'dsb' instruction on > > > ARM64. It should work for other architectures, but performance loss is > > > expected. > > > > > > Cc: stable@xxxxxxxxxxxxxxx > > > Reported-by: Yihuang Yu <yihyu@xxxxxxxxxx> > > > Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx> > > > --- > > > drivers/virtio/virtio_ring.c | 12 +++++++++--- > > > 1 file changed, 9 insertions(+), 3 deletions(-) > > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > > > index 49299b1f9ec7..7d852811c912 100644 > > > --- a/drivers/virtio/virtio_ring.c > > > +++ b/drivers/virtio/virtio_ring.c > > > @@ -687,9 +687,15 @@ static inline int virtqueue_add_split(struct virtqueue *_vq, > > > avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1); > > > vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head); > > > - /* Descriptors and available array need to be set before we expose the > > > - * new available array entries. */ > > > - virtio_wmb(vq->weak_barriers); > > > + /* > > > + * Descriptors and available array need to be set before we expose > > > + * the new available array entries. virtio_wmb() should be enough > > > + * to ensuere the order theoretically. However, a stronger barrier > > > + * is needed by ARM64. Otherwise, the stale data can be observed > > > + * by the host (vhost). A stronger barrier should work for other > > > + * architectures, but performance loss is expected. > > > + */ > > > + virtio_mb(false); > > > > > > I don't get what is going on here. Any explanation why virtio_wmb is not > > enough besides "it does not work"? > > > > The change is replacing instruction "dmb" with "dsb". "dsb" is stronger barrier > than "dmb" because "dsb" ensures that all memory accesses raised before this > instruction is completed when the 'dsb' instruction completes. However, "dmb" > doesn't guarantee the order of completion of the memory accesses. > > So 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, vq->split.avail_idx_shadow)' > can be completed before 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'. Completed as observed by which CPU? We have 2 writes that we want observed by another CPU in order. So if CPU observes a new value of idx we want it to see new value in ring. This is standard use of smp_wmb() How are these 2 writes different? What DMB does, is that is seems to ensure that effects of 'vq->split.vring.avail->idx = cpu_to_virtio(_vq->vdev, vq->split.avail_idx_shadow)' are observed after effects of 'vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head)'. > The stronger barrier 'dsb' ensures the completion order as we expected. > > virtio_wmb(true) virt_mb(false) > virt_wmb mb > __smp_wmb __mb > dmb(ishst) dsb(sy) First, why would you want a non smp barrier when you are synchronizing with another CPU? This is what smp_ means ... > Extraced from ARMv9 specificaton > ================================ > The DMB instruction is a memory barrier instruction that ensures the relative > order of memory accesses before the barrier with memory accesses after the > barrier. The DMB instruction _does not_ ensure the completion of any of the > memory accesses for which it ensures relative order. Isn't this exactly what we need? > A DSB instruction is a memory barrier that ensures that memory accesses that > occur before the DSB instruction have __completed__ before the completion of > the DSB instruction. This seems appropriate for when you want to order things more strongly. I do not get why it's necessary here. > In doing this, it acts as a stronger barrier than a DMB > and all ordering that is created by a DMB with specific options is also generated > by a DSB with the same options. > > > > vq->split.avail_idx_shadow++; > > > vq->split.vring.avail->idx = cpu_to_virtio16(_vq->vdev, > > > vq->split.avail_idx_shadow); > > > -- > > > 2.44.0 > > > > Thanks, > Gavin