Re: [PATCH] virtio_ring: Fix the stale index in available ring

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Wed, 20 Mar 2024 03:14:52 -0400

On Wed, Mar 20, 2024 at 03:24:16PM +1000, Gavin Shan wrote:
> On 3/20/24 10:49, Michael S. Tsirkin wrote:>
> > I think you are wasting the time with these tests. Even if it helps what
> > does this tell us? Try setting a flag as I suggested elsewhere.
> > Then check it in vhost.
> > Or here's another idea - possibly easier. Copy the high bits from index
> > into ring itself. Then vhost can check that head is synchronized with
> > index.
> > 
> > Warning: completely untested, not even compiled. But should give you
> > the idea. If this works btw we should consider making this official in
> > the spec.
> > 
> > 
> >   static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 6f7e5010a673..79456706d0bd 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -685,7 +685,8 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> >   	/* Put entry in available array (but don't update avail->idx until they
> >   	 * do sync). */
> >   	avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
> > -	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, head);
> > +	u16 headwithflag = head | (q->split.avail_idx_shadow & ~(vq->split.vring.num - 1));
> > +	vq->split.vring.avail->ring[avail] = cpu_to_virtio16(_vq->vdev, headwithflag);
> >   	/* Descriptors and available array need to be set before we expose the
> >   	 * new available array entries. */
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 045f666b4f12..bd8f7c763caa 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -1299,8 +1299,15 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> >   static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> >   				       __virtio16 *head, int idx)
> >   {
> > -	return vhost_get_avail(vq, *head,
> > +	unsigned i = idx;
> > +	unsigned flag = i & ~(vq->num - 1);
> > +	unsigned val = vhost_get_avail(vq, *head,
> >   			       &vq->avail->ring[idx & (vq->num - 1)]);
> > +	unsigned valflag = val & ~(vq->num - 1);
> > +
> > +	WARN_ON(valflag != flag);
> > +
> > +	return val & (vq->num - 1);
> >   }
> 
> Thanks, Michael. The code is already self-explanatory.

Apparently not. See below.

> Since vq->num is 256, I just
> squeezed the last_avail_idx to the high byte. Unfortunately, I'm unable to hit
> the WARN_ON(). Does it mean the low byte is stale (or corrupted) while the high
> byte is still correct and valid?

I would find this very surprising.

>         avail = vq->split.avail_idx_shadow & (vq->split.vring.num - 1);
>         vq->split.vring.avail->ring[avail] =
>                 cpu_to_virtio16(_vq->vdev, head | (avail << 8));
> 
> 
>         head = vhost16_to_cpu(vq, ring_head);
>         WARN_ON((head >> 8) != (vq->last_avail_idx % vq->num));
>         head = head & 0xff;

This code misses the point of the test.
The high value you store now is exactly the same each time you
go around the ring. E.g. at beginning of ring you now always
store 0 as high byte. So a stale value will not be detected/
The high value you store now is exactly the same each time you
go around the ring. E.g. at beginning of ring you now always
store 0 as high byte. So a stale value will not be detected.

The value you are interested in should change
each time you go around the ring a full circle.
Thus you want exactly the *high byte* of avail idx -
this is what my patch did - your patch instead
stored and compared the low byte.

The advantage of this debugging patch is that it will detect the issue immediately
not after guest detected the problem in the used ring.
For example, you can add code to re-read the value, or dump the whole
ring.

> One question: Does QEMU has any chance writing data to the available queue when
> vhost is enabled? My previous understanding is no, the queue is totally owned by
> vhost instead of QEMU.

It shouldn't do it normally.

> Before this patch was posted, I had debugging code to record last 16 transactions
> to the available and used queue from guest and host side. It did reveal the wrong
> head was fetched from the available queue.

Oh nice that's a very good hint. And is this still reproducible?

> [   11.785745] ================ virtqueue_get_buf_ctx_split ================
> [   11.786238] virtio_net virtio0: output.0:id 74 is not a head!
> [   11.786655] head to be released: 036 077
> [   11.786952]
> [   11.786952] avail_idx:
> [   11.787234] 000  63985  <--
> [   11.787237] 001  63986
> [   11.787444] 002  63987
> [   11.787632] 003  63988
> [   11.787821] 004  63989
> [   11.788006] 005  63990
> [   11.788194] 006  63991
> [   11.788381] 007  63992
> [   11.788567] 008  63993
> [   11.788772] 009  63994
> [   11.788957] 010  63995
> [   11.789141] 011  63996
> [   11.789327] 012  63997
> [   11.789515] 013  63998
> [   11.789701] 014  63999
> [   11.789886] 015  64000
> [   11.790068]
> [   11.790068] avail_head:
> [   11.790529] 000  075  <--
> [   11.790718] 001  036
> [   11.790890] 002  077
> [   11.791061] 003  129
> [   11.791231] 004  072
> [   11.791400] 005  130
> [   11.791574] 006  015
> [   11.791748] 007  074
> [   11.791918] 008  130
> [   11.792094] 009  130
> [   11.792263] 010  074
> [   11.792437] 011  015
> [   11.792617] 012  072
> [   11.792788] 013  129
> [   11.792961] 014  077    // The last two heads from guest to host: 077, 036
> [   11.793134] 015  036

Maybe dump the avail ring from guest to make sure
it matches the expected contents?

> [root@nvidia-grace-hopper-05 qemu.main]# cat /proc/vhost
> 
> avail_idx
> 000  63998
> 001  64000
> 002  63954  <---
> 003  63955
> 004  63956
> 005  63974
> 006  63981
> 007  63984
> 008  63986
> 009  63987
> 010  63988
> 011  63989
> 012  63992
> 013  63993
> 014  63995
> 015  63997
> 
> avail_head
> 000  074
> 001  015
> 002  072
> 003  129
> 004  074            // The last two heads seen by vhost is: 074, 036
> 005  036
> 006  075  <---

And is 074 the previous (stale) value in the ring?

> 007  036
> 008  077
> 009  129
> 010  072
> 011  130
> 012  015
> 013  074
> 014  130
> 015  130

> used_idx
> 000  64000
> 001  63882  <---
> 002  63889
> 003  63891
> 004  63898
> 005  63936
> 006  63942
> 007  63946
> 008  63949
> 009  63953
> 010  63957
> 011  63981
> 012  63990
> 013  63992
> 014  63993
> 015  63999
> 
> used_head
> 000  072
> 001  129
> 002  074          // The last two heads published to guest is: 074, 036
> 003  036
> 004  075  <---
> 005  036
> 006  077
> 007  129
> 008  072
> 009  130
> 010  015
> 011  074
> 012  130
> 013  130
> 014  074
> 015  015
> 
> Thanks,
> Gavin

I like this debugging patch, it might make sense to
polish it up and include in production. We'll want it in
debugfs naturally not /proc/vhost.

But all in good time.

> 
>