On Mon, Dec 17, 2018 at 11:56:59PM +0000, Steven Luong (sluong) wrote:
>
> On 12/17/18, 2:55 PM, "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
>
> On Thu, Dec 13, 2018 at 11:24:28PM +0000, Steven Luong (sluong) via Virtualization wrote:
> > Folks,
> >
> > We came across a memory race condition between the VPP vhost driver and the kernel
> > vhost. VPP is running a tap interface over the vhost backend. In this case, VPP is
> > acting as the vhost driver and the kernel vhost is acting as the vhost
> > device.
> >
> > In the kernel vhost's TX traffic direction, which is VPP's RX traffic direction,
> > kernel vhost is the producer and VPP is the consumer.
>
> Looking at vhost net kernel code, it seems to use the
> same terminology as virtio net.
> Can you pls clarify which ring number you use, 0 or 1?
>
> <SVL> 0. </SVL>

I will assume ring 0 below.

> > Kernel vhost places
> > traffic in kernel vhost's TX vring. VPP removes traffic from VPP's RX vring.
>
> So VPP makes buffers available and vhost kernel uses them?
>
> <SVL> Correct. </SVL>
>
> > It
> > is inevitable that the vring may become full under heavy traffic and the kernel
> > vhost may have to stop and wait for the guest (VPP) to empty the vring and to
> > refill the vring with descriptors. When that happens, kernel vhost clears
> > the bit in the vring's used flags to request an interrupt/notification. Due to
> > a shared memory race condition, VPP may miss the clearing of the vring's used
> > flags by kernel vhost and not send kernel vhost an interrupt after VPP
> > empties and refills the vring with new descriptors.
>
> So kernel vhost's wakeups are signalled through eventfd - I assume
> this is what you mean by an interrupt?
>
> To prevent the race that you describe, vhost has this code:
>
>         /* OK, now we need to know about added descriptors. */
>         if (!headcount) {
>                 if (unlikely(busyloop_intr)) {
>                         vhost_poll_queue(&vq->poll);
>                 } else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
>                         /* They have slipped one in as we were
>                          * doing that: check again. */
>                         vhost_disable_notify(&net->dev, vq);
>                         continue;
>                 }
>                 /* Nothing new? Wait for eventfd to tell us
>
> So the ring should be re-checked after notifications are enabled.
> If the ring is no longer empty, vhost will proceed to handle the ring.
> This code has been around for many years.
>
> Thus if VPP doesn't work with it then I suspect it does something
> differently, e.g. is it possible that it is missing a memory
> barrier after writing out the available buffers and before
> checking the flags?
>
> <SVL> Can the race condition be avoided totally? Here is the scenario.
>
> t0: VPP refills the vring with new descriptors, but does not yet set avail->idx to make them available to kernel vhost.
> t1: kernel vhost checks the vring, still sees no space available, but has not yet executed the line of code that clears used->flags.
> t2: VPP executes sfence, and sets avail->idx to make the descriptors available to kernel vhost.

At this point you need a full memory barrier. E.g. mfence, or a locked
instruction such as xor. Otherwise the read can get re-ordered before
the write.

> t3: VPP checks used->flags, it is still 1, so VPP skips sending the interrupt.
> t4: kernel vhost clears used->flags to indicate an interrupt is required. But VPP just missed it.

Fine, but at this point kernel vhost will recheck using vhost_get_avail
and process the buffers.
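To make the barrier requirement at t2/t3 concrete, here is a minimal sketch of
the driver-side publish-then-check path. This is not VPP's actual code: the
helper name, the kick eventfd parameter and the use of the ring layout from
<linux/virtio_ring.h> are assumptions for illustration only, and endianness
conversion is omitted.

#include <linux/virtio_ring.h>  /* struct vring, VRING_USED_F_NO_NOTIFY */
#include <stdint.h>
#include <unistd.h>

/* Driver (VPP) side of ring 0: publish refilled buffers, then decide
 * whether the device still wants a kick. */
static void publish_and_maybe_kick(struct vring *vr, int kickfd,
                                   uint16_t new_avail_idx)
{
        /* t2: publish the index; the release ordering keeps the earlier
         * descriptor and avail-ring writes ahead of it. */
        __atomic_store_n(&vr->avail->idx, new_avail_idx, __ATOMIC_RELEASE);

        /* Full barrier between the avail->idx store and the used->flags
         * load. On x86 this means mfence or a locked instruction; sfence
         * only orders stores against stores, so it does not stop the load
         * below from completing before the store above is visible. */
        __atomic_thread_fence(__ATOMIC_SEQ_CST);

        /* t3: only now is it safe to look at the suppression flag. */
        if (!(__atomic_load_n(&vr->used->flags, __ATOMIC_RELAXED) &
              VRING_USED_F_NO_NOTIFY)) {
                uint64_t one = 1;

                /* Device asked to be notified: kick the eventfd. */
                (void)write(kickfd, &one, sizeof(one));
        }
}

With only sfence at t2, the used->flags read at t3 can be satisfied before the
avail->idx store becomes globally visible, which is exactly the window in which
the kick is lost.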
So the lack of notification isn't a problem:

bool vhost_enable_notify(struct vhost_dev *dev, struct vhost_virtqueue *vq)
{
        __virtio16 avail_idx;
        int r;

        if (!(vq->used_flags & VRING_USED_F_NO_NOTIFY))
                return false;
        vq->used_flags &= ~VRING_USED_F_NO_NOTIFY;
        if (!vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX)) {
                r = vhost_update_used_flags(vq);
                if (r) {
                        vq_err(vq, "Failed to enable notification at %p: %d\n",
                               &vq->used->flags, r);
                        return false;
                }
        } else {
                r = vhost_update_avail_event(vq, vq->avail_idx);
                if (r) {
                        vq_err(vq, "Failed to update avail event index at %p: %d\n",
                               vhost_avail_event(vq), r);
                        return false;
                }
        }
        /* They could have slipped one in as we were doing that: make
         * sure it's written, then check again. */
        smp_mb();
        r = vhost_get_avail(vq, avail_idx, &vq->avail->idx);
        if (r) {
                vq_err(vq, "Failed to check avail idx at %p: %d\n",
                       &vq->avail->idx, r);
                return false;
        }

        return vhost16_to_cpu(vq, avail_idx) != vq->avail_idx;
}

> The result is kernel vhost got stuck even though the vring is filled with new descriptors.

Shouldn't be. Maybe you are missing a fence before reading used->flags?

> Steven
>
> </SVL>
>
> > Unfortunately, this missed
> > notification causes the kernel vhost to be stuck because once the kernel vhost
> > is waiting for an interrupt/notification from the guest, only an interrupt/
> > notification from the guest can get it to resume processing the new packets
> > the guest submits to the vring. This design seems vulnerable.
>
> The protocol right now relies on notifications never being lost.
>
> I can imagine an alternative where the device also re-checks
> the rings periodically, but that would involve running
> timers host side, which is only possible if there's a
> free processor that can handle them. Further,
> it will still lead to stalls and packet drops, which will
> be harder to debug.
>
> That's just my $.02, pls feel free to take it up with the virtio tc;
> maybe others will feel differently.
>
> > Should the kernel vhost
> > totally depend on the notification from the guest? Quoting the text from
> >
> > http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html
> >
> > /* The device uses this in used->flags to advise the driver: don't kick me
> >  * when you add a buffer. It's unreliable, so it's simply an
> >  * optimization. */
> > #define VIRTQ_USED_F_NO_NOTIFY 1
> >
> > I interpret that the notification is simply an optimization, not a reliable
> > notification mechanism.
>
> What was meant I think is that suppression is unreliable.
>
> > So the kernel vhost should not bet the farm on it.
> >
> > We encounter the same race condition in kernel vhost's RX traffic direction,
> > which causes the kernel vhost to be stuck due to a missed interrupt/notification,
> > although it is less frequent.
>
> For the reverse direction the spec actually has some pseudo-code and a suggestion
> about handling that race:
>
> For optimal performance, a driver MAY disable used buffer notifications while processing the used
> ring, but beware the problem of missing notifications between emptying the ring and reenabling
> notifications.
> This is usually handled by re-checking for more used buffers after notifications are
> re-enabled:
>
> virtq_disable_used_buffer_notifications(vq);
>
> for (;;) {
>         if (vq->last_seen_used != le16_to_cpu(virtq->used.idx)) {
>                 virtq_enable_used_buffer_notifications(vq);
>                 mb();
>
>                 if (vq->last_seen_used != le16_to_cpu(virtq->used.idx))
>                         break;
>
>                 virtq_disable_used_buffer_notifications(vq);
>         }
>
>         struct virtq_used_elem *e = virtq.used->ring[vq->last_seen_used%vsz];
>         process_buffer(e);
>         vq->last_seen_used++;
> }
>
> > Steven
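Rendered against the ring structures from <linux/virtio_ring.h>, that re-check
pattern for a driver draining a used ring might look roughly like the sketch
below. Again, this is not VPP's code: drain_used_ring(), process_buffer() and
the last_seen_used bookkeeping are made-up placeholders, the empty-ring tests
are written out explicitly, and endianness conversion is omitted. The mb() in
the spec's pseudo-code is the same full-barrier requirement as above, here
between the write that re-enables notifications and the re-read of used->idx.

#include <linux/virtio_ring.h>  /* struct vring, VRING_AVAIL_F_NO_INTERRUPT */
#include <stdint.h>

/* Placeholder: hand the completed buffer back to the stack. */
static void process_buffer(struct vring_used_elem *e)
{
        (void)e;
}

/* Driver side: drain the used ring with notifications suppressed,
 * re-enable them when the ring looks empty, then re-check before
 * going to sleep. */
static void drain_used_ring(struct vring *vr, uint16_t *last_seen_used)
{
        __atomic_store_n(&vr->avail->flags, VRING_AVAIL_F_NO_INTERRUPT,
                         __ATOMIC_RELAXED);

        for (;;) {
                if (*last_seen_used ==
                    __atomic_load_n(&vr->used->idx, __ATOMIC_ACQUIRE)) {
                        /* Ring looks empty: ask for notifications again... */
                        __atomic_store_n(&vr->avail->flags, 0, __ATOMIC_RELAXED);

                        /* ...full barrier (the spec's mb())... */
                        __atomic_thread_fence(__ATOMIC_SEQ_CST);

                        /* ...and re-check in case the device slipped a
                         * buffer in while notifications were disabled. */
                        if (*last_seen_used ==
                            __atomic_load_n(&vr->used->idx, __ATOMIC_ACQUIRE))
                                break;  /* really empty: wait for the call eventfd */

                        __atomic_store_n(&vr->avail->flags,
                                         VRING_AVAIL_F_NO_INTERRUPT,
                                         __ATOMIC_RELAXED);
                }

                process_buffer(&vr->used->ring[*last_seen_used % vr->num]);
                (*last_seen_used)++;
        }
}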