2010/12/17 Yoshiaki Tamura <tamura.yoshiaki@xxxxxxxxxxxxx>:
> 2010/12/16 Michael S. Tsirkin <mst@xxxxxxxxxx>:
>> On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
>>> 2010/12/16 Michael S. Tsirkin <mst@xxxxxxxxxx>:
>>> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
>>> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@xxxxxxxxxxxxx>:
>>> >> > 2010/12/2 Michael S. Tsirkin <mst@xxxxxxxxxx>:
>>> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>>> >> >>> 2010/11/28 Michael S. Tsirkin <mst@xxxxxxxxxx>:
>>> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>>> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@xxxxxxxxxx>:
>>> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>>> >> >>> >> >> Modify the inuse type to uint16_t, let save/load handle it, and revert
>>> >> >>> >> >> last_avail_idx by inuse if there are outstanding requests.
>>> >> >>> >> >>
>>> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@xxxxxxxxxxxxx>
>>> >> >>> >> >
>>> >> >>> >> > This changes the migration format, so it will break compatibility with
>>> >> >>> >> > existing drivers. More generally, I think migrating internal
>>> >> >>> >> > state that is not guest visible is always a mistake,
>>> >> >>> >> > as it ties the migration format to an internal implementation
>>> >> >>> >> > (yes, I know we do this sometimes, but we should at least
>>> >> >>> >> > try not to add such cases). I think the right thing to do in this case
>>> >> >>> >> > is to flush outstanding work when the vm is stopped. Then we are
>>> >> >>> >> > guaranteed that inuse is 0.
>>> >> >>> >> > I sent patches that do this for virtio net and block.
>>> >> >>> >>
>>> >> >>> >> Could you give me the link to your patches? I'd like to test
>>> >> >>> >> whether they work with Kemari upon failover. If they do, I'm
>>> >> >>> >> happy to drop this patch.
>>> >> >>> >>
>>> >> >>> >> Yoshi
>>> >> >>> >
>>> >> >>> > Look for this:
>>> >> >>> > stable migration image on a stopped vm
>>> >> >>> > sent on:
>>> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
>>> >> >>>
>>> >> >>> Thanks for the info.
>>> >> >>>
>>> >> >>> However, the patch series above didn't solve the issue. In
>>> >> >>> the case of Kemari, inuse is mostly > 0 because it queues the
>>> >> >>> output, and while last_avail_idx gets incremented
>>> >> >>> immediately, not sending inuse makes the state inconsistent
>>> >> >>> between Primary and Secondary.
>>> >> >>
>>> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
>>> >> >
>>> >> > I think we can calculate or prepare an internal last_avail_idx,
>>> >> > and update the external one when inuse is decremented. I'll try
>>> >> > whether that works w/ and w/o Kemari.
>>> >>
>>> >> Hi Michael,
>>> >>
>>> >> Could you please take a look at the following patch?
>>> >
>>> > Which version is this against?
>>>
>>> Oops. It should be very old:
>>> 67f895bfe69f323b427b284430b6219c8a62e8d4
>>>
>>> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
>>> >> Author: Yoshiaki Tamura <tamura.yoshiaki@xxxxxxxxxxxxx>
>>> >> Date:   Thu Dec 16 14:50:54 2010 +0900
>>> >>
>>> >>     virtio: update last_avail_idx when inuse is decreased.
>>> >>
>>> >>     Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@xxxxxxxxxxxxx>
>>> >
>>> > It would be better to have a commit description explaining why a change
>>> > is made, and why it is correct, not just repeating what can be seen from
>>> > the diff anyway.
>>>
>>> Sorry for being lazy here.
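(As an aside, to check my understanding of the flush-on-stop approach
above: I picture it as roughly the sketch below. This is only my
reading, not the actual patches; virtio_flush_on_stop and where it
gets registered are made-up names.)

#include "sysemu.h"     /* qemu_add_vm_change_state_handler() */
#include "qemu-aio.h"   /* qemu_aio_flush() */

/* Drain outstanding requests whenever the vm stops, so that
 * inuse == 0 by the time virtio_save() runs. */
static void virtio_flush_on_stop(void *opaque, int running, int reason)
{
    if (!running) {
        qemu_aio_flush();
    }
}

/* registered once at device init, e.g.:
 *     qemu_add_vm_change_state_handler(virtio_flush_on_stop, vdev);
 */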
>>>
>>> >> diff --git a/hw/virtio.c b/hw/virtio.c
>>> >> index c8a0fc6..6688c02 100644
>>> >> --- a/hw/virtio.c
>>> >> +++ b/hw/virtio.c
>>> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>>> >>      wmb();
>>> >>      trace_virtqueue_flush(vq, count);
>>> >>      vring_used_idx_increment(vq, count);
>>> >> +    vq->last_avail_idx += count;
>>> >>      vq->inuse -= count;
>>> >>  }
>>> >>
>>> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>>> >>      unsigned int i, head, max;
>>> >>      target_phys_addr_t desc_pa = vq->vring.desc;
>>> >>
>>> >> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
>>> >> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
>>> >>          return 0;
>>> >>
>>> >>      /* When we start there are none of either input nor output. */
>>> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>>> >>
>>> >>      max = vq->vring.num;
>>> >>
>>> >> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
>>> >> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
>>> >>
>>> >>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
>>> >>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>>> >>
>>> >
>>> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
>>>
>>> I think there are two problems.
>>>
>>> 1. When to update last_avail_idx.
>>> 2. The ordering issue you're mentioning below.
>>>
>>> The patch above is only trying to address 1, because last time you
>>> mentioned that modifying last_avail_idx upon save may break the
>>> guest, which I agree with. If virtio_queue_empty and
>>> virtqueue_avail_bytes are only used internally, meaning invisible
>>> to the guest, I guess the approach above can be applied too.
>>
>> So IMHO 2 is the real issue. This is what was problematic
>> with the save patch; otherwise, of course, changes in save
>> are better than changes all over the codebase.
>
> All right. Then let's focus on 2 first.
>
>>> > Previous patch version sure looked simpler, and this seems functionally
>>> > equivalent, so my question still stands; here it is rephrased in a
>>> > different way:
>>> >
>>> > Assume that we have 2 requests at the start of the avail ring: A and B,
>>> > in this order.
>>> >
>>> > The host pops A, then B, then completes B and flushes.
>>> >
>>> > Now with this patch last_avail_idx will be 1, and when the remote
>>> > gets it, it will execute B again. As a result
>>> > B will complete twice, and apparently A will never complete.
>>> >
>>> > This is what I was saying below: assuming that there are
>>> > outstanding requests when we migrate, there is no way
>>> > a single index can be enough to figure out which requests
>>> > need to be handled and which are in flight already.
>>> >
>>> > We must add some kind of bitmask to tell us which is which.
>>>
>>> I should understand why this inversion can happen before solving
>>> the issue.
>>
>> It's a fundamental thing in virtio.
>> I think it is currently only likely to happen with block; I think tap
>> currently completes things in order. In any case, relying on this in the
>> frontend is a mistake.
>>
>>> Currently, how are you making virtio-net flush
>>> every request for live migration? Is it qemu_aio_flush()?
>>
>> Think so.
>
> If qemu_aio_flush() is responsible for flushing the outstanding
> virtio-net requests, I'm wondering why it's a problem for Kemari.
> As I described in the previous message, Kemari queues the
> requests first.
> So in your example above, it should start with:
>
> virtio-net: last_avail_idx 0 inuse 2
> event-tap:  {A,B}
>
> As you know, the requests are still in order, because the net
> layer initiates them in order; this is about initiation, not completion.
>
> In the first synchronization, the status above is transferred. In
> the next synchronization, the status will be as follows:
>
> virtio-net: last_avail_idx 1 inuse 1
> event-tap:  {B}
>
> Why? Because Kemari flushes the first virtio-net request using
> qemu_aio_flush() before each synchronization. If
> qemu_aio_flush() doesn't guarantee the order, what you pointed
> out would be a problem. So in the final synchronization, the
> state should be:
>
> virtio-net: last_avail_idx 2 inuse 0
> event-tap:  {}
>
> where A and B were completed in order.
>
> Yoshi

Hi Michael,

Please let me know if the discussion has gone wrong or my
explanation was incorrect. I believe live migration shouldn't be a
problem either, because it would flush until inuse gets to 0.

Yoshi

>
>
>>
>>> >
>>> >> >
>>> >> >>
>>> >> >>> I'm wondering why
>>> >> >>> last_avail_idx is OK to send but not inuse.
>>> >> >>
>>> >> >> last_avail_idx is at some level a mistake, it exposes part of
>>> >> >> our internal implementation, but it does *also* express
>>> >> >> a guest observable state.
>>> >> >>
>>> >> >> Here's the problem that it solves: just looking at the rings in virtio,
>>> >> >> there is no way to detect that a specific request has already been
>>> >> >> completed. And the protocol forbids completing the same request twice.
>>> >> >>
>>> >> >> Our implementation always starts processing the requests
>>> >> >> in order, and since we flush outstanding requests
>>> >> >> before save, it works to just tell the remote 'process only requests
>>> >> >> after this place'.
>>> >> >>
>>> >> >> But there's no such requirement in the virtio protocol,
>>> >> >> so to be really generic we could add a bitmask of valid avail
>>> >> >> ring entries that did not complete yet. This would be
>>> >> >> the exact representation of the guest observable state.
>>> >> >> In practice we have rings of up to 512 entries.
>>> >> >> That's 64 bytes per ring, not a lot at all.
>>> >> >>
>>> >> >> However, if we ever do change the protocol to send the bitmask,
>>> >> >> we would need some code to resubmit requests
>>> >> >> out of order, so it's not trivial.
>>> >> >>
>>> >> >> Another minor mistake with last_avail_idx is that it has
>>> >> >> some redundancy: the high bits in the index
>>> >> >> (> vq size) are not necessary, as they can be
>>> >> >> derived from avail idx. There's a consistency check
>>> >> >> in load, but we really should try to use formats
>>> >> >> that are always consistent.
>>> >> >>
>>> >> >>> The following patch does the same thing as the original, yet
>>> >> >>> keeps the format of virtio. It shouldn't break live
>>> >> >>> migration either, because inuse should be 0.
>>> >> >>>
>>> >> >>> Yoshi
>>> >> >>
>>> >> >> Question is, can you flush to make inuse 0 in Kemari too?
>>> >> >> And if not, how do you handle the fact that some requests
>>> >> >> are in flight on the primary?
>>> >> >
>>> >> > Although we try flushing requests one by one, making inuse 0,
>>> >> > there are cases where it fails over to the secondary while inuse
>>> >> > isn't 0. We handle the in-flight requests on the primary by
>>> >> > replaying them on the secondary.
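(To make sure I follow the bitmask idea quoted above, here is roughly
what I picture; purely illustrative, all names are made up:)

#include <stdint.h>
#include <stdbool.h>

#define VIRTQUEUE_MAX_SIZE 512

typedef struct VRingInflight {
    /* one bit per avail ring entry that was popped but has not
     * completed yet: 512 / 8 = 64 bytes per ring */
    uint8_t map[VIRTQUEUE_MAX_SIZE / 8];
} VRingInflight;

/* mark an entry when it is popped (busy) or completed (!busy) */
static void inflight_set(VRingInflight *m, unsigned idx, bool busy)
{
    if (busy) {
        m->map[idx / 8] |= 1 << (idx % 8);
    } else {
        m->map[idx / 8] &= ~(1 << (idx % 8));
    }
}

/* on load, the remote would resubmit exactly the entries whose bit
 * is still set -- possibly out of order, which is the non-trivial
 * part mentioned above */
static bool inflight_test(const VRingInflight *m, unsigned idx)
{
    return m->map[idx / 8] & (1 << (idx % 8));
}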
>>> >> >
>>> >> >>
>>> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
>>> >> >>> index c8a0fc6..875c7ca 100644
>>> >> >>> --- a/hw/virtio.c
>>> >> >>> +++ b/hw/virtio.c
>>> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>>> >> >>>      qemu_put_be32(f, i);
>>> >> >>>
>>> >> >>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>>> >> >>> +        uint16_t last_avail_idx;
>>> >> >>> +
>>> >> >>>          if (vdev->vq[i].vring.num == 0)
>>> >> >>>              break;
>>> >> >>>
>>> >> >>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>>> >> >>> +
>>> >> >>>          qemu_put_be32(f, vdev->vq[i].vring.num);
>>> >> >>>          qemu_put_be64(f, vdev->vq[i].pa);
>>> >> >>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >>> +        qemu_put_be16s(f, &last_avail_idx);
>>> >> >>>          if (vdev->binding->save_queue)
>>> >> >>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>>> >> >>>      }
>>> >> >>>
>>> >> >>
>>> >> >> This looks wrong to me. Requests can complete in any order, can they
>>> >> >> not? So if request 0 did not complete and request 1 did,
>>> >> >> you send avail - inuse, and on the secondary you will process and
>>> >> >> complete request 1 a second time, crashing the guest.
>>> >> >
>>> >> > In the case of Kemari, no. We sit between the devices and net/block, and
>>> >> > queue the requests. After completing each transaction, we flush
>>> >> > the requests one by one. So there won't be completion inversion,
>>> >> > and therefore it won't be visible to the guest.
>>> >> >
>>> >> > Yoshi
>>> >> >
>>> >> >>
>>> >> >>>
>>> >> >>> >
>>> >> >>> >> >
>>> >> >>> >> >> ---
>>> >> >>> >> >>  hw/virtio.c |    8 +++++++-
>>> >> >>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
>>> >> >>> >> >>
>>> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>>> >> >>> >> >> index 849a60f..5509644 100644
>>> >> >>> >> >> --- a/hw/virtio.c
>>> >> >>> >> >> +++ b/hw/virtio.c
>>> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>>> >> >>> >> >>      VRing vring;
>>> >> >>> >> >>      target_phys_addr_t pa;
>>> >> >>> >> >>      uint16_t last_avail_idx;
>>> >> >>> >> >> -    int inuse;
>>> >> >>> >> >> +    uint16_t inuse;
>>> >> >>> >> >>      uint16_t vector;
>>> >> >>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>>> >> >>> >> >>      VirtIODevice *vdev;
>>> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>>> >> >>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
>>> >> >>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
>>> >> >>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
>>> >> >>> >> >>          if (vdev->binding->save_queue)
>>> >> >>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>>> >> >>> >> >>      }
>>> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>>> >> >>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
>>> >> >>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
>>> >> >>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
>>> >> >>> >> >> +
>>> >> >>> >> >> +        /* revert last_avail_idx if there is outstanding emulation */
>>> >> >>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>>> >> >>> >> >> +        vdev->vq[i].inuse = 0;
>>> >> >>> >> >>
>>> >> >>> >> >>          if (vdev->vq[i].pa) {
>>> >> >>> >> >>              virtqueue_init(&vdev->vq[i]);
>>> >> >>> >> >> --
>>> >> >>> >> >> 1.7.1.2
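P.S. Continuing the bitmask sketch from above, I imagine the
migration-stream side would look something like the following. Again
this is illustrative only; vq->inflight is a made-up field, not a
proposed wire format.

static void virtio_save_inflight(QEMUFile *f, VirtQueue *vq)
{
    /* 64 bytes per ring, as noted above */
    qemu_put_buffer(f, vq->inflight.map, sizeof(vq->inflight.map));
}

static void virtio_load_inflight(QEMUFile *f, VirtQueue *vq)
{
    qemu_get_buffer(f, vq->inflight.map, sizeof(vq->inflight.map));
    /* the load side would then walk the map and resubmit every entry
     * whose bit is still set, instead of guessing from a single
     * last_avail_idx */
}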