2010/12/26 Michael S. Tsirkin <mst@xxxxxxxxxx>:
> On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
>> 2010/12/24 Michael S. Tsirkin <mst@xxxxxxxxxx>:
>> > On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
>> >> 2010/12/16 Michael S. Tsirkin <mst@xxxxxxxxxx>:
>> >> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
>> >> >> 2010/12/16 Michael S. Tsirkin <mst@xxxxxxxxxx>:
>> >> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@xxxxxxxxxxxxx>:
>> >> >> >> > 2010/12/2 Michael S. Tsirkin <mst@xxxxxxxxxx>:
>> >> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@xxxxxxxxxx>:
>> >> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@xxxxxxxxxx>:
>> >> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert last_avail_idx with inuse if there are outstanding emulation.
>> >> >> >> >>> >> >>
>> >> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@xxxxxxxxxxxxx>
>> >> >> >> >>> >> >
>> >> >> >> >>> >> > This changes migration format, so it will break compatibility with existing drivers. More generally, I think migrating internal state that is not guest visible is always a mistake as it ties migration format to an internal implementation (yes, I know we do this sometimes, but we should at least try not to add such cases). I think the right thing to do in this case is to flush outstanding work when vm is stopped. Then, we are guaranteed that inuse is 0. I sent patches that do this for virtio net and block.
>> >> >> >> >>> >>
>> >> >> >> >>> >> Could you give me the link of your patches? I'd like to test whether they work with Kemari upon failover. If they do, I'm happy to drop this patch.
>> >> >> >> >>> >>
>> >> >> >> >>> >> Yoshi
>> >> >> >> >>> >
>> >> >> >> >>> > Look for this:
>> >> >> >> >>> > stable migration image on a stopped vm
>> >> >> >> >>> > sent on:
>> >> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
>> >> >> >> >>>
>> >> >> >> >>> Thanks for the info.
>> >> >> >> >>>
>> >> >> >> >>> However, The patch series above didn't solve the issue. In case of Kemari, inuse is mostly > 0 because it queues the output, and while last_avail_idx gets incremented immediately, not sending inuse makes the state inconsistent between Primary and Secondary.
>> >> >> >> >>
>> >> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
>> >> >> >> >
>> >> >> >> > I think we can calculate or prepare an internal last_avail_idx, and update the external when inuse is decremented. I'll try whether it work w/ w/o Kemari.
>> >> >> >>
>> >> >> >> Hi Michael,
>> >> >> >>
>> >> >> >> Could you please take a look at the following patch?
>> >> >> >
>> >> >> > Which version is this against?
>> >> >>
>> >> >> Oops. It should be very old.
>> >> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
>> >> >>
>> >> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
>> >> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@xxxxxxxxxxxxx>
>> >> >> >> Date:   Thu Dec 16 14:50:54 2010 +0900
>> >> >> >>
>> >> >> >>     virtio: update last_avail_idx when inuse is decreased.
>> >> >> >>
>> >> >> >>     Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@xxxxxxxxxxxxx>
>> >> >> >
>> >> >> > It would be better to have a commit description explaining why a change is made, and why it is correct, not just repeating what can be seen from the diff anyway.
>> >> >>
>> >> >> Sorry for being lazy here.
>> >> >>
>> >> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> >> index c8a0fc6..6688c02 100644
>> >> >> >> --- a/hw/virtio.c
>> >> >> >> +++ b/hw/virtio.c
>> >> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>> >> >> >>      wmb();
>> >> >> >>      trace_virtqueue_flush(vq, count);
>> >> >> >>      vring_used_idx_increment(vq, count);
>> >> >> >> +    vq->last_avail_idx += count;
>> >> >> >>      vq->inuse -= count;
>> >> >> >>  }
>> >> >> >>
>> >> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >> >> >>      unsigned int i, head, max;
>> >> >> >>      target_phys_addr_t desc_pa = vq->vring.desc;
>> >> >> >>
>> >> >> >> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
>> >> >> >> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
>> >> >> >>          return 0;
>> >> >> >>
>> >> >> >>      /* When we start there are none of either input nor output. */
>> >> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >> >> >>
>> >> >> >>      max = vq->vring.num;
>> >> >> >>
>> >> >> >> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
>> >> >> >> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
>> >> >> >>
>> >> >> >>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
>> >> >> >>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>> >> >> >
>> >> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
>> >> >>
>> >> >> I think there are two problems.
>> >> >>
>> >> >> 1. When to update last_avail_idx.
>> >> >> 2. The ordering issue you're mentioning below.
>> >> >>
>> >> >> The patch above is only trying to address 1 because last time you mentioned that modifying last_avail_idx upon save may break the guest, which I agree. If virtio_queue_empty and virtqueue_avail_bytes are only used internally, meaning invisible to the guest, I guess the approach above can be applied too.
>> >> >
>> >> > So IMHO 2 is the real issue. This is what was problematic with the save patch, otherwise of course changes in save are better than changes all over the codebase.
>> >>
>> >> All right. Then let's focus on 2 first.
>> >>
>> >> >> > Previous patch version sure looked simpler, and this seems functionally equivalent, so my question still stands: here it is rephrased in a different way:
>> >> >> >
>> >> >> > assume that we have in avail ring 2 requests at start of ring: A and B in this order
>> >> >> >
>> >> >> > host pops A, then B, then completes B and flushes
>> >> >> >
>> >> >> > now with this patch last_avail_idx will be 1, and then remote will get it, it will execute B again. As a result B will complete twice, and apparently A will never complete.
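To make the index arithmetic above concrete, here is a minimal standalone sketch (illustrative only, not QEMU code; the two counters simply mirror VirtQueue's last_avail_idx and inuse):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        /* Two requests A (entry 0) and B (entry 1) sit at the start of the avail ring. */
        uint16_t last_avail_idx = 0;   /* next avail entry to pop */
        uint16_t inuse = 0;            /* popped but not yet completed */

        last_avail_idx++; inuse++;     /* host pops A */
        last_avail_idx++; inuse++;     /* host pops B */

        inuse--;                       /* B completes and is flushed; A is still in flight */

        /* The index the save patch would transfer: 2 - 1 = 1. */
        uint16_t saved = last_avail_idx - inuse;
        printf("saved index = %u\n", saved);

        /* The remote resumes popping at entry 1, so B is executed a second time,
         * while entry 0 (A) is treated as already done and never completes:
         * a single index cannot say which of the in-flight entries finished. */
        return 0;
    }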
>> >> >> >
>> >> >> > This is what I was saying below: assuming that there are outstanding requests when we migrate, there is no way a single index can be enough to figure out which requests need to be handled and which are in flight already.
>> >> >> >
>> >> >> > We must add some kind of bitmask to tell us which is which.
>> >> >>
>> >> >> I should understand why this inversion can happen before solving the issue.
>> >> >
>> >> > It's a fundamental thing in virtio. I think it is currently only likely to happen with block, I think tap currently completes things in order. In any case relying on this in the frontend is a mistake.
>> >> >
>> >> >> Currently, how are you making virio-net to flush every requests for live migration? Is it qemu_aio_flush()?
>> >> >
>> >> > Think so.
>> >>
>> >> If qemu_aio_flush() is responsible for flushing the outstanding virtio-net requests, I'm wondering why it's a problem for Kemari. As I described in the previous message, Kemari queues the requests first. So in your example above, it should start with
>> >>
>> >> virtio-net: last_avail_idx 0 inuse 2
>> >> event-tap: {A,B}
>> >>
>> >> As you know, the requests are still in order because the net layer initiates them in order. This says nothing about completion order.
>> >>
>> >> In the first synchronization, the status above is transferred. In the next synchronization, the status will be as follows.
>> >>
>> >> virtio-net: last_avail_idx 1 inuse 1
>> >> event-tap: {B}
>> >
>> > OK, this answers the ordering question.
>>
>> Glad to hear that!
>>
>> > Another question: at this point we transfer this status: both event-tap and virtio ring have the command B, so the remote will have:
>> >
>> > virtio-net: inuse 0
>> > event-tap: {B}
>> >
>> > Is this right? This already seems to be a problem as when B completes inuse will go negative?
>>
>> I think the state above is wrong. inuse 0 means there shouldn't be any requests in event-tap. Note that the callback is called only when event-tap flushes the requests.
>>
>> > Next it seems that the remote virtio will resubmit B to event-tap. The remote will then have:
>> >
>> > virtio-net: inuse 1
>> > event-tap: {B, B}
>> >
>> > This looks kind of wrong ... will two packets go out?
>>
>> No. Currently, we're just replaying the requests with pio/mmio.
>
> You do? What purpose do the hooks in bdrv/net serve then? A placeholder for the future?

Not only for that reason. The hooks in bdrv/net are the main functions that queue requests and start synchronization. The pio/mmio hooks are there to record what initiated the requests monitored at the bdrv/net layer. I would like to remove the pio/mmio part once bdrv/net-level replay becomes possible.

Yoshi

>
>> In the situation above, it should be,
>>
>> virtio-net: inuse 1
>> event-tap: {B}
>>
>> >> Why? Because Kemari flushes the first virtio-net request using qemu_aio_flush() before each synchronization. If qemu_aio_flush() doesn't guarantee the order, what you pointed out would be problematic. So in the final synchronization, the state should be,
>> >>
>> >> virtio-net: last_avail_idx 2 inuse 0
>> >> event-tap: {}
>> >>
>> >> where A,B were completed in order.
>> >>
>> >> Yoshi
>> >
>> > It might be better to discuss block because that's where requests can complete out of order.
>>
>> It's the same as net. We queue requests and call bdrv_flush per request sent to the block layer.
>> So there shouldn't be any inversion.
>>
>> > So let me see if I understand:
>> > - each command passed to event tap is queued by it, it is not passed directly to the backend
>> > - later requests are passed to the backend, always in the same order that they were submitted
>> > - each synchronization point flushes all requests passed to the backend so far
>> > - each synchronization transfers all requests not passed to the backend, to the remote, and they are replayed there
>>
>> Correct.
>>
>> > Now to analyse this for correctness I am looking at the original patch because it is smaller so easier to analyse and I think it is functionally equivalent, correct me if I am wrong in this.
>>
>> So you think decreasing last_avail_idx upon save is better than updating it in the callback?
>>
>> > So the reason there's no out of order issue is this (and might be a good thing to put in commit log or a comment somewhere):
>>
>> I've done some in the latest patch. Please point it out if it wasn't enough.
>>
>> > At point of save callback event tap has flushed commands passed to the backend already. Thus at the point of the save callback if a command has completed all previous commands have been flushed and completed.
>> >
>> > Therefore inuse is in fact the # of requests passed to event tap but not yet passed to the backend (for non-event tap case all commands are passed to the backend immediately and because of this inuse is 0) and these are the last inuse commands submitted.
>> >
>> > Right?
>>
>> Yep.
>>
>> > Now a question:
>> >
>> > When we pass last_used_index - inuse to the remote, the remote virtio will resubmit the request. Since request is also passed by event tap, we get the request twice, why is this not a problem?
>>
>> It's not a problem because event-tap currently replays with pio/mmio only, as I mentioned above. Although event-tap receives information about the queued requests, it won't pass it to the backend. The reason is the problem in setting the callbacks which are specific to devices on the secondary. These are pointers, and even worse, are usually static functions, which event-tap has no way to restore upon failover. I do want to change event-tap replay to be this way in the future; pio/mmio replay is implemented for now.
>>
>> Thanks,
>>
>> Yoshi
>>
>> >> >> >> >>> I'm wondering why last_avail_idx is OK to send but not inuse.
>> >> >> >> >>
>> >> >> >> >> last_avail_idx is at some level a mistake, it exposes part of our internal implementation, but it does *also* express a guest observable state.
>> >> >> >> >>
>> >> >> >> >> Here's the problem that it solves: just looking at the rings in virtio there is no way to detect that a specific request has already been completed. And the protocol forbids completing the same request twice.
>> >> >> >> >>
>> >> >> >> >> Our implementation always starts processing the requests in order, and since we flush outstanding requests before save, it works to just tell the remote 'process only requests after this place'.
>> >> >> >> >>
>> >> >> >> >> But there's no such requirement in the virtio protocol, so to be really generic we could add a bitmask of valid avail ring entries that did not complete yet.
>> >> >> >> >> This would be the exact representation of the guest observable state. In practice we have rings of up to 512 entries. That's 64 bytes per ring, not a lot at all.
>> >> >> >> >>
>> >> >> >> >> However, if we ever do change the protocol to send the bitmask, we would need some code to resubmit requests out of order, so it's not trivial.
>> >> >> >> >>
>> >> >> >> >> Another minor mistake with last_avail_idx is that it has some redundancy: the high bits in the index (> vq size) are not necessary as they can be got from avail idx. There's a consistency check in load but we really should try to use formats that are always consistent.
>> >> >> >> >>
>> >> >> >> >>> The following patch does the same thing as original, yet keeps the format of the virtio. It shouldn't break live migration either because inuse should be 0.
>> >> >> >> >>>
>> >> >> >> >>> Yoshi
>> >> >> >> >>
>> >> >> >> >> Question is, can you flush to make inuse 0 in kemari too? And if not, how do you handle the fact that some requests are in flight on the primary?
>> >> >> >> >
>> >> >> >> > Although we try flushing requests one by one making inuse 0, there are cases when it failovers to the secondary when inuse isn't 0. We handle these in-flight requests on the primary by replaying them on the secondary.
>> >> >> >> >
>> >> >> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> >> >>> index c8a0fc6..875c7ca 100644
>> >> >> >> >>> --- a/hw/virtio.c
>> >> >> >> >>> +++ b/hw/virtio.c
>> >> >> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >> >> >> >>>      qemu_put_be32(f, i);
>> >> >> >> >>>
>> >> >> >> >>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>> >> >> >> >>> +        uint16_t last_avail_idx;
>> >> >> >> >>> +
>> >> >> >> >>>          if (vdev->vq[i].vring.num == 0)
>> >> >> >> >>>              break;
>> >> >> >> >>>
>> >> >> >> >>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>> >> >> >> >>> +
>> >> >> >> >>>          qemu_put_be32(f, vdev->vq[i].vring.num);
>> >> >> >> >>>          qemu_put_be64(f, vdev->vq[i].pa);
>> >> >> >> >>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> >> >>> +        qemu_put_be16s(f, &last_avail_idx);
>> >> >> >> >>>          if (vdev->binding->save_queue)
>> >> >> >> >>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >> >> >> >>>      }
>> >> >> >> >>
>> >> >> >> >> This looks wrong to me. Requests can complete in any order, can they not? So if request 0 did not complete and request 1 did, you send avail - inuse and on the secondary you will process and complete request 1 the second time, crashing the guest.
>> >> >> >> >
>> >> >> >> > In case of Kemari, no. We sit between devices and net/block, and queue the requests. After completing each transaction, we flush the requests one by one. So there won't be completion inversion, and therefore it won't be visible to the guest.
>> >> >> >> >
>> >> >> >> > Yoshi
>> >> >> >> >
>> >> >> >> >>> >> >> ---
>> >> >> >> >>> >> >>  hw/virtio.c | 8 +++++++-
>> >> >> >> >>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
>> >> >> >> >>> >> >>
>> >> >> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> >> >>> >> >> index 849a60f..5509644 100644
>> >> >> >> >>> >> >> --- a/hw/virtio.c
>> >> >> >> >>> >> >> +++ b/hw/virtio.c
>> >> >> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>> >> >> >> >>> >> >>      VRing vring;
>> >> >> >> >>> >> >>      target_phys_addr_t pa;
>> >> >> >> >>> >> >>      uint16_t last_avail_idx;
>> >> >> >> >>> >> >> -    int inuse;
>> >> >> >> >>> >> >> +    uint16_t inuse;
>> >> >> >> >>> >> >>      uint16_t vector;
>> >> >> >> >>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>> >> >> >> >>> >> >>      VirtIODevice *vdev;
>> >> >> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >> >> >> >>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
>> >> >> >> >>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
>> >> >> >> >>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> >> >>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
>> >> >> >> >>> >> >>          if (vdev->binding->save_queue)
>> >> >> >> >>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >> >> >> >>> >> >>      }
>> >> >> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>> >> >> >> >>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
>> >> >> >> >>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
>> >> >> >> >>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> >> >>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
>> >> >> >> >>> >> >> +
>> >> >> >> >>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
>> >> >> >> >>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>> >> >> >> >>> >> >> +        vdev->vq[i].inuse = 0;
>> >> >> >> >>> >> >>
>> >> >> >> >>> >> >>          if (vdev->vq[i].pa) {
>> >> >> >> >>> >> >>              virtqueue_init(&vdev->vq[i]);
>> >> >> >> >>> >> >> --
>> >> >> >> >>> >> >> 1.7.1.2
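As a follow-up to the bitmask idea discussed above (an exact representation of which avail entries are still in flight, 64 bytes for a 512-entry ring): a rough sketch of what such a per-ring bitmap could look like. This is purely illustrative and not against any QEMU tree; the structure and helper names are invented here.

    #include <stdint.h>

    #define VQ_MAX 512                      /* ring size bound mentioned above */

    struct vq_inflight {
        uint32_t bits[VQ_MAX / 32];         /* 512 bits = 64 bytes per ring */
    };

    /* Set when a head is popped from the avail ring. */
    static inline void inflight_mark(struct vq_inflight *f, unsigned head)
    {
        f->bits[head / 32] |= 1u << (head % 32);
    }

    /* Clear when the request for that head completes. */
    static inline void inflight_clear(struct vq_inflight *f, unsigned head)
    {
        f->bits[head / 32] &= ~(1u << (head % 32));
    }

    static inline int inflight_test(const struct vq_inflight *f, unsigned head)
    {
        return (f->bits[head / 32] >> (head % 32)) & 1;
    }

    /* On the destination, resubmit exactly the heads that were still marked
     * when the source saved its state; vq_resubmit_head() stands in for
     * re-injecting one request into the device. */
    static void inflight_resubmit(const struct vq_inflight *f, unsigned ring_num,
                                  void (*vq_resubmit_head)(unsigned head))
    {
        for (unsigned head = 0; head < ring_num; head++) {
            if (inflight_test(f, head)) {
                vq_resubmit_head(head);
            }
        }
    }

With something like this transferred instead of a bare index, the destination resubmits exactly the marked heads, so completions no longer have to happen in ring order on the source.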