Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.

"Michael S. Tsirkin" <mst@xxxxxxxxxx> · Fri, 24 Dec 2010 15:21:35 +0200

On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
> 2010/12/24 Michael S. Tsirkin <mst@xxxxxxxxxx>:
> > On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
> >> 2010/12/16 Michael S. Tsirkin <mst@xxxxxxxxxx>:
> >> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
> >> >> 2010/12/16 Michael S. Tsirkin <mst@xxxxxxxxxx>:
> >> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> >> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@xxxxxxxxxxxxx>:
> >> >> >> > 2010/12/2 Michael S. Tsirkin <mst@xxxxxxxxxx>:
> >> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@xxxxxxxxxx>:
> >> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@xxxxxxxxxx>:
> >> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >> >> >> >>> >> >>
> >> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@xxxxxxxxxxxxx>
> >> >> >> >>> >> >
> >> >> >> >>> >> > This changes migration format, so it will break compatibility with
> >> >> >> >>> >> > existing drivers. More generally, I think migrating internal
> >> >> >> >>> >> > state that is not guest visible is always a mistake
> >> >> >> >>> >> > as it ties migration format to an internal implementation
> >> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
> >> >> >> >>> >> > try not to add such cases).  I think the right thing to do in this case
> >> >> >> >>> >> > is to flush outstanding
> >> >> >> >>> >> > work when vm is stopped.  Then, we are guaranteed that inuse is 0.
> >> >> >> >>> >> > I sent patches that do this for virtio net and block.
> >> >> >> >>> >>
> >> >> >> >>> >> Could you give me the link of your patches?  I'd like to test
> >> >> >> >>> >> whether they work with Kemari upon failover.  If they do, I'm
> >> >> >> >>> >> happy to drop this patch.
> >> >> >> >>> >>
> >> >> >> >>> >> Yoshi
> >> >> >> >>> >
> >> >> >> >>> > Look for this:
> >> >> >> >>> > stable migration image on a stopped vm
> >> >> >> >>> > sent on:
> >> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >> >> >> >>>
> >> >> >> >>> Thanks for the info.
> >> >> >> >>>
> >> >> >> >>> However, The patch series above didn't solve the issue.  In
> >> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >> >> >> >>> output, and while last_avail_idx gets incremented
> >> >> >> >>> immediately, not sending inuse makes the state inconsistent
> >> >> >> >>> between Primary and Secondary.
> >> >> >> >>
> >> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >> >> >> >
> >> >> >> > I think we can calculate or prepare an internal last_avail_idx,
> >> >> >> > and update the external when inuse is decremented.  I'll try
> >> >> >> > whether it work w/ w/o Kemari.
> >> >> >>
> >> >> >> Hi Michael,
> >> >> >>
> >> >> >> Could you please take a look at the following patch?
> >> >> >
> >> >> > Which version is this against?
> >> >>
> >> >> Oops.  It should be very old.
> >> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
> >> >>
> >> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> >> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@xxxxxxxxxxxxx>
> >> >> >> Date:   Thu Dec 16 14:50:54 2010 +0900
> >> >> >>
> >> >> >>     virtio: update last_avail_idx when inuse is decreased.
> >> >> >>
> >> >> >>     Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@xxxxxxxxxxxxx>
> >> >> >
> >> >> > It would be better to have a commit description explaining why a change
> >> >> > is made, and why it is correct, not just repeating what can be seen from
> >> >> > the diff anyway.
> >> >>
> >> >> Sorry for being lazy here.
> >> >>
> >> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> index c8a0fc6..6688c02 100644
> >> >> >> --- a/hw/virtio.c
> >> >> >> +++ b/hw/virtio.c
> >> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >> >> >>      wmb();
> >> >> >>      trace_virtqueue_flush(vq, count);
> >> >> >>      vring_used_idx_increment(vq, count);
> >> >> >> +    vq->last_avail_idx += count;
> >> >> >>      vq->inuse -= count;
> >> >> >>  }
> >> >> >>
> >> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >>      unsigned int i, head, max;
> >> >> >>      target_phys_addr_t desc_pa = vq->vring.desc;
> >> >> >>
> >> >> >> -    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> >> >> >> +    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> >> >> >>          return 0;
> >> >> >>
> >> >> >>      /* When we start there are none of either input nor output. */
> >> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >>
> >> >> >>      max = vq->vring.num;
> >> >> >>
> >> >> >> -    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> >> >> >> +    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
> >> >> >>
> >> >> >>      if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> >> >> >>          if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
> >> >> >>
> >> >> >
> >> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
> >> >>
> >> >> I think there are two problems.
> >> >>
> >> >> 1. When to update last_avail_idx.
> >> >> 2. The ordering issue you're mentioning below.
> >> >>
> >> >> The patch above is only trying to address 1 because last time you
> >> >> mentioned that modifying last_avail_idx upon save may break the
> >> >> guest, which I agree.  If virtio_queue_empty and
> >> >> virtqueue_avail_bytes are only used internally, meaning invisible
> >> >> to the guest, I guess the approach above can be applied too.
> >> >
> >> > So IMHO 2 is the real issue. This is what was problematic
> >> > with the save patch, otherwise of course changes in save
> >> > are better than changes all over the codebase.
> >>
> >> All right.  Then let's focus on 2 first.
> >>
> >> >> > Previous patch version sure looked simpler, and this seems functionally
> >> >> > equivalent, so my question still stands: here it is rephrased in a
> >> >> > different way:
> >> >> >
> >> >> >        assume that we have in avail ring 2 requests at start of ring: A and B in this order
> >> >> >
> >> >> >        host pops A, then B, then completes B and flushes
> >> >> >
> >> >> >        now with this patch last_avail_idx will be 1, and then
> >> >> >        remote will get it, it will execute B again. As a result
> >> >> >        B will complete twice, and apparently A will never complete.
> >> >> >
> >> >> >
> >> >> > This is what I was saying below: assuming that there are
> >> >> > outstanding requests when we migrate, there is no way
> >> >> > a single index can be enough to figure out which requests
> >> >> > need to be handled and which are in flight already.
> >> >> >
> >> >> > We must add some kind of bitmask to tell us which is which.
> >> >>
> >> >> I should understand why this inversion can happen before solving
> >> >> the issue.
> >> >
> >> > It's a fundamental thing in virtio.
> >> > I think it is currently only likely to happen with block, I think tap
> >> > currently completes things in order.  In any case relying on this in the
> >> > frontend is a mistake.
> >> >
> >> >>  Currently, how are you making virio-net to flush
> >> >> every requests for live migration?  Is it qemu_aio_flush()?
> >> >
> >> > Think so.
> >>
> >> If qemu_aio_flush() is responsible for flushing the outstanding
> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
> >> As I described in the previous message, Kemari queues the
> >> requests first.  So in you example above, it should start with
> >>
> >> virtio-net: last_avai_idx 0 inuse 2
> >> event-tap: {A,B}
> >>
> >> As you know, the requests are still in order still because net
> >> layer initiates in order.  Not about completing.
> >>
> >> In the first synchronization, the status above is transferred.  In
> >> the next synchronization, the status will be as following.
> >>
> >> virtio-net: last_avai_idx 1 inuse 1
> >> event-tap: {B}
> >
> > OK, this answers the ordering question.
> 
> Glad to hear that!
> 
> > Another question: at this point we transfer this status: both
> > event-tap and virtio ring have the command B,
> > so the remote will have:
> >
> > virtio-net: inuse 0
> > event-tap: {B}
> >
> > Is this right? This already seems to be a problem as when B completes
> > inuse will go negative?
> 
> I think state above is wrong.  inuse 0 means there shouldn't be
> any requests in event-tap.  Note that the callback is called only
> when event-tap flushes the requests.
> 
> > Next it seems that the remote virtio will resubmit B to event-tap. The
> > remote will then have:
> >
> > virtio-net: inuse 1
> > event-tap: {B, B}
> >
> > This looks kind of wrong ... will two packets go out?
> 
> No.  Currently, we're just replaying the requests with pio/mmio.
> In the situation above, it should be,
> 
> virtio-net: inuse 1
> event-tap: {B}
> 
> >> Why? Because Kemari flushes the first virtio-net request using
> >> qemu_aio_flush() before each synchronization.  If
> >> qemu_aio_flush() doesn't guarantee the order, what you pointed
> >> should be problematic.  So in the final synchronization, the
> >> state should be,
> >>
> >> virtio-net: last_avai_idx 2 inuse 0
> >> event-tap: {}
> >>
> >> where A,B were completed in order.
> >>
> >> Yoshi
> >
> >
> > It might be better to discuss block because that's where
> > requests can complete out of order.
> 
> It's same as net.  We queue requests and call bdrv_flush per
> sending requests to the block.  So there shouldn't be any
> inversion.
> 
> > So let me see if I understand:
> > - each command passed to event tap is queued by it,
> >  it is not passed directly to the backend
> > - later requests are passed to the backend,
> >  always in the same order that they were submitted
> > - each synchronization point flushes all requests
> >  passed to the backend so far
> > - each synchronization transfers all requests not passed to the backend,
> >  to the remote, and they are replayed there
> 
> Correct.
> 
> > Now to analyse this for correctness I am looking at the original patch
> > because it is smaller so easier to analyse and I think it is
> > functionally equivalent, correct me if I am wrong in this.
> 
> So you think decreasing last_avail_idx upon save is better than
> updating it in the callback?

If this is correct, of the two equivalent approaches the one
that only touches save/load seems superiour.

> > So the reason there's no out of order issue is this
> > (and might be a good thing to put in commit log
> > or a comment somewhere):
> 
> I've done some in the latest patch.  Please point it out if it
> wasn't enough.
> 
> > At point of save callback event tap has flushed commands
> > passed to the backend already. Thus at the point of
> > the save callback if a command has completed
> > all previous commands have been flushed and completed.
> >
> >
> > Therefore inuse is
> > in fact the # of requests passed to event tap but not yet
> > passed to the backend (for non-event tap case all commands are
> > passed to the backend immediately and because of this
> > inuse is 0) and these are the last inuse commands submitted.
> >
> >
> > Right?
> 
> Yep.
> 
> > Now a question:
> >
> > When we pass last_used_index - inuse to the remote,
> > the remote virtio will resubmit the request.
> > Since request is also passed by event tap, we get
> > the request twice, why is this not a problem?
> 
> It's not a problem because event-tap currently replays with
> pio/mmio only, as I mentioned above.  Although event-tap receives
> information about the queued requests, it won't pass it to the
> backend.  The reason is the problem in setting the callbacks
> which are specific to devices on the secondary.  These are
> pointers, and even worse, are usually static functions, which
> event-tap has no way to restore it upon failover.  I do want to
> change event-tap replay to be this way in the future, pio/mmio
> replay is implemented for now.
> 
> Thanks,
> 
> Yoshi
> 
> >
> >
> >> >
> >> >> >
> >> >> >> >
> >> >> >> >>
> >> >> >> >>>  I'm wondering why
> >> >> >> >>> last_avail_idx is OK to send but not inuse.
> >> >> >> >>
> >> >> >> >> last_avail_idx is at some level a mistake, it exposes part of
> >> >> >> >> our internal implementation, but it does *also* express
> >> >> >> >> a guest observable state.
> >> >> >> >>
> >> >> >> >> Here's the problem that it solves: just looking at the rings in virtio
> >> >> >> >> there is no way to detect that a specific request has already been
> >> >> >> >> completed. And the protocol forbids completing the same request twice.
> >> >> >> >>
> >> >> >> >> Our implementation always starts processing the requests
> >> >> >> >> in order, and since we flush outstanding requests
> >> >> >> >> before save, it works to just tell the remote 'process only requests
> >> >> >> >> after this place'.
> >> >> >> >>
> >> >> >> >> But there's no such requirement in the virtio protocol,
> >> >> >> >> so to be really generic we could add a bitmask of valid avail
> >> >> >> >> ring entries that did not complete yet. This would be
> >> >> >> >> the exact representation of the guest observable state.
> >> >> >> >> In practice we have rings of up to 512 entries.
> >> >> >> >> That's 64 byte per ring, not a lot at all.
> >> >> >> >>
> >> >> >> >> However, if we ever do change the protocol to send the bitmask,
> >> >> >> >> we would need some code to resubmit requests
> >> >> >> >> out of order, so it's not trivial.
> >> >> >> >>
> >> >> >> >> Another minor mistake with last_avail_idx is that it has
> >> >> >> >> some redundancy: the high bits in the index
> >> >> >> >> (> vq size) are not necessary as they can be
> >> >> >> >> got from avail idx.  There's a consistency check
> >> >> >> >> in load but we really should try to use formats
> >> >> >> >> that are always consistent.
> >> >> >> >>
> >> >> >> >>> The following patch does the same thing as original, yet
> >> >> >> >>> keeps the format of the virtio.  It shouldn't break live
> >> >> >> >>> migration either because inuse should be 0.
> >> >> >> >>>
> >> >> >> >>> Yoshi
> >> >> >> >>
> >> >> >> >> Question is, can you flush to make inuse 0 in kemari too?
> >> >> >> >> And if not, how do you handle the fact that some requests
> >> >> >> >> are in flight on the primary?
> >> >> >> >
> >> >> >> > Although we try flushing requests one by one making inuse 0,
> >> >> >> > there are cases when it failovers to the secondary when inuse
> >> >> >> > isn't 0.  We handle these in flight request on the primary by
> >> >> >> > replaying on the secondary.
> >> >> >> >
> >> >> >> >>
> >> >> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> >>> index c8a0fc6..875c7ca 100644
> >> >> >> >>> --- a/hw/virtio.c
> >> >> >> >>> +++ b/hw/virtio.c
> >> >> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >> >> >>>      qemu_put_be32(f, i);
> >> >> >> >>>
> >> >> >> >>>      for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> >> >> >> >>> +        uint16_t last_avail_idx;
> >> >> >> >>> +
> >> >> >> >>>          if (vdev->vq[i].vring.num == 0)
> >> >> >> >>>              break;
> >> >> >> >>>
> >> >> >> >>> +        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> >> >> >> >>> +
> >> >> >> >>>          qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >> >> >>>          qemu_put_be64(f, vdev->vq[i].pa);
> >> >> >> >>> -        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >> >>> +        qemu_put_be16s(f, &last_avail_idx);
> >> >> >> >>>          if (vdev->binding->save_queue)
> >> >> >> >>>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >> >> >>>      }
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >> This looks wrong to me.  Requests can complete in any order, can they
> >> >> >> >> not?  So if request 0 did not complete and request 1 did not,
> >> >> >> >> you send avail - inuse and on the secondary you will process and
> >> >> >> >> complete request 1 the second time, crashing the guest.
> >> >> >> >
> >> >> >> > In case of Kemari, no.  We sit between devices and net/block, and
> >> >> >> > queue the requests.  After completing each transaction, we flush
> >> >> >> > the requests one by one.  So there won't be completion inversion,
> >> >> >> > and therefore won't be visible to the guest.
> >> >> >> >
> >> >> >> > Yoshi
> >> >> >> >
> >> >> >> >>
> >> >> >> >>>
> >> >> >> >>> >
> >> >> >> >>> >> >
> >> >> >> >>> >> >> ---
> >> >> >> >>> >> >>  hw/virtio.c |    8 +++++++-
> >> >> >> >>> >> >>  1 files changed, 7 insertions(+), 1 deletions(-)
> >> >> >> >>> >> >>
> >> >> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> >>> >> >> index 849a60f..5509644 100644
> >> >> >> >>> >> >> --- a/hw/virtio.c
> >> >> >> >>> >> >> +++ b/hw/virtio.c
> >> >> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >> >> >> >>> >> >>      VRing vring;
> >> >> >> >>> >> >>      target_phys_addr_t pa;
> >> >> >> >>> >> >>      uint16_t last_avail_idx;
> >> >> >> >>> >> >> -    int inuse;
> >> >> >> >>> >> >> +    uint16_t inuse;
> >> >> >> >>> >> >>      uint16_t vector;
> >> >> >> >>> >> >>      void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >> >> >> >>> >> >>      VirtIODevice *vdev;
> >> >> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >> >> >>> >> >>          qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >> >> >>> >> >>          qemu_put_be64(f, vdev->vq[i].pa);
> >> >> >> >>> >> >>          qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >> >>> >> >> +        qemu_put_be16s(f, &vdev->vq[i].inuse);
> >> >> >> >>> >> >>          if (vdev->binding->save_queue)
> >> >> >> >>> >> >>              vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >> >> >>> >> >>      }
> >> >> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> >> >> >> >>> >> >>          vdev->vq[i].vring.num = qemu_get_be32(f);
> >> >> >> >>> >> >>          vdev->vq[i].pa = qemu_get_be64(f);
> >> >> >> >>> >> >>          qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >> >>> >> >> +        qemu_get_be16s(f, &vdev->vq[i].inuse);
> >> >> >> >>> >> >> +
> >> >> >> >>> >> >> +        /* revert last_avail_idx if there are outstanding emulation. */
> >> >> >> >>> >> >> +        vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> >> >> >> >>> >> >> +        vdev->vq[i].inuse = 0;
> >> >> >> >>> >> >>
> >> >> >> >>> >> >>          if (vdev->vq[i].pa) {
> >> >> >> >>> >> >>              virtqueue_init(&vdev->vq[i]);
> >> >> >> >>> >> >> --
> >> >> >> >>> >> >> 1.7.1.2
> >> >> >> >>> >> >>
> >> >> >> >>> >> >> --
> >> >> >> >>> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> >> >>> >> >> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >> >> >> >>> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >> >>> >> > --
> >> >> >> >>> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> >> >>> >> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> >> >> >> >>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >> >>> >> >
> >> >> >> >>> > --
> >> >> >> >>> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> >> >>> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> >> >> >> >>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >> >>> >
> >> >> >> >> --
> >> >> >> >> To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> >> >> the body of a message to majordomo@xxxxxxxxxxxxxxx
> >> >> >> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >> >>
> >> >> >> >
> >> >> > --
> >> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> >> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> >> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >> >
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> >> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe kvm" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html