Re: Mask bit support's API

On Thu, Dec 02, 2010 at 03:09:43PM +0200, Avi Kivity wrote:
> On 12/01/2010 04:36 AM, Yang, Sheng wrote:
> >On Tuesday 30 November 2010 22:15:29 Avi Kivity wrote:
> >>  On 11/26/2010 04:35 AM, Yang, Sheng wrote:
> >>  >  >   >   Shouldn't kvm also service reads from the pending bitmask?
> >>  >  >
> >>  >  >   Of course KVM should service reads from the pending bitmask. For
> >>  >  >   assigned devices, it's the kernel that sets the pending bit; but I'm
> >>  >  >   not sure about virtio. This interface is GET_ENTRY, so reading is fine
> >>  >  >   with it.
> >>
> >>  The kernel should manage it in the same way.  Virtio raises irq (via
> >>  KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit.
> >>
> >>  Note we need to be able to read and write the pending bitmask for live
> >>  migration.
> >
> >Then it seems we still need a write interface for it. And I think we can work
> >on it later since it has no user now.
> 
> I'm not sure.  Suppose a guest starts using pending bits.  Does it
> mean it will not work with kernel msix acceleration?
> 
> If that is the case, then we need pending bit support now.  I don't
> want to knowingly merge something that doesn't conform to the spec,
> forcing users to choose whether they want to enable kernel msix
> acceleration or not.
> 
> 
> 
> >>
> >>  Because we have an interface where you get an exit if (addr % 4) < 3 and
> >>  don't get an exit if (addr % 4) == 3.  There is a gpa range which is
> >>  partially maintained by the kernel and partially in userspace.  It's a
> >>  confusing interface.  Things like 64-bit reads or writes need to be
> >>  broken up and serviced in two different places.
> >>
> >>  We already need to support this (for unaligned writes which hit two
> >>  regions), but let's at least make a contiguous region behave sanely.
> >
> >Oh, I didn't mean to handle this kind of unaligned write in userspace. They're
> >illegal according to the PCI spec (the result is undefined). I would cover all
> >contiguous writes (32-bit and 64-bit) in the next version, and discard all
> >illegal writes. And 64-bit accesses would be broken up in qemu as you said, as
> >they are currently.
> >
> >In fact I think we can hand all 64-bit data to qemu, because qemu still has to
> >sync the mask bit with the kernel, which makes the mask bit write in userspace
> >redundant, so it can be ignored.
> 
> What about reads?  Split those as well?
> 
> >>  The reason is to keep a sane interface.  Like we emulate instructions
> >>  and msrs in the kernel and don't do half a job.  I don't think there's a
> >>  real need to accelerate the first three words of an msi-x entry.
> >
> >Here is the case we've observed with Xen. It can only be reproduced by large scale
> >testing. When the interrupt intensity is very high, even new kernels try to
> >lower it, so they access the mask bit very frequently. And in the kernel,
> >msi_set_mask_bit() looks like this:
> >
> >static void msi_set_mask_bit(struct irq_data *data, u32 flag)
> >{
> >         struct msi_desc *desc = irq_data_get_msi(data);
> >
> >         if (desc->msi_attrib.is_msix) {
> >                 msix_mask_irq(desc, flag);
> >                 readl(desc->mask_base);         /* Flush write to device */
> >         } else {
> >                 unsigned offset = data->irq - desc->dev->irq;
> >                 msi_mask_irq(desc, 1 << offset, flag << offset);
> >         }
> >}
> >
> >That flush by readl() accompanies each MSI-X mask write. That is the only
> >place we want to accelerate; otherwise it would still fall back to qemu each
> >time the guest wants to mask the MSI-X entry.
> 
> So it seems we do want to accelerate reads to the entire entry.
> This seems to give more weight to making the kernel own the entire
> entry.
> 
> >>
> >>  >  And BTW, we can treat the routing table as a kind of *cache*: if the
> >>  >  content is in the cache, we fetch it from there; otherwise we go back
> >>  >  and fetch it from memory (userspace).
> >>
> >>  If it's guaranteed by the spec that addr/data pairs are always
> >>  interpreted in the same way, sure.  But there's no reason to do it,
> >>  really, it isn't a fast path.
> >
> >I don't quite understand the first sentence... But it is a fast path, in the
> >case I showed above.
> 
> Which case?  the readl() doesn't need access to the routing table,
> just the entry.

One thing a read should do is flush in the outstanding
interrupts and flush out the mask bit writes.
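To spell out what I mean by those flush semantics, here is a minimal
self-contained sketch (all struct and function names are invented for
illustration, this is not kvm code): a read of an entry's vector control
dword first commits any buffered mask-bit write, then delivers any
interrupt that arrived while the entry was masked.

```c
#include <stdint.h>

struct msix_entry_state {
	uint32_t vector_ctrl;   /* bit 0 = mask bit, per the MSI-X spec */
	uint32_t pending_write; /* buffered guest write, not yet applied */
	int      has_write;
	int      irq_pending;   /* interrupt arrived while masked */
	int      irqs_delivered;
};

/* Flush out: apply the buffered mask-bit write, if any. */
static void msix_commit_write(struct msix_entry_state *e)
{
	if (e->has_write) {
		e->vector_ctrl = e->pending_write;
		e->has_write = 0;
	}
}

/* Service a read of the vector control dword. */
static uint32_t msix_vector_ctrl_read(struct msix_entry_state *e)
{
	msix_commit_write(e);
	/* Flush in: if the entry is now unmasked and an interrupt is
	 * outstanding, deliver it before completing the read. */
	if (e->irq_pending && !(e->vector_ctrl & 1)) {
		e->irqs_delivered++;
		e->irq_pending = 0;
	}
	return e->vector_ctrl;
}
```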

> Oh, I think there is a terminology problem: I was talking about
> kvm's irq routing table; you were talking about the msix entries.
> 
> I think treating it as a cache causes more problems, since there are
> now two paths for reads (in cache and not in cache) and more things
> for writes to manage.
> 
> Here's my proposed API:
> 
> KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa,
> pending_bitmap_base_gpa)
> 
>  - called when the guest enables msix

I would add virtual addresses so that we can use swappable memory to
store the state.
If we do, maybe we can just keep the table there and then
KVM_SET/GET_MSIX_ENTRY and the new exit won't be needed?
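To make that concrete, here is a rough sketch of what the
KVM_DEFINE_MSIX_TABLE argument could look like with the host-virtual
alias added. None of these struct names, fields, or layouts exist in
the kvm ABI; they only illustrate the shape of the proposal.

```c
#include <stdint.h>

struct kvm_msix_table_def {
	uint32_t table_id;        /* identifies this MSI-X table */
	uint32_t nr_entries;      /* number of 16-byte table entries */
	uint64_t msix_base_gpa;   /* guest-physical base of the table */
	uint64_t pba_base_gpa;    /* guest-physical base of the pending
				   * bit array */
	uint64_t msix_base_hva;   /* host-virtual alias, so the state can
				   * live in swappable userspace memory */
};

/* Each MSI-X table entry is 16 bytes; the vector control dword (which
 * holds the mask bit) sits at offset 12 within the entry. */
static uint64_t vector_ctrl_gpa(const struct kvm_msix_table_def *t,
				uint32_t entry)
{
	return t->msix_base_gpa + (uint64_t)entry * 16 + 12;
}
```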

> KVM_REMOVE_MSIX_TABLE(table_id)
> 
>   - called when the guest disables msix
> 
> KVM_SET_MSIX_ENTRY(table_id, entry_id, contents)
> 
>   - called when the guest enables msix (to initialize it), or after
> live migration

What is entry_id here?

> KVM_GET_MSIX_ENTRY(table_id, entry_id, contents)
> 
>   - called when the guest updates an msix table (triggered by
> KVM_EXIT_MSIX_ENTRY_UPDATED)
> 
> KVM_RUN -> KVM_EXIT_MSIX_ENTRY_UPDATED
> 
>   - returned from KVM_RUN after the guest touches the first three
> words of an msix entry
> 
> KVM_SET_MSIX_ENTRY_IRQ(table_id, entry_id, msix_irq_desc)
> 
> msix_irq_desc could describe an assigned device irq, or gsi that is
> mapped to an msix interrupt in the kvm routing table.
> 
> Michael?  I think that should work for virtio and vfio assigned
> devices?  Not sure about pending bits.

Pending bits must be tracked in kernel, but I don't see
how we can support polling mode if we don't exit to userspace
on pending bit reads.

This does mean that some reads will be fast and some will be
slow, and it's a bit sad that we seem to be optimizing
for specific guests, but I just can't come up with
anything better.

So maybe just add an ioctl to get and to clear pending bits.
Maybe set for symmetry.
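Such an ioctl pair could operate on a plain bitmap mirroring the PBA
layout the guest sees. A minimal sketch of the kernel-side tracking,
with all names invented (not kvm code):

```c
#include <stdint.h>

#define MSIX_MAX_ENTRIES 2048	/* PCI spec allows up to 2048 vectors */

struct msix_pending {
	uint64_t bits[MSIX_MAX_ENTRIES / 64];
};

/* Mark an entry's interrupt pending (e.g. irq raised while masked). */
static void pba_set(struct msix_pending *p, unsigned entry)
{
	p->bits[entry / 64] |= 1ULL << (entry % 64);
}

/* What a "get and clear" ioctl would do per entry: report whether the
 * bit was set, and clear it atomically with the report. */
static int pba_test_and_clear(struct msix_pending *p, unsigned entry)
{
	uint64_t mask = 1ULL << (entry % 64);
	int was_set = !!(p->bits[entry / 64] & mask);

	p->bits[entry / 64] &= ~mask;
	return was_set;
}
```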

> -- 
> error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

