Re: Mask bit support's API

On 12/01/2010 04:36 AM, Yang, Sheng wrote:
On Tuesday 30 November 2010 22:15:29 Avi Kivity wrote:
>  On 11/26/2010 04:35 AM, Yang, Sheng wrote:
>  >  >  >  Shouldn't kvm also service reads from the pending bitmask?
>  >  >
>  >  >  Of course KVM should service reads from the pending bitmask. For
>  >  >  an assigned device, it's the kernel that sets the pending bit; but I am
>  >  >  not sure about virtio. This interface is GET_ENTRY, so reading is fine
>  >  >  with it.
>
>  The kernel should manage it in the same way.  Virtio raises irq (via
>  KVM_IRQ_LINE or vhost-net's irqfd), kernel sets pending bit.
>
>  Note we need to be able to read and write the pending bitmask for live
>  migration.

Then it seems we still need a write interface for it. And I think we can work
on it later, since it has no users now.

I'm not sure. Suppose a guest starts using pending bits. Does it mean it will not work with kernel msix acceleration?

If that is the case, then we need pending bit support now. I don't want to knowingly merge something that doesn't conform to the spec, forcing users to choose whether they want to enable kernel msix acceleration or not.



>
>  Because we have an interface where you get an exit if (addr % 4) < 3 and
>  don't get an exit if (addr % 4) == 3.  There is a gpa range which is
>  partially maintained by the kernel and partially in userspace.  It's a
>  confusing interface.  Things like 64-bit reads or writes need to be
>  broken up and serviced in two different places.
>
>  We already need to support this (for unaligned writes which hit two
>  regions), but let's at least make a contiguous region behave sanely.

Oh, I didn't mean to handle that kind of unaligned write in userspace. They're
illegal according to the PCI spec (the result is undefined). I would cover all
contiguous writes (32-bit and 64-bit) in the next version, and discard all illegal
writes. And 64-bit accesses would be split in qemu as you said, as they are
currently.

In fact I think we can hand all the data of a 64-bit access to qemu, because qemu
should still sync the mask bit with the kernel, which makes the mask-bit write in
userspace redundant, so it can be ignored.

What about reads?  Split those as well?

>  The reason is to keep a sane interface.  Like we emulate instructions
>  and msrs in the kernel and don't do half a job.  I don't think there's a
>  real need to accelerate the first three words of an msi-x entry.

Here is a case we've observed with Xen. It can only be reproduced by large-scale
testing. When the interrupt intensity is very high, even new kernels would try to
lower it, so they access the mask bit very frequently. And in the kernel,
msi_set_mask_bit() looks like this:

static void msi_set_mask_bit(struct irq_data *data, u32 flag)
{
         struct msi_desc *desc = irq_data_get_msi(data);

         if (desc->msi_attrib.is_msix) {
                 msix_mask_irq(desc, flag);
                 readl(desc->mask_base);         /* Flush write to device */
         } else {
                 unsigned offset = data->irq - desc->dev->irq;
                 msi_mask_irq(desc, 1 << offset, flag << offset);
         }
}

That readl() flush accompanies every MSI-X mask write. That is the only place
we want to accelerate; otherwise it would still fall back to qemu each time the
guest wants to mask an MSI-X entry.

So it seems we do want to accelerate reads to the entire entry. This seems to give more weight to making the kernel own the entire entry.

>
>  >  And BTW, we can take routing table as a kind of *cache*, if the content
>  >  is in the cache, then we can fetch it from the cache, otherwise we need
>  >  to go back to fetch it from memory(userspace).
>
>  If it's guaranteed by the spec that addr/data pairs are always
>  interpreted in the same way, sure.  But there no reason to do it,
>  really, it isn't a fast path.

I don't quite understand the first sentence... But it is a fast path, in the case I
showed above.

Which case? the readl() doesn't need access to the routing table, just the entry.

Oh, I think there is a terminology problem: I was talking about kvm's irq routing table, while you were talking about the msix entries.

I think treating it as a cache causes more problems, since there are now two paths for reads (in cache and not in cache) and more things for writes to manage.

Here's my proposed API:

KVM_DEFINE_MSIX_TABLE(table_id, nr_entries, msix_base_gpa, pending_bitmap_base_gpa)

- called when the guest enables msix

KVM_REMOVE_MSIX_TABLE(table_id)

- called when the guest disables msix

KVM_SET_MSIX_ENTRY(table_id, entry_id, contents)

- called when the guest enables msix (to initialize it), or after live migration

KVM_GET_MSIX_ENTRY(table_id, entry_id, contents)

- called when the guest updates an msix table (triggered by KVM_EXIT_MSIX_ENTRY_UPDATED)

KVM_RUN -> KVM_EXIT_MSIX_ENTRY_UPDATED

- returned from KVM_RUN after the guest touches the first three words of an msix entry

KVM_SET_MSIX_ENTRY_IRQ(table_id, entry_id, msix_irq_desc)

msix_irq_desc could describe an assigned device irq, or gsi that is mapped to an msix interrupt in the kvm routing table.

Michael? I think that should work for virtio and vfio assigned devices? Not sure about pending bits.

--
error compiling committee.c: too many arguments to function


