Re: [PATCH] vfio-pci: Add KVM INTx acceleration

Alex Williamson <alex.williamson@xxxxxxxxxx> · Tue, 16 Oct 2012 08:48:04 -0600

On Tue, 2012-10-16 at 16:14 +0200, Michael S. Tsirkin wrote:
> On Tue, Oct 16, 2012 at 07:51:43AM -0600, Alex Williamson wrote:
> > On Tue, 2012-10-16 at 08:39 +0200, Michael S. Tsirkin wrote:
> > > On Mon, Oct 15, 2012 at 02:28:15PM -0600, Alex Williamson wrote:
> > > > This makes use of the new level irqfd support enabling bypass of
> > > > qemu userspace both on INTx injection and unmask.  This significantly
> > > > boosts the performance of devices making use of legacy interrupts.
> > > > 
> > > > Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > > > ---
> > > > 
> > > > My INTx routing workaround below will probably raise some eyebrows,
> > > > but I don't feel it's worth subjecting users to core dumps if they
> > > > want to try vfio-pci on new platforms.  INTx routing is part of some
> > > > larger plan, but until that plan materializes we have to try to avoid
> > > > the API unless we think there's a good chance it might be there.
> > > > I'll accept the maintenance of updating a whitelist in the interim.
> > > > Thanks,
> > > > 
> > > > Alex
> > > > 
> > > >  hw/vfio_pci.c |  224 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 224 insertions(+)
> > > > 
> > > > diff --git a/hw/vfio_pci.c b/hw/vfio_pci.c
> > > > index 639371e..777a5f8 100644
> > > > --- a/hw/vfio_pci.c
> > > > +++ b/hw/vfio_pci.c
> > > > @@ -154,6 +154,53 @@ static uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len);
> > > >  static void vfio_mmap_set_enabled(VFIODevice *vdev, bool enabled);
> > > >  
> > > >  /*
> > > > + * PCI code refuses to make it possible to probe whether the chipset
> > > > + * supports pci_device_route_intx_to_irq() and booby traps the call
> > > > + * to assert if doesn't.  For us, this is just an optimization, so
> > > > + * only enable it when we know it's present.  Unfortunately PCIBus is
> > > > + * private, so we can't just look at the function pointer.
> > > > + */
> > > > +static bool vfio_pci_bus_has_intx_route(PCIDevice *pdev)
> > > > +{
> > > > +#ifdef CONFIG_KVM
> > > > +    BusState *bus = qdev_get_parent_bus(&pdev->qdev);
> > > > +    DeviceState *dev;
> > > > +
> > > > +    if (!kvm_irqchip_in_kernel() ||
> > > > +        !kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) {
> > > > +	return false;
> > > > +    }
> > > 
> > > 
> > > Shouldn't we update linux-headers/ to get KVM_CAP_IRQFD_RESAMPLE?
> > > Also for KVM_IRQFD_FLAG_RESAMPLE.
> > 
> > I posted the patch for that separately yesterday.  I'll only request a
> > pull once that's in.
> 
> OK missed that. In the future, might be a good idea to note dependencies
> in the patchset or repost them as part of patchset with appropriate
> tagging.
> 
> > > > +
> > > > +    for (; bus->parent; bus = qdev_get_parent_bus(dev)) {
> > > > +
> > > > +        dev = bus->parent;
> > > > +
> > > > +        if (!strncmp("i440FX-pcihost", object_get_typename(OBJECT(dev)), 14)) {
> > > > +            return true;
> > > > +        }
> > > > +    }
> > > > +
> > > > +    error_report("vfio-pci: VM chipset does not support INTx routing, "
> > > > +                 "using slow INTx mode\n");
> > > 
> > > When does this code trigger? It seems irqchip implies piix ATM -
> > > is this just dead code?
> > 
> > Unused, but not unnecessary.  Another chipset is under development,
> > which means very quickly irqchip will not imply piix.
> 
> So this is for purposes of an out of tree stuff, let's
> keep these hacks out of tree too. My guess is
> q35 can just be added with pci_device_route_intx straight away.
> 
> >  Likewise irqfd
> > support is being added to other architectures, so I don't know how long
> > the kvm specific tests will hold up.  Testing for a specific chipset
> > could of course be avoided if we were willing to support:
> > 
> > bool pci_device_intx_route_supported(PCIDevice *pdev)
> > 
> > or the NOROUTE option I posted previously.
> 
> This is just moving the pain to users who will
> get bad performance and will have to debug it. Injecting
> through userspace is slow, new architectures should
> simply add irqfd straight away instead of adding
> work arounds in userspace.
> 
> This is exactly why we have the assert in pci core.

Let's compare user experience:

As coded here:

- Error message, only runs slower if the driver actually uses INTx.
Result: file bug, continue using vfio-pci, maybe do something useful,
maybe find new bugs to file.

Blindly calling PCI code w/ assert:

- Assert.  Result: file bug, full stop.

Maybe I do too much kernel programming, but I don't take asserts lightly
and prefer they be saved for "something is really broken and I can't
safely continue".  This is not such a case.

> > > > +#endif
> > > > +    return false;
> > > > +}
> > > > +
> > > > +static PCIINTxRoute vfio_pci_device_route_intx_to_irq(PCIDevice *pdev, int pin)
> > > > +{
> > > > +    if (!vfio_pci_bus_has_intx_route(pdev)) {
> > > > +        return (PCIINTxRoute) { .mode = PCI_INTX_DISABLED, .irq = -1 };
> > > > +    }
> > > > +
> > > > +    return pci_device_route_intx_to_irq(pdev, pin);
> > > > +}
> > > > +
> > > > +static bool vfio_pci_intx_route_changed(PCIINTxRoute *old, PCIINTxRoute *new)
> > > > +{
> > > > +    return old->mode != new->mode || old->irq != new->irq;
> > > > +}
> > > > +
> > > 
> > > Didn't you add an API for this? It's on pci branch but I can drop
> > > it if not needed.
> > 
> > I did and I'll switch to it when available, but I have no idea when that
> > will be, so I've hedged my bets by re-implementing it here.  2 week+
> > turnover for a patch makes it difficult to coordinate dependent changes
> > on short qemu release cycles.
> 
> It's available on pci branch, please base on that instead of master.
> Yes I merge at about 2 week intervals but the point is using
> subtrees. You do not need to block waiting for stuff to land
> in master.

If it wasn't such a trivial function to re-implement I'd do as you
suggest, but note that I'm still blocked doing a pull request to master
until you get your changes merged.  There are only about 8 weeks of open
qemu 1.3 development, so a 2 week interval doesn't allow many cycles.

> > > > +/*
> > > >   * Common VFIO interrupt disable
> > > >   */
> > > >  static void vfio_disable_irqindex(VFIODevice *vdev, int index)
> > > > @@ -185,6 +232,21 @@ static void vfio_unmask_intx(VFIODevice *vdev)
> > > >      ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> > > >  }
> > > >  
> > > > +#ifdef CONFIG_KVM
> > > > +static void vfio_mask_intx(VFIODevice *vdev)
> > > > +{
> > > > +    struct vfio_irq_set irq_set = {
> > > > +        .argsz = sizeof(irq_set),
> > > > +        .flags = VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_MASK,
> > > > +        .index = VFIO_PCI_INTX_IRQ_INDEX,
> > > > +        .start = 0,
> > > > +        .count = 1,
> > > > +    };
> > > > +
> > > > +    ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
> > > > +}
> > > > +#endif
> > > > +
> > > >  /*
> > > >   * Disabling BAR mmaping can be slow, but toggling it around INTx can
> > > >   * also be a huge overhead.  We try to get the best of both worlds by
> > > > @@ -248,6 +310,161 @@ static void vfio_eoi(VFIODevice *vdev)
> > > >      vfio_unmask_intx(vdev);
> > > >  }
> > > >  
> > > > +static void vfio_enable_intx_kvm(VFIODevice *vdev)
> > > > +{
> > > > +#ifdef CONFIG_KVM
> > > > +    struct kvm_irqfd irqfd = {
> > > > +        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
> > > > +        .gsi = vdev->intx.route.irq,
> > > > +        .flags = KVM_IRQFD_FLAG_RESAMPLE,
> > > 
> > > 
> > > Should not kvm ioctl handling be localized in kvm-all.c?
> > > E.g. extend kvm_irqchip_add_irqfd_notifier in
> > > some way? Same question for KVM_CAP_IRQFD_RESAMPLE use above ...
> > 
> > Possibly, I took a survey of existing code and found examples of both.
> 
> These are bugs really - this prevents adding a device to libhw
> which slows build for everyone. So pls do not copy such examples.
> 
> > Therefore I decided to host it myself first and push it out to common
> > helpers later, which will also help me get rid of the #ifdefs here.
> > Thanks,
> > 
> > Alex
> 
> I think it is better to put it in the final place straight away.

Timing doesn't readily allow for that.  The qemu 1.3 soft freeze is only
about 2 weeks away and I'm on vacation for most of that time.  Adding it
locally has precedence, provides acceleration to users now, and adds an
in-tree user when I start splitting it out later.  Given my history, I'm
just as likely to get push back for adding an API without a user as I am
for adding a user prior to the API.  Thanks,

Alex

> > > > +    };
> > > > +    struct vfio_irq_set *irq_set;
> > > > +    int ret, argsz;
> > > > +    int32_t *pfd;
> > > > +
> > > > +    if (!kvm_irqchip_in_kernel() ||
> > > > +        vdev->intx.route.mode != PCI_INTX_ENABLED ||
> > > > +        !kvm_check_extension(kvm_state, KVM_CAP_IRQFD_RESAMPLE)) {
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    /* Get to a known interrupt state */
> > > > +    qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
> > > > +    vfio_mask_intx(vdev);
> > > > +    vdev->intx.pending = false;
> > > > +    qemu_set_irq(vdev->pdev.irq[vdev->intx.pin], 0);
> > > > +
> > > > +    /* Get an eventfd for resample/unmask */
> > > > +    if (event_notifier_init(&vdev->intx.unmask, 0)) {
> > > > +        error_report("vfio: Error: event_notifier_init failed eoi\n");
> > > > +        goto fail;
> > > > +    }
> > > > +
> > > > +    /* KVM triggers it, VFIO listens for it */
> > > > +    irqfd.resamplefd = event_notifier_get_fd(&vdev->intx.unmask);
> > > > +
> > > > +    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
> > > > +        error_report("vfio: Error: Failed to setup resample irqfd: %m\n");
> > > > +        goto fail_irqfd;
> > > > +    }
> > > > +
> > > > +    argsz = sizeof(*irq_set) + sizeof(*pfd);
> > > > +
> > > > +    irq_set = g_malloc0(argsz);
> > > > +    irq_set->argsz = argsz;
> > > > +    irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_UNMASK;
> > > > +    irq_set->index = VFIO_PCI_INTX_IRQ_INDEX;
> > > > +    irq_set->start = 0;
> > > > +    irq_set->count = 1;
> > > > +    pfd = (int32_t *)&irq_set->data;
> > > > +
> > > > +    *pfd = irqfd.resamplefd;
> > > > +
> > > > +    ret = ioctl(vdev->fd, VFIO_DEVICE_SET_IRQS, irq_set);
> > > > +    g_free(irq_set);
> > > > +    if (ret) {
> > > > +        error_report("vfio: Error: Failed to setup INTx unmask fd: %m\n");
> > > > +        goto fail_vfio;
> > > > +    }
> > > > +
> > > > +    /* Let'em rip */
> > > > +    vfio_unmask_intx(vdev);
> > > > +
> > > > +    vdev->intx.kvm_accel = true;
> > > > +
> > > > +    DPRINTF("%s(%04x:%02x:%02x.%x) KVM INTx accel enabled\n",
> > > > +            __func__, vdev->host.domain, vdev->host.bus,
> > > > +            vdev->host.slot, vdev->host.function);
> > > > +
> > > > +    return;
> > > > +
> > > > +fail_vfio:
> > > > +    irqfd.flags = KVM_IRQFD_FLAG_DEASSIGN;
> > > > +    kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
> > > > +fail_irqfd:
> > > > +    event_notifier_cleanup(&vdev->intx.unmask);
> > > > +fail:
> > > > +    qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
> > > > +    vfio_unmask_intx(vdev);
> > > > +#endif
> > > > +}
> > > > +
> > > > +static void vfio_disable_intx_kvm(VFIODevice *vdev)
> > > > +{
> > > > +#ifdef CONFIG_KVM
> > > > +    struct kvm_irqfd irqfd = {
> > > > +        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
> > > > +        .gsi = vdev->intx.route.irq,
> > > > +        .flags = KVM_IRQFD_FLAG_DEASSIGN,
> > > > +    };
> > > > +
> > > > +    if (!vdev->intx.kvm_accel) {
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    /*
> > > > +     * Get to a known state, hardware masked, QEMU ready to accept new
> > > > +     * interrupts, QEMU IRQ de-asserted.
> > > > +     */
> > > > +    vfio_mask_intx(vdev);
> > > > +    vdev->intx.pending = false;
> > > > +    qemu_set_irq(vdev->pdev.irq[vdev->intx.pin], 0);
> > > > +
> > > > +    /* Tell KVM to stop listening for an INTx irqfd */
> > > > +    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
> > > > +        error_report("vfio: Error: Failed to disable INTx irqfd: %m\n");
> > > > +    }
> > > > +
> > > > +    /* We only need to close the eventfd for VFIO to cleanup the kernel side */
> > > > +    event_notifier_cleanup(&vdev->intx.unmask);
> > > > +
> > > > +    /* QEMU starts listening for interrupt events. */
> > > > +    qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
> > > > +
> > > > +    vdev->intx.kvm_accel = false;
> > > > +
> > > > +    /* If we've missed an event, let it re-fire through QEMU */
> > > > +    vfio_unmask_intx(vdev);
> > > > +
> > > > +    DPRINTF("%s(%04x:%02x:%02x.%x) KVM INTx accel disabled\n",
> > > > +            __func__, vdev->host.domain, vdev->host.bus,
> > > > +            vdev->host.slot, vdev->host.function);
> > > > +#endif
> > > > +}
> > > > +
> > > > +static void vfio_update_irq(PCIDevice *pdev)
> > > > +{
> > > > +    VFIODevice *vdev = DO_UPCAST(VFIODevice, pdev, pdev);
> > > > +    PCIINTxRoute route;
> > > > +
> > > > +    if (vdev->interrupt != VFIO_INT_INTx) {
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    route = vfio_pci_device_route_intx_to_irq(&vdev->pdev, vdev->intx.pin);
> > > > +
> > > > +    if (!vfio_pci_intx_route_changed(&vdev->intx.route, &route)) {
> > > > +        return; /* Nothing changed */
> > > > +    }
> > > > +
> > > > +    DPRINTF("%s(%04x:%02x:%02x.%x) IRQ moved %d -> %d\n", __func__,
> > > > +            vdev->host.domain, vdev->host.bus, vdev->host.slot,
> > > > +            vdev->host.function, vdev->intx.route.irq, route.irq);
> > > > +
> > > > +    vfio_disable_intx_kvm(vdev);
> > > > +
> > > > +    vdev->intx.route = route;
> > > > +
> > > > +    if (route.mode != PCI_INTX_ENABLED) {
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    vfio_enable_intx_kvm(vdev);
> > > > +
> > > > +    /* Re-enable the interrupt in cased we missed an EOI */
> > > > +    vfio_eoi(vdev);
> > > > +}
> > > > +
> > > >  static int vfio_enable_intx(VFIODevice *vdev)
> > > >  {
> > > >      uint8_t pin = vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1);
> > > > @@ -262,6 +479,9 @@ static int vfio_enable_intx(VFIODevice *vdev)
> > > >      vfio_disable_interrupts(vdev);
> > > >  
> > > >      vdev->intx.pin = pin - 1; /* Pin A (1) -> irq[0] */
> > > > +    vdev->intx.route = vfio_pci_device_route_intx_to_irq(&vdev->pdev,
> > > > +                                                         vdev->intx.pin);
> > > > +
> > > >      ret = event_notifier_init(&vdev->intx.interrupt, 0);
> > > >      if (ret) {
> > > >          error_report("vfio: Error: event_notifier_init failed\n");
> > > > @@ -290,6 +510,8 @@ static int vfio_enable_intx(VFIODevice *vdev)
> > > >          return -errno;
> > > >      }
> > > >  
> > > > +    vfio_enable_intx_kvm(vdev);
> > > > +
> > > >      vdev->interrupt = VFIO_INT_INTx;
> > > >  
> > > >      DPRINTF("%s(%04x:%02x:%02x.%x)\n", __func__, vdev->host.domain,
> > > > @@ -303,6 +525,7 @@ static void vfio_disable_intx(VFIODevice *vdev)
> > > >      int fd;
> > > >  
> > > >      qemu_del_timer(vdev->intx.mmap_timer);
> > > > +    vfio_disable_intx_kvm(vdev);
> > > >      vfio_disable_irqindex(vdev, VFIO_PCI_INTX_IRQ_INDEX);
> > > >      vdev->intx.pending = false;
> > > >      qemu_set_irq(vdev->pdev.irq[vdev->intx.pin], 0);
> > > > @@ -1870,6 +2093,7 @@ static int vfio_initfn(PCIDevice *pdev)
> > > >      if (vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1)) {
> > > >          vdev->intx.mmap_timer = qemu_new_timer_ms(vm_clock,
> > > >                                                    vfio_intx_mmap_enable, vdev);
> > > > +        pci_device_set_intx_routing_notifier(&vdev->pdev, vfio_update_irq);
> > > >          ret = vfio_enable_intx(vdev);
> > > >          if (ret) {
> > > >              goto out_teardown;
> > 
> > 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html