On Thu, 2012-07-26 at 19:34 +0300, Avi Kivity wrote:
> On 07/25/2012 08:03 PM, Alex Williamson wrote:
>
> > +/*
> > + * Resource setup
> > + */
> > +static void vfio_unmap_bar(VFIODevice *vdev, int nr)
> > +{
> > +    VFIOBAR *bar = &vdev->bars[nr];
> > +    uint64_t size;
> > +
> > +    if (!memory_region_size(&bar->mem)) {
> > +        return;
> > +    }

This one is the "slow" mapped MemoryRegion.  If there's nothing here,
the BAR isn't populated.

> > +
> > +    size = memory_region_size(&bar->mmap_mem);
> > +    if (size) {
> > +        memory_region_del_subregion(&bar->mem, &bar->mmap_mem);
> > +        munmap(bar->mmap, size);
> > +    }

This is the direct mapped MemoryRegion that potentially overlays the
"slow" mapping above for MMIO BARs of sufficient alignment.  If the
BAR includes the MSI-X vector table, this maps the region in front of
the table.

> > +
> > +    if (vdev->msix && vdev->msix->table_bar == nr) {
> > +        size = memory_region_size(&vdev->msix->mmap_mem);
> > +        if (size) {
> > +            memory_region_del_subregion(&bar->mem, &vdev->msix->mmap_mem);
> > +            munmap(vdev->msix->mmap, size);
> > +        }
> > +    }

And this one potentially unmaps the overlap after the vector table if
there's any space for one.

> Are the three size checks needed?  Everything should work without
> them from the memory core point of view.

I haven't tried, but I strongly suspect I shouldn't be munmap'ing
NULL... no?

> > +
> > +    memory_region_destroy(&bar->mem);
> > +}
> > +
> > +static int vfio_mmap_bar(VFIOBAR *bar, MemoryRegion *mem, MemoryRegion *submem,
> > +                         void **map, size_t size, off_t offset,
> > +                         const char *name)
> > +{
> > +    *map = mmap(NULL, size, PROT_READ | PROT_WRITE,
> > +                MAP_SHARED, bar->fd, bar->fd_offset + offset);
> > +    if (*map == MAP_FAILED) {
> > +        *map = NULL;
> > +        return -1;
> > +    }
> > +
> > +    memory_region_init_ram_ptr(submem, name, size, *map);
> > +    memory_region_add_subregion(mem, offset, submem);
> > +
> > +    return 0;
> > +}
> > +
> > +static void vfio_map_bar(VFIODevice *vdev, int nr, uint8_t type)
> > +{
> > +    VFIOBAR *bar = &vdev->bars[nr];
> > +    unsigned size = bar->size;
> > +    char name[64];
> > +
> > +    sprintf(name, "VFIO %04x:%02x:%02x.%x BAR %d", vdev->host.domain,
> > +            vdev->host.bus, vdev->host.slot, vdev->host.function, nr);
> > +
> > +    /* A "slow" read/write mapping underlies all BARs */
> > +    memory_region_init_io(&bar->mem, &vfio_bar_ops, bar, name, size);
> > +    pci_register_bar(&vdev->pdev, nr, type, &bar->mem);

> So far all container BARs have been pure containers, without RAM or
> I/O callbacks.  It should all work, but this sets precedent and
> requires it to work.  I guess there's no problem supporting it
> though.

KVM device assignment already makes use of this as well, if I
understand correctly.

> > +
> > +    if (type & PCI_BASE_ADDRESS_SPACE_IO) {
> > +        return; /* IO space is only slow, don't expect high perf here */
> > +    }

> What about non-x86 where IO is actually memory?  I think you can
> drop this and let the address space filtering in the listener drop
> it if it turns out to be in IO space.

They're probably saying "What's I/O port space?" ;)  Yeah, there may
be some room to do more here, but no need until we have something
that can make use of it.  Note that these are the BAR mappings, which
turn into MemoryRegions, so I'm not sure what the listener has to do
with filtering these just yet.

> > +
> > +    if (size & ~TARGET_PAGE_MASK) {
> > +        error_report("%s is too small to mmap, this may affect performance.\n",
> > +                     name);
> > +        return;
> > +    }

> We can work a little harder and align the host space offset with the
> guest space offset, and map it in.

That's actually pretty involved, requiring shifting the device in the
host address space and potentially adjusting port and bridge apertures
to make room for the device.  Not to mention that it assumes accessing
dead space between device regions is no harm, no foul.  True on x86
now, but it wasn't true on HP ia64 chipsets and I suspect some other
platforms.

> > +
> > +    /*
> > +     * We can't mmap areas overlapping the MSIX vector table, so we
> > +     * potentially insert a direct-mapped subregion before and after it.
> > +     */

> This splitting is what the memory core really enjoys.  You can just
> place the MSIX page over the RAM page and let it do the cut-n-paste.

Sure, but VFIO won't allow us to mmap over the MSI-X table for
security reasons.  It might be worthwhile to someday make VFIO insert
an anonymous page over the MSI-X table to allow this, but it didn't
look trivial for my novice mm abilities.  It would be easy to add a
flag to the VFIO kernel structure where we learn about this BAR if we
add that in the future.

> > +    if (vdev->msix && vdev->msix->table_bar == nr) {
> > +        size = vdev->msix->table_offset & TARGET_PAGE_MASK;
> > +    }
> > +
> > +    if (size) {
> > +        strcat(name, " mmap");
> > +        if (vfio_mmap_bar(bar, &bar->mem, &bar->mmap_mem, &bar->mmap,
> > +                          size, 0, name)) {
> > +            error_report("%s Failed. Performance may be slow\n", name);
> > +        }
> > +    }
> > +
> > +    if (vdev->msix && vdev->msix->table_bar == nr) {
> > +        unsigned start;
> > +
> > +        start = TARGET_PAGE_ALIGN(vdev->msix->table_offset +
> > +                                  (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE));
> > +
> > +        if (start < bar->size) {
> > +            size = bar->size - start;
> > +            strcat(name, " msix-hi");
> > +            /* MSIXInfo contains another MemoryRegion for this mapping */
> > +            if (vfio_mmap_bar(bar, &bar->mem, &vdev->msix->mmap_mem,
> > +                              &vdev->msix->mmap, size, start, name)) {
> > +                error_report("%s Failed. Performance may be slow\n", name);
> > +            }
> > +        }
> > +    }
> > +
> > +    return;
> > +}
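To make the page arithmetic concrete, here's a standalone sketch (not
from the patch; the 4KB page size and the BAR geometry are made-up
example values):

#include <stdio.h>

/* Example stand-ins; QEMU defines these per target. */
#define TARGET_PAGE_SIZE     4096
#define TARGET_PAGE_MASK     (~(TARGET_PAGE_SIZE - 1))
#define TARGET_PAGE_ALIGN(x) (((x) + TARGET_PAGE_SIZE - 1) & TARGET_PAGE_MASK)
#define PCI_MSIX_ENTRY_SIZE  16   /* each MSI-X table entry is 16 bytes */

int main(void)
{
    unsigned bar_size     = 0x4000; /* example: 16KB MMIO BAR */
    unsigned table_offset = 0x2000; /* example: MSI-X table at 8KB */
    unsigned entries      = 8;      /* example: 8 MSI-X vectors */

    /* Direct-map window in front of the table: everything below the
     * page containing the first entry. */
    unsigned lo = table_offset & TARGET_PAGE_MASK;

    /* Direct-map window after the table: resumes at the first page
     * boundary past the last entry. */
    unsigned hi = TARGET_PAGE_ALIGN(table_offset +
                                    entries * PCI_MSIX_ENTRY_SIZE);

    printf("mmap [0x0000, 0x%04x)\n", lo);
    printf("slow [0x%04x, 0x%04x) <- vector table page, trapped\n", lo, hi);
    if (hi < bar_size) {
        printf("mmap [0x%04x, 0x%04x)\n", hi, bar_size);
    }
    return 0;
}

With these numbers only the single page holding the vector table stays
on the slow read/write path; accesses to the rest of the BAR hit the
mmap'd subregions directly.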
> > +
> > +
> > +static int __vfio_get_device(VFIOGroup *group,
> > +                             const char *name, VFIODevice *vdev)

> __foo is a reserved symbol.

sigh, ok

> > +{
> > +    int ret;
> > +
> > +    ret = ioctl(group->fd, VFIO_GROUP_GET_DEVICE_FD, name);
> > +    if (ret < 0) {
> > +        error_report("vfio: error getting device %s from group %d: %s",
> > +                     name, group->groupid, strerror(errno));
> > +        error_report("Verify all devices in group %d "
> > +                     "are bound to vfio-pci or pci-stub and not already in use",
> > +                     group->groupid);
> > +        return -1;
> > +    }
> > +
> > +    vdev->group = group;
> > +    QLIST_INSERT_HEAD(&group->device_list, vdev, next);
> > +
> > +    vdev->fd = ret;
> > +
> > +    return 0;
> > +}
> > +
> > +
> > +static Property vfio_pci_dev_properties[] = {
> > +    DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIODevice, host),
> > +    //TODO - support passed fds... is this necessary?

> Yes.

This is actually kind of complicated.  Opening /dev/vfio/vfio gives us
an instance of a container in the kernel.  A group can only be
attached to one container.  So whoever calls us with passed fds needs
to track this very carefully.  This is also why I've dropped any kind
of shared IOMMU option to give us a hint whether to try to cram
everything into the same container (~= iommu domain).  It's too easy
to pass conflicting info to share a container for one device, but not
another... yet they may be in the same group.  I'll work on the fd
passing though and try to come up with a reasonable model.
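For reference, the fd relationships above look something like this
from the caller's side (a sketch, not from the patch; the group number
and device name are made-up, and error handling is omitted):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int get_device_fd(void)
{
    /* Each open of /dev/vfio/vfio is a new container instance... */
    int container = open("/dev/vfio/vfio", O_RDWR);

    /* ...and a group can be attached to only one container. */
    int group = open("/dev/vfio/26", O_RDWR);
    ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

    /* The IOMMU backend is programmed per container, covering every
     * group attached to it. */
    ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

    /* Only then can device fds be requested from the group. */
    return ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
}

Anyone passing us pre-opened fds has to have done this dance already,
which is exactly why conflicting sharing hints across devices in the
same group are so easy to get wrong.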
> > +    //DEFINE_PROP_STRING("vfiofd", VFIODevice, vfiofd_name),
> > +    //DEFINE_PROP_STRING("vfiogroupfd, VFIODevice, vfiogroupfd_name),
> > +    DEFINE_PROP_END_OF_LIST(),
> > +};
> > +
> > +
> > +
> > +typedef struct MSIVector {
> > +    EventNotifier interrupt; /* eventfd triggered on interrupt */
> > +    struct VFIODevice *vdev; /* back pointer to device */
> > +    int vector; /* the vector number for this element */
> > +    int virq; /* KVM irqchip route for Qemu bypass */

> This calls for an abstraction (don't we have a cache where we look
> those up?)

I haven't seen one; pointer?  I tried to follow vhost's lead here.

> > +    bool use;
> > +} MSIVector;
> > +
> > +
> > +typedef struct VFIOContainer {
> > +    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> > +    struct {
> > +        /* enable abstraction to support various iommu backends */
> > +        union {
> > +            MemoryListener listener; /* Used by type1 iommu */
> > +        };

> The usual way is to have a Type1VFIOContainer deriving from
> VFIOContainer and adding a MemoryListener.

Yep, that would work too.  It gets a bit more complicated that way,
though, because we have to know what type the container is going to be
when it's allocated.  This way we can step through possible iommu
types and support the right one.  Eventually there may be more than
one type supported on the same platform (e.g. one that enables PRI).
Doable, but I'm not sure it's worth it at this point.

> > +        void (*release)(struct VFIOContainer *);
> > +    } iommu_data;
> > +    QLIST_HEAD(, VFIOGroup) group_list;
> > +    QLIST_ENTRY(VFIOContainer) next;
> > +} VFIOContainer;
> > +
> > +#endif /* __VFIO_H__ */

> > diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
> > index 5a9d4e3..bd1a76c 100644
> > --- a/linux-headers/linux/kvm.h
> > +++ b/linux-headers/linux/kvm.h

> Separate patch when leaving RFC mode.

Sure, this is still RFC though since the irqfd/eoifd changes are
pending.

> > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > new file mode 100644
> > index 0000000..0a4f180
> > --- /dev/null
> > +++ b/linux-headers/linux/vfio.h
> > @@ -0,0 +1,445 @@
> > +/*
> > + * VFIO API definition
> > + *
> > + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> > + *     Author: Alex Williamson <alex.williamson@xxxxxxxxxx>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +#ifndef VFIO_H
> > +#define VFIO_H
> > +
> > +#include <linux/types.h>
> > +#include <linux/ioctl.h>
> > +
> > +#define VFIO_API_VERSION 0
> > +
> > +#ifdef __KERNEL__ /* Internal VFIO-core/bus driver API */

> Use the exported file, that gets rid of the __KERNEL__ bits.

Oh?  How do I generate that aside from just deleting lines?

Thanks!

Alex