On Sun, 2012-07-29 at 16:47 +0300, Avi Kivity wrote:
> On 07/26/2012 08:40 PM, Alex Williamson wrote:
> > On Thu, 2012-07-26 at 19:34 +0300, Avi Kivity wrote:
> >> On 07/25/2012 08:03 PM, Alex Williamson wrote:
> >>
> >> > +/*
> >> > + * Resource setup
> >> > + */
> >> > +static void vfio_unmap_bar(VFIODevice *vdev, int nr)
> >> > +{
> >> > +    VFIOBAR *bar = &vdev->bars[nr];
> >> > +    uint64_t size;
> >> > +
> >> > +    if (!memory_region_size(&bar->mem)) {
> >> > +        return;
> >> > +    }
> >
> > This one is the "slow" mapped MemoryRegion.  If there's nothing here,
> > the BAR isn't populated.
> >
> >> > +
> >> > +    size = memory_region_size(&bar->mmap_mem);
> >> > +    if (size) {
> >> > +        memory_region_del_subregion(&bar->mem, &bar->mmap_mem);
> >> > +        munmap(bar->mmap, size);
> >> > +    }
> >
> > This is the direct mapped MemoryRegion that potentially overlays the
> > "slow" mapping above for MMIO BARs of sufficient alignment.  If the BAR
> > includes the MSI-X vector table, this maps the region in front of the
> > table.
>
> If the region size is zero, then both memory_region_del_subregion()
> (assuming the region is parented) and munmap() do nothing.  So you could
> call this unconditionally.

I suppose parenting them is the key.  I'm counting on memory_region_size
of zero for an uninitialized, g_malloc0() MemoryRegion.  Initializing
them just to have a parent so we can unconditionally remove them here
seems like it's just shifting complexity from one function to another.
The majority of BARs aren't even implemented, so we'd actually be
setting up a lot of dummy infrastructure for a slightly cleaner unmap
function.  I'll keep looking at this, but I'm not optimistic there's an
overall simplification here.

> >> > +
> >> > +    if (vdev->msix && vdev->msix->table_bar == nr) {
> >> > +        size = memory_region_size(&vdev->msix->mmap_mem);
> >> > +        if (size) {
> >> > +            memory_region_del_subregion(&bar->mem, &vdev->msix->mmap_mem);
> >> > +            munmap(vdev->msix->mmap, size);
> >> > +        }
> >> > +    }
> >
> > And this one potentially unmaps the overlap after the vector table if
> > there's any space for one.
> >
> >> Are the three size checks needed?  Everything should work without them
> >> from the memory core point of view.
> >
> > I haven't tried, but I strongly suspect I shouldn't be munmap'ing
> > NULL... no?
>
> NULL isn't the problem (well some kernels protect against mmaping NULL
> to avoid kernel exploits), but it seems the kernel doesn't like a zero
> length.

In mm/mmap.c:do_munmap() I see:

    if ((len = PAGE_ALIGN(len)) == 0)
        return -EINVAL;

before anything scary happens, so that should be ok.  It's not really
worthwhile to call the munmaps unconditionally though, since the
subregions are unparented and we'd still need the condition tests.

> >> > +
> >> > +static void vfio_map_bar(VFIODevice *vdev, int nr, uint8_t type)
> >> > +{
> >> > +    VFIOBAR *bar = &vdev->bars[nr];
> >> > +    unsigned size = bar->size;
> >> > +    char name[64];
> >> > +
> >> > +    sprintf(name, "VFIO %04x:%02x:%02x.%x BAR %d", vdev->host.domain,
> >> > +            vdev->host.bus, vdev->host.slot, vdev->host.function, nr);
> >> > +
> >> > +    /* A "slow" read/write mapping underlies all BARs */
> >> > +    memory_region_init_io(&bar->mem, &vfio_bar_ops, bar, name, size);
> >> > +    pci_register_bar(&vdev->pdev, nr, type, &bar->mem);
> >>
> >> So far all container BARs have been pure containers, without RAM or I/O
> >> callbacks.  It should all work, but this sets precedent and requires it
> >> to work.  I guess there's no problem supporting it though.
> > KVM device assignment already makes use of this as well, if I understand
> > correctly.
>
> Okay.
>
> >> > +
> >> > +    if (type & PCI_BASE_ADDRESS_SPACE_IO) {
> >> > +        return; /* IO space is only slow, don't expect high perf here */
> >> > +    }
> >>
> >> What about non-x86 where IO is actually memory?  I think you can drop
> >> this and let the address space filtering in the listener drop it if it
> >> turns out to be in IO space.
> >
> > They're probably saying "What's I/O port space?" ;)  Yeah, there may be
> > some room to do more here, but no need until we have something that can
> > make use of it.
>
> Most likely all that is needed is to drop the test.
>
> > Note that these are the BAR mappings, which turn into
> > MemoryRegions, so I'm not sure what the listener has to do with
> > filtering these just yet.
>
> +static bool vfio_listener_skipped_section(MemoryRegionSection *section)
> +{
> +    return (section->address_space != get_system_memory() ||
> +            !memory_region_is_ram(section->mr));
> +}
>
> Or the filter argument to memory_listener_register() (which you use --
> you can drop the first check above).  On x86 those I/O regions will be
> filtered out, on non-x86 with a properly-wired chipset emulation they'll
> be passed to vfio (and kvm).

Ah, I see what you're going for now.  Sure, I can store/test the vfio
region info flags somewhere to tell me whether vfio supports mmap of the
region instead of assuming mem = mmap, io = ops.  I'll re-evaluate
unconditional removal after that.

> >> > +
> >> > +    if (size & ~TARGET_PAGE_MASK) {
> >> > +        error_report("%s is too small to mmap, this may affect performance.\n",
> >> > +                     name);
> >> > +        return;
> >> > +    }
> >>
> >> We can work a little harder and align the host space offset with the
> >> guest space offset, and map it in.
> >
> > That's actually pretty involved, requiring shifting the device in the
> > host address space and potentially adjusting port and bridge apertures
> > to make room for the device.  Not to mention that it assumes accessing
> > dead space between device regions is no harm, no foul.  True on x86 now,
> > but it wasn't true on HP ia64 chipsets and I suspect some other platforms.
>
> Are sub-4k BARs common?  I expect only on older cards.

Correct, mostly older devices.  I don't think I've seen any "high
performance" devices with this problem.

> >> > +
> >> > +    /*
> >> > +     * We can't mmap areas overlapping the MSIX vector table, so we
> >> > +     * potentially insert a direct-mapped subregion before and after it.
> >> > +     */
> >>
> >> This splitting is what the memory core really enjoys.  You can just
> >> place the MSIX page over the RAM page and let it do the cut-n-paste.
> >
> > Sure, but VFIO won't allow us to mmap over the MSI-X table for security
> > reasons.  It might be worthwhile to someday make VFIO insert an
> > anonymous page over the MSI-X table to allow this, but it didn't look
> > trivial for my novice mm abilities.  It would be easy to add a flag to
> > the VFIO kernel structure where we learn about this BAR if we add that
> > in the future.
>
> I meant do it purely in qemu.  Instead of an emulated region overlaid
> by two assigned regions, have an assigned region overlaid by the
> emulated region.  The regions seen by the vfio listener will be the same.

Sure, that's what KVM device assignment does, but it requires being able
to mmap the whole BAR, including the MSI-X table.  The VFIO kernel side
can't assume userspace isn't malicious, so it has to prevent this.
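For anyone following along, the before/after split described in the
quoted comment works out to roughly the following.  This is an
illustration only, not the patch's code: the function and parameter
names are made up, "region_offset" stands for the BAR's offset within
the vfio device fd as reported by VFIO_DEVICE_GET_REGION_INFO, error
handling is omitted, and (judging by the unmap path quoted earlier) the
two mappings end up wrapped in MemoryRegion subregions of the slow BAR
region:

    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void map_around_msix(int device_fd, uint64_t region_offset,
                                uint64_t bar_size, uint64_t table_offset,
                                uint64_t table_size)
    {
        uint64_t page = getpagesize();
        uint64_t below_size, above_start, above_size;
        void *below, *above;

        /* Direct-map everything below the vector table, rounded down to
         * a page boundary; the table itself stays on the slow path. */
        below_size = table_offset & ~(page - 1);
        if (below_size) {
            below = mmap(NULL, below_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, device_fd, region_offset);
            /* "below" gets wrapped in a MemoryRegion subregion */
        }

        /* And everything above it, from the next page boundary up to
         * the end of the BAR. */
        above_start = (table_offset + table_size + page - 1) & ~(page - 1);
        above_size = bar_size > above_start ? bar_size - above_start : 0;
        if (above_size) {
            above = mmap(NULL, above_size, PROT_READ | PROT_WRITE,
                         MAP_SHARED, device_fd, region_offset + above_start);
            /* likewise wrapped and overlaid on the slow mapping */
        }
    }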
> >> > +
> >> > +
> >> > +static Property vfio_pci_dev_properties[] = {
> >> > +    DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIODevice, host),
> >> > +    //TODO - support passed fds... is this necessary?
> >>
> >> Yes.
> >
> > This is actually kind of complicated.  Opening /dev/vfio/vfio gives us
> > an instance of a container in the kernel.  A group can only be attached
> > to one container.  So whoever calls us with passed fds needs to track
> > this very carefully.  This is also why I've dropped any kind of shared
> > IOMMU option to give us a hint whether to try to cram everything in the
> > same container (~= iommu domain).  It's too easy to pass conflicting
> > info to share a container for one device, but not another... yet they
> > may be in the same group.  I'll work on the fd passing though and try to
> > come up with a reasonable model.
>
> I didn't really follow the container stuff so I can't comment here.  But
> suppose all assigned devices are done via fd passing, isn't it
> sufficient to just pass the fd for the device (and keep the iommu group
> fd in the management tool)?

Nope.

    containerfd = open(/dev/vfio/vfio)
    groupfd = open(/dev/vfio/$GROUPID)
    devicefd = ioctl(groupfd, VFIO_GROUP_GET_DEVICE_FD)

(I've appended a spelled-out sketch of this sequence at the end of this
mail.)  The container provides access to the iommu, the group is the
unit of ownership and privilege, and a device cannot be accessed without
iommu protection.  Therefore, to get to a device fd we first need to
privilege the container by attaching a group to it; that lets us
initialize the iommu, which in turn allows us to get the device fd.  At
a minimum we'd need both container and device fds, which means libvirt
would be responsible for determining what type of iommu interface to
initialize.  Doing that makes adding another device tenuous.  It's not
impossible, but VFIO is designed such that /dev/vfio/vfio is completely
harmless on its own, safe for mode 0666 access, just like /dev/kvm.  The
group fd is the important access point, so maybe it's sufficient that
libvirt could pass only that and let qemu open /dev/vfio/vfio on its
own.  The only problem then is that libvirt needs to pass the same group
fd for each device that gets assigned within a group.

> >> > +
> >> > +
> >> > +typedef struct MSIVector {
> >> > +    EventNotifier interrupt; /* eventfd triggered on interrupt */
> >> > +    struct VFIODevice *vdev; /* back pointer to device */
> >> > +    int vector; /* the vector number for this element */
> >> > +    int virq; /* KVM irqchip route for Qemu bypass */
> >>
> >> This calls for an abstraction (don't we have a cache where we look those
> >> up?)
> >
> > I haven't seen one, pointer?  I tried to follow vhost's lead here.
>
> See kvm_irqchip_send_msi().  But this isn't integrated with irqfd yet.

Right, the irqfd is what we're really after.

> >> > +    bool use;
> >> > +} MSIVector;
> >> > +
> >> > +
> >> > +typedef struct VFIOContainer {
> >> > +    int fd; /* /dev/vfio/vfio, empowered by the attached groups */
> >> > +    struct {
> >> > +        /* enable abstraction to support various iommu backends */
> >> > +        union {
> >> > +            MemoryListener listener; /* Used by type1 iommu */
> >> > +        };
> >>
> >> The usual way is to have a Type1VFIOContainer deriving from
> >> VFIOContainer and adding a MemoryListener.
> >
> > Yep, that would work too.  It gets a bit more complicated that way
> > though, because we have to know what type the container is going to be
> > when it's allocated.  This way we can step through possible iommu types
> > and support the right one.  Eventually there may be more than one type
> > supported on the same platform (ex. one that enables PRI).  Do-able, but
> > I'm not sure it's worth it at this point.
>
> An alternative alternative is to put a pointer to an abstract type here,
> then you can defer the decision on the concrete type later.  But I agree
> it's not worth it at this point.  Maybe just drop the union and decide
> later when a second iommu type is added.

A pointer doesn't allow us to use container_of to get back to the
VFIOContainer from the memory listener callback, so we'd have to create
some new struct just to hold that back pointer.  Alexey's proposed POWER
support for VFIO already makes use of the union, so it seems like a
sufficient solution for now.  We'll have to re-evaluate if it's getting
unwieldy after we get a few more, though.
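To spell out the container_of point, here's a trimmed-down sketch of the
arrangement (the listener callback body is illustrative, not the patch's
code):

    typedef struct VFIOContainer {
        int fd;                           /* /dev/vfio/vfio */
        struct {
            union {
                MemoryListener listener;  /* used by the type1 iommu */
                /* other iommu backends would add their state here */
            };
            void (*release)(struct VFIOContainer *);
        } iommu_data;
    } VFIOContainer;

    static void vfio_listener_region_add(MemoryListener *listener,
                                         MemoryRegionSection *section)
    {
        VFIOContainer *container = container_of(listener, VFIOContainer,
                                                iommu_data.listener);

        /* ... VFIO_IOMMU_MAP_DMA on container->fd for this section ... */
    }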
> >> > +        void (*release)(struct VFIOContainer *);
> >> > +    } iommu_data;
> >> > +    QLIST_HEAD(, VFIOGroup) group_list;
> >> > +    QLIST_ENTRY(VFIOContainer) next;
> >> > +} VFIOContainer;
> >> > +
> >> > +
> >> > +#ifdef __KERNEL__ /* Internal VFIO-core/bus driver API */
> >>
> >> Use the exported file, that gets rid of the __KERNEL__ bits.
> >
> > Oh?  How do I generate that aside from just deleting lines?  Thanks!
> >
>
> make headers_install

Thanks!

Alex
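P.S.  To spell out the container/group/device dance from above a little
more, the sequence looks roughly like this.  The group number and device
address are made up, error handling is omitted, and this is meant as an
illustration of the flow rather than code from the patch; it uses the
linux/vfio.h that make headers_install exports:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    int main(void)
    {
        /* Harmless on its own, safe for mode 0666, like /dev/kvm */
        int container = open("/dev/vfio/vfio", O_RDWR);

        /* The unit of ownership and privilege; the group id comes from sysfs */
        int group = open("/dev/vfio/26", O_RDWR);

        /* Privilege the container by attaching the group to it... */
        ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

        /* ...which lets us initialize the iommu backend... */
        ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

        /* ...and only then can we get at a device within the group */
        int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

        return device < 0 ? 1 : 0;
    }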