Re: [RFC PATCH] vfio: VFIO Driver core framework

Alex Williamson <alex.williamson@xxxxxxxxxx> · Thu, 17 Nov 2011 13:22:17 -0700

On Wed, 2011-11-16 at 11:52 -0500, Konrad Rzeszutek Wilk wrote:
> On Fri, Nov 11, 2011 at 03:10:56PM -0700, Alex Williamson wrote:
<snip>
> > > > +
> > > > +Regions are described by a struct vfio_region_info, which is retrieved by
> > > > +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to
> > > > +the desired region (0 based index).  Note that devices may implement zero
> > > > 
> > > +sized regions (vfio-pci does this to provide a 1:1 BAR to region index
> > > > +mapping).
> > > 
> > > Huh?
> > 
> > PCI has the following static mapping:
> > 
> > enum {
> >         VFIO_PCI_BAR0_REGION_INDEX,
> >         VFIO_PCI_BAR1_REGION_INDEX,
> >         VFIO_PCI_BAR2_REGION_INDEX,
> >         VFIO_PCI_BAR3_REGION_INDEX,
> >         VFIO_PCI_BAR4_REGION_INDEX,
> >         VFIO_PCI_BAR5_REGION_INDEX,
> >         VFIO_PCI_ROM_REGION_INDEX,
> >         VFIO_PCI_CONFIG_REGION_INDEX,
> >         VFIO_PCI_NUM_REGIONS
> > };
> > 
> > So 8 regions are always reported regardless of whether the device
> > implements all the BARs and the ROM.  Then we have a fixed bar:index
> > mapping so we don't have to create a region_info field to describe the
> > bar number for the index.
> 
> OK. Is that a problem if the real device actually has a zero sized BAR?
> Or is zero sized BAR in PCI spec equal to "disabled, not in use" ? Just
> wondering whether (-1ULL) should be used instead? (Which seems the case
> in QEMU code).

Yes, PCI spec defines that unimplemented BARs are hardwired to zero, so
the sizing operation returns zero for the size.

<snip>
> > > > +struct vfio_irq_info {
> > > > +        __u32   len;            /* length of structure */
> > > > +        __u32   index;          /* IRQ number */
> > > > +        __u32   count;          /* number of individual IRQs */
> > > > +        __u64   flags;
> > > > +#define VFIO_IRQ_INFO_FLAG_LEVEL                (1 << 0)
> > > > +};
> > > > +
> > > > +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> > > > +type to index mapping).
> > > 
> > > I am not really sure what that means.
> > 
> > This is so PCI can expose:
> > 
> > enum {
> >         VFIO_PCI_INTX_IRQ_INDEX,
> >         VFIO_PCI_MSI_IRQ_INDEX,
> >         VFIO_PCI_MSIX_IRQ_INDEX,
> >         VFIO_PCI_NUM_IRQS
> > };
> > 
> > So like regions it always exposes 3 IRQ indexes where count=0 if the
> > device doesn't actually support that type of interrupt.  I just want to
> > spell out that bus drivers have this kind of flexibility.
> 
> I think you should change the comment that  says 'IRQ number', as the
> first thing that comes in my mind is 'GSI' or MSI/MSI-x vector.
> Perhaps '/* index to be used with return value from GET_NUM_IRQS ioctl.
> Order of structures can be unsorted. */

Ah, yes.  I see the confusion.  They can't really be unsorted though,
the user needs some point of reference.  For PCI they will be strictly
ordered.  For Device Tree, I assume there will be a path referencing the
index.  I'll update the doc to clarify.

<snip>
> > > > +
> > > > +When a level triggered interrupt is signaled, the interrupt is masked
> > > > +on the host.  This prevents an unresponsive userspace driver from
> > > > +continuing to interrupt the host system.  After servicing the interrupt,
> > > > +UNMASK_IRQ is used to allow the interrupt to retrigger.  Note that level
> > > > +triggered interrupts implicitly have a count of 1 per index.
> > > 
> > > So they are enabled automatically? Meaning you don't even hav to do
> > > SET_IRQ_EVENTFDS b/c the count is set to 1?
> > 
> > I suppose that should be "no more than 1 per index" (ie. PCI would
> > report a count of 0 for VFIO_PCI_INTX_IRQ_INDEX if the device doesn't
> > support INTx).  I think you might be confusing VFIO_DEVICE_GET_IRQ_INFO
> > which tells how many are available with VFIO_DEVICE_SET_IRQ_EVENTFDS
> > which does the enabling/disabling.  All interrupts are disabled by
> > default because userspace needs to give us a way to signal them via
> > eventfds.  It will be device dependent whether multiple index can be
> > enabled simultaneously.  Hmm, is that another flag on the irq_info
> > struct or do we expect drivers to implicitly have that kind of
> > knowledge?
> 
> Right, that was what I was wondering. Not sure how the PowerPC
> world works with this.

On second thought, I think an exclusive flag isn't appropriate.  VFIO is
not meant to abstract the device to the level that a user could write a
generic "vfio driver".  The user will always need to understand the type
of device, VFIO just provides the conduit to make use of it.  There's
too much left undefined with a simplistic exclusive flag.

<snip>
> > > > +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this.
> > > 
> > > So only level triggered? Hmm, how do I know whether the device is
> > > level or edge? Or is that edge (MSI) can also be unmaked using the
> > > eventfs
> > 
> > Yes, only for level.  Isn't a device going to know what type of
> > interrupt it uses?  MSI masking is PCI specific, not handled by this.
> 
> I certainly hope it knows, but you know buggy drivers do exist.
> 
> What would be the return value if somebody tried to unmask an edge one?
> Should that be documented here? -ENOSPEC?

I would assume EINVAL or EFAULT since the user is providing an invalid
argument/bad address.

> > > > +
> > > > +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> > > > +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD      _IOW(';', 115, int)
> > > > +
> > > > +When supported, as indicated by the device flags, reset the device.
> > > > +
> > > > +#define VFIO_DEVICE_RESET               _IO(';', 116)
> > > 
> > > Does it disable the 'count'? Err, does it disable the IRQ on the
> > > device after this and one should call VFIO_DEVICE_SET_IRQ_EVENTFDS
> > > to set new eventfds? Or does it re-use the eventfds and the device
> > > is enabled after this?
> > 
> > It doesn't affect the interrupt programming.  Should it?
> 
> I would hope not, but I am trying to think of ways one could screw this up.
> Perhaps just saying that - "No need to call VFIO_DEVICE_SET_IRQ_EVENTFDS
> as the kernel (and the device) will retain the interrupt.".

Ok, I added some words around this in the doc.

> .. snip..
> > > I am not really sure what this section purpose is? Could this be part
> > > of the header file or the code? It does not look to be part of the
> > > ioctl API?
> > 
> > We've passed into the "VFIO bus driver API" section of the document, to
> > explain the interaction between vfio-core and vfio bus drivers.
> 
> Perhaps a different file?

The entire file is ~300 lines.  Seems excessive to split.

<snip>
> > > > +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
> > > > +				    struct vfio_device *device)
> > > > +{
> > > > +	BUG_ON(!iommu->domain && device->attached);
> > > 
> > > Whoa. Heavy hammer there.
> > > 
> > > Perhaps WARN_ON as you do check it later on.
> > 
> > I think it's warranted, internal consistency is broken if we have a
> > device that thinks it's attached to an iommu domain that doesn't exist.
> > It should, of course, never happen and this isn't a performance path.
> > 
> > > > +
> > > > +	if (!iommu->domain || !device->attached)
> > > > +		return;
> 
> Well, the deal is that you BUG_ON earlier, but you check for it here.
> But the BUG_ON will stop execution , so the check 'if ..' is actually
> not needed.

The BUG_ON is a subtly different check:

domain | attached
-------+---------
   0   |   0     Nothing to do
   0   |   1     <--- BUG_ON, we're broken
   1   |   0     Nothing to do
   1   |   1     Do stuff

Writing out the truth table, I see now I could just make this:
   if (!attached) {return;}
since the BUG_ON takes care of the other case.

The reason for the laziness of allowing this to simply return is that if
we hit an error attaching an individual device within a group, we just
push the whole group back through __vfio_iommu_detach_group(), so some
devices may have never been attached.

> > > > +
> > > > +	iommu_detach_device(iommu->domain, device->dev);
> > > > +	device->attached = false;
> > > > +}
> > > > +
> > > > +static void __vfio_iommu_detach_group(struct vfio_iommu *iommu,
> > > > +				      struct vfio_group *group)
> > > > +{
> > > > +	struct list_head *pos;
> > > > +
> > > > +	list_for_each(pos, &group->device_list) {
> > > > +		struct vfio_device *device;
> > > > +
> > > > +		device = list_entry(pos, struct vfio_device, device_next);
> > > > +		__vfio_iommu_detach_dev(iommu, device);
> > > > +	}
> > > > +}
> > > > +
> > > > +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
> > > > +				   struct vfio_device *device)
> > > > +{
> > > > +	int ret;
> > > > +
> > > > +	BUG_ON(device->attached);
> > > 
> > > How about:
> > > 
> > > WARN_ON(device->attached, "The engineer who wrote the user-space device driver is trying to register
> > > the device again! Tell him/her to stop please.\n");
> > 
> > I would almost demote this one to a WARN_ON, but userspace isn't in
> > control of attaching and detaching devices from the iommu.  That's a
> > side effect of getting the iommu or device file descriptor.  So again,
> > this is an internal consistency check and it should never happen,
> > regardless of userspace.
> > 
> 
> Ok, then you might want to expand it to
> 
> BUG_ON(!device  || device->attached);
> 
> In case something has gone horribly wrong.

Impressive, that exceeds even my paranoia ;)  For that we would have had
to walk the group->device_list and come up with a NULL device pointer.
I think we can assume that won't happen.  I've also got this though:

        if (!iommu || !iommu->domain)
                return -EINVAL;

Which is effectively just being lazy without a good excuse like above.
That could probably be folded into the BUG_ON.
> 
> .. snip..
> > > > +		group->devt = MKDEV(MAJOR(vfio.devt), minor);
> > > > +		device_create(vfio.class, NULL, group->devt,
> > > > +			      group, "%u", groupid);
> > > > +
> > > > +		group->bus = dev->bus;
> > > 
> > > 
> > > Oh, so that is how the IOMMU iommu_ops get copied! You might
> > > want to mention that - I was not sure where the 'handoff' is
> > > was done to insert a device so that it can do iommu_ops properly.
> > > 
> > > Ok, so the time when a device is detected whether it can do
> > > IOMMU is when we try to open it - as that is when iommu_domain_alloc
> > > is called which can return NULL if the iommu_ops is not set.
> > > 
> > > So what about devices that don't have an iommu_ops? Say they
> > > are using SWIOTLB? (like the AMD-Vi sometimes does if the
> > > device is not on its list).
> > > 
> > > Can we use iommu_present?
> > 
> > I'm not sure I'm following your revelation ;)  Take a look at the
> 
> I am trying to figure out who sets the iommu_ops call on the devices.

The iommu driver registers ops with bus_set_iommu, so then we just need
to pass the bus pointer and iommu_ops figures out the rest.  If there's
no iommu_ops for a device or the iommu_ops doesn't implement the
device_group callback, it gets skipped by vfio and therefore won't be
usable by this interface.

> > pointer to iommu_device_group I pasted above, or these:
> > 
> > https://github.com/awilliam/linux-vfio/commit/37dd08c90d149caaed7779d4f38850a8f7ed0fa5
> > https://github.com/awilliam/linux-vfio/commit/63ca8543533d8130db23d7949133e548c3891c97
> > https://github.com/awilliam/linux-vfio/commit/8d7d70eb8e714fbf8710848a06f8cab0c741631e
> > 
> > That call includes an iommu_present() check, so if there's no iommu or
> > the iommu can't provide a groupid, the device is skipped over from vfio
> > (can't be used).
> > 
> > So the ordering is:
> > 
> >  - bus driver registers device
> >    - if it has an iommu group, add it to the vfio device/group tracking
> > 
> >  - group gets opened
> >    - user gets iommu or device fd results in iommu_domain_alloc
> > 
> > Devices without iommu_ops don't get to play in the vfio world.
> 
> Right, and I think the answer of which devices get iommu_ops is done via
> bus_set_iommu.

Exactly.

> (Thinking in long-term of what would be required to make this work
> with Xen and it sounds like I will need to implement a Xen IOMMU driver)

Yeah, that would make sense.  Thanks!

Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html