Re: [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device

Alex Williamson <alex.williamson@xxxxxxxxxx> · Thu, 11 Aug 2016 10:24:21 -0600

On Thu, 11 Aug 2016 21:29:35 +0530
Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:

> On 8/11/2016 4:30 AM, Alex Williamson wrote:
> > On Thu, 11 Aug 2016 02:53:10 +0530
> > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> >   
> >> On 8/10/2016 12:30 AM, Alex Williamson wrote:  
> >>> On Thu, 4 Aug 2016 00:33:52 +0530
> >>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> >>>     
> >>
> >> ...
> >>  
> >>>> +
> >>>> +		switch (info.index) {
> >>>> +		case VFIO_PCI_CONFIG_REGION_INDEX:
> >>>> +		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> >>>> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);    
> >>>
> >>> No, vmdev->vfio_region_info[info.index].offset
> >>>    
> >>
> >> Ok.
> >>  
> >>>> +			info.size = vmdev->vfio_region_info[info.index].size;
> >>>> +			if (!info.size) {
> >>>> +				info.flags = 0;
> >>>> +				break;
> >>>> +			}
> >>>> +
> >>>> +			info.flags = vmdev->vfio_region_info[info.index].flags;
> >>>> +			break;
> >>>> +		case VFIO_PCI_VGA_REGION_INDEX:
> >>>> +		case VFIO_PCI_ROM_REGION_INDEX:    
> >>>
> >>> Why?  Let the vendor driver decide.
> >>>     
> >>
> >> Ok.
> >>  
> >>>> +		switch (info.index) {
> >>>> +		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
> >>>> +		case VFIO_PCI_REQ_IRQ_INDEX:
> >>>> +			break;
> >>>> +			/* pass thru to return error */
> >>>> +		case VFIO_PCI_MSIX_IRQ_INDEX:    
> >>>
> >>> ???    
> >>
> >> Sorry, I missed to update this. Updating it.
> >>  
> >>>> +	case VFIO_DEVICE_SET_IRQS:
> >>>> +	{    
> >> ...  
> >>>> +
> >>>> +		if (parent->ops->set_irqs)
> >>>> +			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
> >>>> +						    hdr.start, hdr.count, data);
> >>>> +
> >>>> +		kfree(ptr);
> >>>> +		return ret;    
> >>>
> >>> Return success if no set_irqs callback?
> >>>    
> >>
> >> Ideally, vendor driver should provide this function. If vendor driver
> >> doesn't provide it, do we really need to fail here?  
> > 
> > Wouldn't you as a user expect to get an error if you try to call an
> > ioctl that has no backing rather than assume success and never receive
> > and interrupt?
> >    
> 
> If we really don't want to proceed if set_irqs() is not provided then
> its better to add it in mandatory list in mdev_register_device() in
> mdev_core.c and fail earlier, i.e. fail to register the device.

Is a device required to implement some form of interrupt to be useful?
What if there's a memory-only device that does not report INTx or
provide MSI or MSI-X capabilities?  It could still be PCI spec
complaint.  Really though it's just a matter of whether we're going to
require the mediated driver to provide a set_irqs() stub or let them
skip it and return error ourselves.  Either is really fine with me, but
we can't return success for an ioctl that has no backing.

> >>>> +static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
> >>>> +			      size_t count, loff_t *ppos)
> >>>> +{
> >>>> +	struct vfio_mdev *vmdev = device_data;
> >>>> +	struct mdev_device *mdev = vmdev->mdev;
> >>>> +	struct parent_device *parent = mdev->parent;
> >>>> +	int ret = 0;
> >>>> +
> >>>> +	if (!count)
> >>>> +		return 0;
> >>>> +
> >>>> +	if (parent->ops->read) {
> >>>> +		char *ret_data, *ptr;
> >>>> +
> >>>> +		ptr = ret_data = kzalloc(count, GFP_KERNEL);    
> >>>
> >>> Do we really need to support arbitrary lengths in one shot?  Seems like
> >>> we could just use a 4 or 8 byte variable on the stack and iterate until
> >>> done.
> >>>     
> >>
> >> We just want to pass the arguments to vendor driver as is here. Vendor
> >> driver could take care of that.  
> > 
> > But I think this is exploitable, it lets the user make the kernel
> > allocate an arbitrarily sized buffer.
> >    
> >>>> +
> >>>> +static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
> >>>> +			       size_t count, loff_t *ppos)
> >>>> +{
> >>>> +	struct vfio_mdev *vmdev = device_data;
> >>>> +	struct mdev_device *mdev = vmdev->mdev;
> >>>> +	struct parent_device *parent = mdev->parent;
> >>>> +	int ret = 0;
> >>>> +
> >>>> +	if (!count)
> >>>> +		return 0;
> >>>> +
> >>>> +	if (parent->ops->write) {
> >>>> +		char *usr_data, *ptr;
> >>>> +
> >>>> +		ptr = usr_data = memdup_user(buf, count);    
> >>>
> >>> Same here, how much do we care to let the user write in one pass and is
> >>> there any advantage to it?  When QEMU is our userspace we're only
> >>> likely to see 4-byte accesses anyway.    
> >>
> >> Same as above.
> >>  
> >>>> +
> >>>> +static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> >>>> +{    
> >> ...  
> >>>> +	} else {
> >>>> +		struct pci_dev *pdev;
> >>>> +
> >>>> +		virtaddr = vma->vm_start;
> >>>> +		req_size = vma->vm_end - vma->vm_start;
> >>>> +
> >>>> +		pdev = to_pci_dev(parent->dev);
> >>>> +		index = VFIO_PCI_OFFSET_TO_INDEX(vma->vm_pgoff << PAGE_SHIFT);    
> >>>
> >>> Iterate through region_info[*].offset/size provided by vendor driver.
> >>>     
> >>
> >> Yes, makes sense.
> >>  
> >>>> +
> >>>> +int vfio_mpci_match(struct device *dev)
> >>>> +{
> >>>> +	if (dev_is_pci(dev->parent))    
> >>>
> >>> This is the wrong test, there's really no requirement that a pci mdev
> >>> device is hosted by a real pci device.      
> >>
> >> Ideally this module is for the mediated device whose parent is PCI
> >> device. And we are relying on kernel functions like
> >> pci_resource_start(), to_pci_dev() in this module, so better to check it
> >> while loading.  
> > 
> > IMO, we don't want to care what the parent device is, it's not ideal,
> > it's actually a limitation to impose that it is a PCI device.  I want to
> > be able to make purely virtual mediated devices.  I only see that you
> > use these functions in the mmio fault handling.  Is it useful to assume
> > that on mmio fault we map to the parent device PCI BAR regions?  Just
> > require that the vendor driver provides a fault mapping function or
> > SIGBUS if we get a fault and it doesn't.
> >   
> >>> Can't we check that the device
> >>> is on an mdev_pci_bus_type?
> >>>     
> >>
> >> I didn't get this part.
> >>
> >> Each mediated device is of mdev_bus_type. But VFIO module could be
> >> different based on parent device type and loaded at the same time. For
> >> example, there should be different modules for channel IO or any other
> >> type of devices and could be loaded at the same time. Then when mdev
> >> device is created based on check in match() function of each module, and
> >> proper driver would be linked for that mdev device.
> >>
> >> If this check is not based on parent device type, do you expect to set
> >> parent device type by vendor driver and accordingly load corresponding
> >> VFIO driver?  
> > 
> > mdev_pci_bus_type was an off the cuff response since the driver.bus
> > controls which devices a probe function will see.  If we have a unique
> > bus for a driver and create devices appropriately, we really don't
> > even need a match function.   
> 
> I still think that all types of mdev devices should have unique bus type
> so that VFIO IOMMU module could be used for any type of mediated device
> without any change. Otherwise we have to add checks for all supported
> bus types in vfio_iommu_type1_attach_group().

Good point, so perhaps the vendor driver reporting the type through
vfio_device_info.flags is the way to go.

> > That would still work, but what if you
> > made a get_device_info callback to the vendor driver rather than
> > creating that info in the mediated bus driver layer.  Then the probe
> > function here could simply check the flags to see if the device is
> > VFIO_DEVICE_FLAGS_PCI?
> >   
> 
> Right. get_device_info() would be a mandatory callback and it would be
> vendor driver's responsibility to return proper flag.

Yep, then we don't care what the parent device is, the flags will tell
us that the mediated device adheres to PCI and that's all we care about
for binding here.

> >>>> @@ -18,6 +18,7 @@
> >>>>  #include <linux/uaccess.h>
> >>>>  #include <linux/io.h>
> >>>>  #include <linux/vgaarb.h>
> >>>> +#include <linux/vfio.h>
> >>>>  
> >>>>  #include "vfio_pci_private.h"
> >>>>  
> >>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >>>> index 0ecae0b1cd34..431b824b0d3e 100644
> >>>> --- a/include/linux/vfio.h
> >>>> +++ b/include/linux/vfio.h
> >>>> @@ -18,6 +18,13 @@
> >>>>  #include <linux/poll.h>
> >>>>  #include <uapi/linux/vfio.h>
> >>>>  
> >>>> +#define VFIO_PCI_OFFSET_SHIFT   40
> >>>> +
> >>>> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
> >>>> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> >>>> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> >>>> +
> >>>> +    
> >>>
> >>> Nak this, I'm not interested in making this any sort of ABI.
> >>>     
> >>
> >> These macros are used by drivers/vfio/pci/vfio_pci.c and
> >> drivers/vfio/mdev/vfio_mpci.c and to use those in both these modules,
> >> they should be moved to common place as you suggested in earlier
> >> reviews. I think this is better common place. Are there any other
> >> suggestion?  
> > 
> > They're only used in ways that I objected to above and you've agreed
> > to.  These define implementation details that must not become part of
> > the mediated vendor driver ABI.  A vendor driver is free to redefine
> > this the same if they want, but as we can see with how easily they slip
> > into code where they don't belong, the only way to make sure they don't
> > become ABI is to keep them in private headers.
> >    
> 
> Then I think, I can't use these macros in mdev modules, they are defined
> in drivers/vfio/pci/vfio_pci_private.h
> I have to define similar macros in drivers/vfio/mdev/mdev_private.h?
> 
> parent->ops->get_region_info() is called from vfio_mpci_open() that is
> before PCI config space is setup. Main expectation from
> get_region_info() was to get flags and size. At this point of time
> vendor driver also don't know about the base addresses of regions.
> 
>     case VFIO_DEVICE_GET_REGION_INFO:
> ...
> 
>         info.offset = vmdev->vfio_region_info[info.index].offset;
> 
> In that case, as suggested in previous reply, above is not going to work.
> I'll define such macros in drivers/vfio/mdev/mdev_private.h, set above
> offset according to these macros. Then on first access to any BAR
> region, i.e. after PCI config space is populated, call
> parent->ops->get_region_info() again so that
> vfio_region_info[index].offset for all regions are set by vendor driver.
> Then use these offsets to calculate 'pos' for
> read/write/validate_map_request(). Does this seems reasonable?

This doesn't make any sense to me, there should be absolutely no reason
for the mid-layer mediated device infrastructure to impose region
offsets.  vfio-pci is a leaf driver, like the mediated vendor driver.
Only the leaf drivers can define how they layout the offsets within the
device file descriptor.  Being a VFIO_PCI device only defines region
indexes to resources, not offsets (ie. region 0 is BAR0, region 1 is
BAR1,... region 7 is PCI config space).  If this mid-layer even needs
to know region offsets, then caching them on opening the vendor device
is certainly sufficient.  Remember we're talking about the offset into
the vfio device file descriptor, how that potentially maps onto a
physical MMIO space later doesn't matter here.  It seems like maybe
we're confusing those points.  Anyway, the more I hear about needing to
reproduce these INDEX/OFFSET translation macros in places they
shouldn't be used, the more confident I am in keeping them private.
Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html