Re: [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device

Alex Williamson <alex.williamson@xxxxxxxxxx> · Wed, 10 Aug 2016 17:00:44 -0600

On Thu, 11 Aug 2016 02:53:10 +0530
Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:

> On 8/10/2016 12:30 AM, Alex Williamson wrote:
> > On Thu, 4 Aug 2016 00:33:52 +0530
> > Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> >   
> 
> ...
> 
> >> +
> >> +		switch (info.index) {
> >> +		case VFIO_PCI_CONFIG_REGION_INDEX:
> >> +		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
> >> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);  
> > 
> > No, vmdev->vfio_region_info[info.index].offset
> >  
> 
> Ok.
> 
> >> +			info.size = vmdev->vfio_region_info[info.index].size;
> >> +			if (!info.size) {
> >> +				info.flags = 0;
> >> +				break;
> >> +			}
> >> +
> >> +			info.flags = vmdev->vfio_region_info[info.index].flags;
> >> +			break;
> >> +		case VFIO_PCI_VGA_REGION_INDEX:
> >> +		case VFIO_PCI_ROM_REGION_INDEX:  
> > 
> > Why?  Let the vendor driver decide.
> >   
> 
> Ok.
> 
> >> +		switch (info.index) {
> >> +		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
> >> +		case VFIO_PCI_REQ_IRQ_INDEX:
> >> +			break;
> >> +			/* pass thru to return error */
> >> +		case VFIO_PCI_MSIX_IRQ_INDEX:  
> > 
> > ???  
> 
> Sorry, I missed to update this. Updating it.
> 
> >> +	case VFIO_DEVICE_SET_IRQS:
> >> +	{  
> ...
> >> +
> >> +		if (parent->ops->set_irqs)
> >> +			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
> >> +						    hdr.start, hdr.count, data);
> >> +
> >> +		kfree(ptr);
> >> +		return ret;  
> > 
> > Return success if no set_irqs callback?
> >  
> 
> Ideally, vendor driver should provide this function. If vendor driver
> doesn't provide it, do we really need to fail here?

Wouldn't you as a user expect to get an error if you try to call an
ioctl that has no backing rather than assume success and never receive
and interrupt?

> >> +static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
> >> +			      size_t count, loff_t *ppos)
> >> +{
> >> +	struct vfio_mdev *vmdev = device_data;
> >> +	struct mdev_device *mdev = vmdev->mdev;
> >> +	struct parent_device *parent = mdev->parent;
> >> +	int ret = 0;
> >> +
> >> +	if (!count)
> >> +		return 0;
> >> +
> >> +	if (parent->ops->read) {
> >> +		char *ret_data, *ptr;
> >> +
> >> +		ptr = ret_data = kzalloc(count, GFP_KERNEL);  
> > 
> > Do we really need to support arbitrary lengths in one shot?  Seems like
> > we could just use a 4 or 8 byte variable on the stack and iterate until
> > done.
> >   
> 
> We just want to pass the arguments to vendor driver as is here. Vendor
> driver could take care of that.

But I think this is exploitable, it lets the user make the kernel
allocate an arbitrarily sized buffer.

> >> +
> >> +static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
> >> +			       size_t count, loff_t *ppos)
> >> +{
> >> +	struct vfio_mdev *vmdev = device_data;
> >> +	struct mdev_device *mdev = vmdev->mdev;
> >> +	struct parent_device *parent = mdev->parent;
> >> +	int ret = 0;
> >> +
> >> +	if (!count)
> >> +		return 0;
> >> +
> >> +	if (parent->ops->write) {
> >> +		char *usr_data, *ptr;
> >> +
> >> +		ptr = usr_data = memdup_user(buf, count);  
> > 
> > Same here, how much do we care to let the user write in one pass and is
> > there any advantage to it?  When QEMU is our userspace we're only
> > likely to see 4-byte accesses anyway.  
> 
> Same as above.
> 
> >> +
> >> +static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> >> +{  
> ...
> >> +	} else {
> >> +		struct pci_dev *pdev;
> >> +
> >> +		virtaddr = vma->vm_start;
> >> +		req_size = vma->vm_end - vma->vm_start;
> >> +
> >> +		pdev = to_pci_dev(parent->dev);
> >> +		index = VFIO_PCI_OFFSET_TO_INDEX(vma->vm_pgoff << PAGE_SHIFT);  
> > 
> > Iterate through region_info[*].offset/size provided by vendor driver.
> >   
> 
> Yes, makes sense.
> 
> >> +
> >> +int vfio_mpci_match(struct device *dev)
> >> +{
> >> +	if (dev_is_pci(dev->parent))  
> > 
> > This is the wrong test, there's really no requirement that a pci mdev
> > device is hosted by a real pci device.    
> 
> Ideally this module is for the mediated device whose parent is PCI
> device. And we are relying on kernel functions like
> pci_resource_start(), to_pci_dev() in this module, so better to check it
> while loading.

IMO, we don't want to care what the parent device is, it's not ideal,
it's actually a limitation to impose that it is a PCI device.  I want to
be able to make purely virtual mediated devices.  I only see that you
use these functions in the mmio fault handling.  Is it useful to assume
that on mmio fault we map to the parent device PCI BAR regions?  Just
require that the vendor driver provides a fault mapping function or
SIGBUS if we get a fault and it doesn't.

> > Can't we check that the device
> > is on an mdev_pci_bus_type?
> >   
> 
> I didn't get this part.
> 
> Each mediated device is of mdev_bus_type. But VFIO module could be
> different based on parent device type and loaded at the same time. For
> example, there should be different modules for channel IO or any other
> type of devices and could be loaded at the same time. Then when mdev
> device is created based on check in match() function of each module, and
> proper driver would be linked for that mdev device.
> 
> If this check is not based on parent device type, do you expect to set
> parent device type by vendor driver and accordingly load corresponding
> VFIO driver?

mdev_pci_bus_type was an off the cuff response since the driver.bus
controls which devices a probe function will see.  If we have a unique
bus for a driver and create devices appropriately, we really don't
even need a match function.  That would still work, but what if you
made a get_device_info callback to the vendor driver rather than
creating that info in the mediated bus driver layer.  Then the probe
function here could simply check the flags to see if the device is
VFIO_DEVICE_FLAGS_PCI?

> >> @@ -18,6 +18,7 @@
> >>  #include <linux/uaccess.h>
> >>  #include <linux/io.h>
> >>  #include <linux/vgaarb.h>
> >> +#include <linux/vfio.h>
> >>  
> >>  #include "vfio_pci_private.h"
> >>  
> >> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> >> index 0ecae0b1cd34..431b824b0d3e 100644
> >> --- a/include/linux/vfio.h
> >> +++ b/include/linux/vfio.h
> >> @@ -18,6 +18,13 @@
> >>  #include <linux/poll.h>
> >>  #include <uapi/linux/vfio.h>
> >>  
> >> +#define VFIO_PCI_OFFSET_SHIFT   40
> >> +
> >> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
> >> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
> >> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
> >> +
> >> +  
> > 
> > Nak this, I'm not interested in making this any sort of ABI.
> >   
> 
> These macros are used by drivers/vfio/pci/vfio_pci.c and
> drivers/vfio/mdev/vfio_mpci.c and to use those in both these modules,
> they should be moved to common place as you suggested in earlier
> reviews. I think this is better common place. Are there any other
> suggestion?

They're only used in ways that I objected to above and you've agreed
to.  These define implementation details that must not become part of
the mediated vendor driver ABI.  A vendor driver is free to redefine
this the same if they want, but as we can see with how easily they slip
into code where they don't belong, the only way to make sure they don't
become ABI is to keep them in private headers.

> >> +static u8 mpci_find_pci_capability(struct mdev_device *mdev, u8 capability)
> >> +{
> >> +	loff_t pos = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_CONFIG_REGION_INDEX);  
> > 
> > This creates a fixed ABI between vfio-mdev-pci and vendor drivers that
> > a given region starts at a pre-defined offset.  We have the offset
> > stored in vfio_mdev.region_info[VFIO_PCI_CONFIG_REGION_INDEX].offset,
> > use it.  It's just as unacceptable to impose this fixed relationship
> > with a vendor driver here as if a userspace driver were to do the same.
> >   
> 
> In the v5 version, where config space was cached in this module,
> suggestion was to don't care about data or caching it at read/write,
> just pass it through. Now since VFIO_PCI_* macros are also available
> here, vendor driver can use it to decode pos to find region index and
> offset of access. Then vendor driver itself add
> vmdev->vfio_region_info[info.index].offset, which is known to him.
> Either we do this in VFIO module or vendor driver?

As I say above, a vendor driver is absolutely free to use the same
index/offset scheme, but it absolutely must not be part of the ABI
between vendor drivers and the mediated driver core.  It's up to the
vendor driver to define that relation and moving these to a common
header is clearly too dangerous.  I'm sorry if I've said otherwise in
the past, but I've only recently discovered a userspace driver (DPDK)
copying these defines and ignoring the index offsets reported through
the REGION_INFO API.  So I'm now bitterly aware how an internal
implementation detail can be abused and if we don't catch them, it's
going to lock us into an implementation that was designed to be
flexible.  Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html