Re: [PATCH v6 2/4] vfio: VFIO driver for mediated PCI device

Kirti Wankhede <kwankhede@xxxxxxxxxx> · Thu, 11 Aug 2016 21:29:35 +0530

On 8/11/2016 4:30 AM, Alex Williamson wrote:
> On Thu, 11 Aug 2016 02:53:10 +0530
> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
> 
>> On 8/10/2016 12:30 AM, Alex Williamson wrote:
>>> On Thu, 4 Aug 2016 00:33:52 +0530
>>> Kirti Wankhede <kwankhede@xxxxxxxxxx> wrote:
>>>   
>>
>> ...
>>
>>>> +
>>>> +		switch (info.index) {
>>>> +		case VFIO_PCI_CONFIG_REGION_INDEX:
>>>> +		case VFIO_PCI_BAR0_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
>>>> +			info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);  
>>>
>>> No, vmdev->vfio_region_info[info.index].offset
>>>  
>>
>> Ok.
>>
>>>> +			info.size = vmdev->vfio_region_info[info.index].size;
>>>> +			if (!info.size) {
>>>> +				info.flags = 0;
>>>> +				break;
>>>> +			}
>>>> +
>>>> +			info.flags = vmdev->vfio_region_info[info.index].flags;
>>>> +			break;
>>>> +		case VFIO_PCI_VGA_REGION_INDEX:
>>>> +		case VFIO_PCI_ROM_REGION_INDEX:  
>>>
>>> Why?  Let the vendor driver decide.
>>>   
>>
>> Ok.
>>
>>>> +		switch (info.index) {
>>>> +		case VFIO_PCI_INTX_IRQ_INDEX ... VFIO_PCI_MSI_IRQ_INDEX:
>>>> +		case VFIO_PCI_REQ_IRQ_INDEX:
>>>> +			break;
>>>> +			/* pass thru to return error */
>>>> +		case VFIO_PCI_MSIX_IRQ_INDEX:  
>>>
>>> ???  
>>
>> Sorry, I missed to update this. Updating it.
>>
>>>> +	case VFIO_DEVICE_SET_IRQS:
>>>> +	{  
>> ...
>>>> +
>>>> +		if (parent->ops->set_irqs)
>>>> +			ret = parent->ops->set_irqs(mdev, hdr.flags, hdr.index,
>>>> +						    hdr.start, hdr.count, data);
>>>> +
>>>> +		kfree(ptr);
>>>> +		return ret;  
>>>
>>> Return success if no set_irqs callback?
>>>  
>>
>> Ideally, vendor driver should provide this function. If vendor driver
>> doesn't provide it, do we really need to fail here?
> 
> Wouldn't you as a user expect to get an error if you try to call an
> ioctl that has no backing rather than assume success and never receive
> and interrupt?
>  

If we really don't want to proceed if set_irqs() is not provided then
its better to add it in mandatory list in mdev_register_device() in
mdev_core.c and fail earlier, i.e. fail to register the device.

>>>> +static ssize_t vfio_mpci_read(void *device_data, char __user *buf,
>>>> +			      size_t count, loff_t *ppos)
>>>> +{
>>>> +	struct vfio_mdev *vmdev = device_data;
>>>> +	struct mdev_device *mdev = vmdev->mdev;
>>>> +	struct parent_device *parent = mdev->parent;
>>>> +	int ret = 0;
>>>> +
>>>> +	if (!count)
>>>> +		return 0;
>>>> +
>>>> +	if (parent->ops->read) {
>>>> +		char *ret_data, *ptr;
>>>> +
>>>> +		ptr = ret_data = kzalloc(count, GFP_KERNEL);  
>>>
>>> Do we really need to support arbitrary lengths in one shot?  Seems like
>>> we could just use a 4 or 8 byte variable on the stack and iterate until
>>> done.
>>>   
>>
>> We just want to pass the arguments to vendor driver as is here. Vendor
>> driver could take care of that.
> 
> But I think this is exploitable, it lets the user make the kernel
> allocate an arbitrarily sized buffer.
>  
>>>> +
>>>> +static ssize_t vfio_mpci_write(void *device_data, const char __user *buf,
>>>> +			       size_t count, loff_t *ppos)
>>>> +{
>>>> +	struct vfio_mdev *vmdev = device_data;
>>>> +	struct mdev_device *mdev = vmdev->mdev;
>>>> +	struct parent_device *parent = mdev->parent;
>>>> +	int ret = 0;
>>>> +
>>>> +	if (!count)
>>>> +		return 0;
>>>> +
>>>> +	if (parent->ops->write) {
>>>> +		char *usr_data, *ptr;
>>>> +
>>>> +		ptr = usr_data = memdup_user(buf, count);  
>>>
>>> Same here, how much do we care to let the user write in one pass and is
>>> there any advantage to it?  When QEMU is our userspace we're only
>>> likely to see 4-byte accesses anyway.  
>>
>> Same as above.
>>
>>>> +
>>>> +static int mdev_dev_mmio_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>>>> +{  
>> ...
>>>> +	} else {
>>>> +		struct pci_dev *pdev;
>>>> +
>>>> +		virtaddr = vma->vm_start;
>>>> +		req_size = vma->vm_end - vma->vm_start;
>>>> +
>>>> +		pdev = to_pci_dev(parent->dev);
>>>> +		index = VFIO_PCI_OFFSET_TO_INDEX(vma->vm_pgoff << PAGE_SHIFT);  
>>>
>>> Iterate through region_info[*].offset/size provided by vendor driver.
>>>   
>>
>> Yes, makes sense.
>>
>>>> +
>>>> +int vfio_mpci_match(struct device *dev)
>>>> +{
>>>> +	if (dev_is_pci(dev->parent))  
>>>
>>> This is the wrong test, there's really no requirement that a pci mdev
>>> device is hosted by a real pci device.    
>>
>> Ideally this module is for the mediated device whose parent is PCI
>> device. And we are relying on kernel functions like
>> pci_resource_start(), to_pci_dev() in this module, so better to check it
>> while loading.
> 
> IMO, we don't want to care what the parent device is, it's not ideal,
> it's actually a limitation to impose that it is a PCI device.  I want to
> be able to make purely virtual mediated devices.  I only see that you
> use these functions in the mmio fault handling.  Is it useful to assume
> that on mmio fault we map to the parent device PCI BAR regions?  Just
> require that the vendor driver provides a fault mapping function or
> SIGBUS if we get a fault and it doesn't.
> 
>>> Can't we check that the device
>>> is on an mdev_pci_bus_type?
>>>   
>>
>> I didn't get this part.
>>
>> Each mediated device is of mdev_bus_type. But VFIO module could be
>> different based on parent device type and loaded at the same time. For
>> example, there should be different modules for channel IO or any other
>> type of devices and could be loaded at the same time. Then when mdev
>> device is created based on check in match() function of each module, and
>> proper driver would be linked for that mdev device.
>>
>> If this check is not based on parent device type, do you expect to set
>> parent device type by vendor driver and accordingly load corresponding
>> VFIO driver?
> 
> mdev_pci_bus_type was an off the cuff response since the driver.bus
> controls which devices a probe function will see.  If we have a unique
> bus for a driver and create devices appropriately, we really don't
> even need a match function. 

I still think that all types of mdev devices should have unique bus type
so that VFIO IOMMU module could be used for any type of mediated device
without any change. Otherwise we have to add checks for all supported
bus types in vfio_iommu_type1_attach_group().

> That would still work, but what if you
> made a get_device_info callback to the vendor driver rather than
> creating that info in the mediated bus driver layer.  Then the probe
> function here could simply check the flags to see if the device is
> VFIO_DEVICE_FLAGS_PCI?
> 

Right. get_device_info() would be a mandatory callback and it would be
vendor driver's responsibility to return proper flag.

>>>> @@ -18,6 +18,7 @@
>>>>  #include <linux/uaccess.h>
>>>>  #include <linux/io.h>
>>>>  #include <linux/vgaarb.h>
>>>> +#include <linux/vfio.h>
>>>>  
>>>>  #include "vfio_pci_private.h"
>>>>  
>>>> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
>>>> index 0ecae0b1cd34..431b824b0d3e 100644
>>>> --- a/include/linux/vfio.h
>>>> +++ b/include/linux/vfio.h
>>>> @@ -18,6 +18,13 @@
>>>>  #include <linux/poll.h>
>>>>  #include <uapi/linux/vfio.h>
>>>>  
>>>> +#define VFIO_PCI_OFFSET_SHIFT   40
>>>> +
>>>> +#define VFIO_PCI_OFFSET_TO_INDEX(off)   (off >> VFIO_PCI_OFFSET_SHIFT)
>>>> +#define VFIO_PCI_INDEX_TO_OFFSET(index) ((u64)(index) << VFIO_PCI_OFFSET_SHIFT)
>>>> +#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
>>>> +
>>>> +  
>>>
>>> Nak this, I'm not interested in making this any sort of ABI.
>>>   
>>
>> These macros are used by drivers/vfio/pci/vfio_pci.c and
>> drivers/vfio/mdev/vfio_mpci.c and to use those in both these modules,
>> they should be moved to common place as you suggested in earlier
>> reviews. I think this is better common place. Are there any other
>> suggestion?
> 
> They're only used in ways that I objected to above and you've agreed
> to.  These define implementation details that must not become part of
> the mediated vendor driver ABI.  A vendor driver is free to redefine
> this the same if they want, but as we can see with how easily they slip
> into code where they don't belong, the only way to make sure they don't
> become ABI is to keep them in private headers.
>  

Then I think, I can't use these macros in mdev modules, they are defined
in drivers/vfio/pci/vfio_pci_private.h
I have to define similar macros in drivers/vfio/mdev/mdev_private.h?

parent->ops->get_region_info() is called from vfio_mpci_open() that is
before PCI config space is setup. Main expectation from
get_region_info() was to get flags and size. At this point of time
vendor driver also don't know about the base addresses of regions.

    case VFIO_DEVICE_GET_REGION_INFO:
...

        info.offset = vmdev->vfio_region_info[info.index].offset;

In that case, as suggested in previous reply, above is not going to work.
I'll define such macros in drivers/vfio/mdev/mdev_private.h, set above
offset according to these macros. Then on first access to any BAR
region, i.e. after PCI config space is populated, call
parent->ops->get_region_info() again so that
vfio_region_info[index].offset for all regions are set by vendor driver.
Then use these offsets to calculate 'pos' for
read/write/validate_map_request(). Does this seems reasonable?

Thanks,
Kirti

>>>> +static u8 mpci_find_pci_capability(struct mdev_device *mdev, u8 capability)
>>>> +{
>>>> +	loff_t pos = VFIO_PCI_INDEX_TO_OFFSET(VFIO_PCI_CONFIG_REGION_INDEX);  
>>>
>>> This creates a fixed ABI between vfio-mdev-pci and vendor drivers that
>>> a given region starts at a pre-defined offset.  We have the offset
>>> stored in vfio_mdev.region_info[VFIO_PCI_CONFIG_REGION_INDEX].offset,
>>> use it.  It's just as unacceptable to impose this fixed relationship
>>> with a vendor driver here as if a userspace driver were to do the same.
>>>   
>>
>> In the v5 version, where config space was cached in this module,
>> suggestion was to don't care about data or caching it at read/write,
>> just pass it through. Now since VFIO_PCI_* macros are also available
>> here, vendor driver can use it to decode pos to find region index and
>> offset of access. Then vendor driver itself add
>> vmdev->vfio_region_info[info.index].offset, which is known to him.
>> Either we do this in VFIO module or vendor driver?
> 
> As I say above, a vendor driver is absolutely free to use the same
> index/offset scheme, but it absolutely must not be part of the ABI
> between vendor drivers and the mediated driver core.  It's up to the
> vendor driver to define that relation and moving these to a common
> header is clearly too dangerous.  I'm sorry if I've said otherwise in
> the past, but I've only recently discovered a userspace driver (DPDK)
> copying these defines and ignoring the index offsets reported through
> the REGION_INFO API.  So I'm now bitterly aware how an internal
> implementation detail can be abused and if we don't catch them, it's
> going to lock us into an implementation that was designed to be
> flexible.  Thanks,
> 
> Alex
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html