On 28.09.2011, at 04:40, Alex Williamson wrote:

> On Tue, 2011-09-27 at 16:28 -0500, Scott Wood wrote:
>> On 09/26/2011 07:45 PM, Alex Williamson wrote:
>>> On Mon, 2011-09-26 at 18:59 -0500, Scott Wood wrote:
>>>> On 09/26/2011 01:34 PM, Alex Williamson wrote:
>>>>> /* Reset the device */
>>>>> #define VFIO_DEVICE_RESET _IO(, ,)
>>>>
>>>> What generic way do we have to do this? We should probably have a way
>>>> to determine whether it's possible, without actually asking to do it.
>>>
>>> It's not generic, it could be a VFIO_DEVICE_PCI_RESET or we could add a
>>> bit to the device flags to indicate if it's available or we could add a
>>> "probe" arg to the ioctl to either check for existence or do it.
>>
>> Even with PCI, isn't this only possible if function-level reset is
>> supported?
>
> There are a couple other things we can do if FLR isn't present (D3hot
> transition, secondary bus reset, device specific resets are possible).
>
>> I think we need a flag.
>
> Ok, PCI has a pci_probe_reset_function() and pci_reset_function(). I'd
> probably mimic those in the vfio device ops. Common vfio code can probe
> the reset and set the flag appropriately and we can have a common
> VFIO_DEVICE_RESET ioctl that calls into the device ops reset function.
>
>> For devices that can't be reset by the kernel, we'll want the ability to
>> stop/start DMA access through the IOMMU (or other bus-specific means),
>> separate from whether the fd is open. If a device is assigned to a
>> partition and that partition gets reset, we'll want to disable DMA
>> before we re-use the memory, and enable it after the partition has reset
>> or quiesced the device (which requires the fd to be open).
>
> Maybe this can be accomplished via an iommu_detach_device() to
> temporarily disassociate it from the domain. We could also unmap all
> the DMA. Anyway, a few possibilities.
>
>>>>> /* PCI MSI setup, arg[0] = #, arg[1-n] = eventfds */
>>>>> #define VFIO_DEVICE_PCI_SET_MSI_EVENTFDS _IOW(, , int)
>>>>> #define VFIO_DEVICE_PCI_SET_MSIX_EVENTFDS _IOW(, , int)
>>>>>
>>>>> Hope that covers it.
>>>>
>>>> It could be done this way, but I predict that the code (both kernel and
>>>> user side) will be larger. Maybe not much more complex, but more
>>>> boilerplate.
>>>>
>>>> How will you manage extensions to the interface?
>>>
>>> I would assume we'd do something similar to the kvm capabilities checks.
>>
>> This information is already built into the data-structure approach.
>
> If we define it to be part of the flags, then it's built-in to the ioctl
> approach too...
>
>>>> The table should not be particularly large, and you'll need to keep the
>>>> information around in some form regardless. Maybe in the PCI case you
>>>> could produce it dynamically (though I probably wouldn't), but it really
>>>> wouldn't make sense in the device tree case.
>>>
>>> It would be entirely dynamic for PCI, there's no advantage to caching
>>> it. Even for device tree, if you can't fetch it dynamically, you'd have
>>> to duplicate it between an internal data structure and a buffer reading
>>> the table.
>>
>> I don't think we'd need to keep the device tree path/index info around
>> for anything but the table -- but really, this is a minor consideration.
>>
>>>> You also lose the ability to easily have a human look at the hexdump for
>>>> debugging; you'll need a special "lsvfio" tool. You might want one
>>>> anyway to pretty-print the info, but with ioctls it's mandatory.
>>>
>>> I don't think this alone justifies duplicating the data and making it
>>> difficult to parse on both ends. Chances are we won't need such a tool
>>> for the ioctl interface because it's easier to get it right the first
>>> time ;)
>>
>> It's not just useful for getting the code right, but for e.g. sanity
>> checking that the devices were bound properly. I think such a tool
>> would be generally useful, no matter what the kernel interface ends up
>> being. I don't just use lspci to debug the PCI subsystem. :-)
>
> This is also a minor consideration. Looking at hexdumps isn't much to
> rely on for debugging and if we take the step of writing a tool, it's
> not much harder to write for either interface. The table is more akin
> to dumping the data, but I feel the ioctl is easier for how a driver
> would probably make use of the data (linear vs random access).
>
>>> Note that I'm not stuck on this interface, I was just thinking about how
>>> to generate the table last week, it seemed like a pain so I thought I'd
>>> spend a few minutes outlining an ioctl interface... turns out it's not
>>> so bad. Thanks,
>>
>> Yeah, it can work either way, as long as the information's there and
>> there's a way to add new bits of information, or new bus types, down the
>> road. Mainly a matter of aesthetics between the two.
>
> It'd be nice if David would chime back in since he objected to the
> table. Does an ioctl interface look better? Alex Graf, any opinions?

I'm honestly pretty indifferent on ioctl vs. linear read. I got the
impression that people dislike ioctls for whatever reason, so we went
ahead and did the design based on read().

With KVM, ioctls are a constant pain to extend, but so are the constant
sized fields here. Whatever you do, please introduce a "flags" field to
every struct you use and add some padding at the end, so it can possibly
be extended.


Alex
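[Editor's note] To make that closing request concrete, here is a minimal sketch of the kind of extensible ioctl argument being asked for; the struct name and payload fields are invented purely for illustration, only the flags + padding pattern matters:

```c
#include <linux/types.h>

/* Invented example; only the flags + padding pattern matters. */
struct vfio_example_info {
	__u32	flags;		/* kernel sets bits to say which fields are valid */
	__u32	num_regions;	/* example payload field */
	__u32	num_irqs;	/* example payload field */
	__u32	padding[5];	/* spare room so the struct can grow in place */
};
```

A later revision can turn part of the padding into a real field and define a new flag bit to say it is filled in, so old userspace and old kernels keep interoperating without a new ioctl number or a struct size change.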
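[Editor's note] For the reset scheme discussed earlier in the thread (bus code wraps the PCI helpers behind per-device ops, common code probes once and advertises the result in the device flags), a minimal sketch might look like the following. All vfio names here (the ops struct, the flag, the accessor) are hypothetical; only pci_probe_reset_function() and pci_reset_function() are existing PCI core helpers named in the thread.

```c
#include <linux/pci.h>

/* Hypothetical "reset is available" flag in the device info flags. */
#define VFIO_DEVICE_FLAGS_RESET	(1 << 0)

struct vfio_device;				/* hypothetical per-device state */

struct vfio_device_ops {			/* hypothetical bus callbacks */
	int (*reset_probe)(struct vfio_device *vdev);	/* 0 => reset available */
	int (*reset)(struct vfio_device *vdev);
};

/* Assumed accessor from the hypothetical vfio_device to its pci_dev. */
struct pci_dev *vfio_device_to_pci_dev(struct vfio_device *vdev);

/* Assuming the PCI core's probe helper can be made visible to vfio. */
int pci_probe_reset_function(struct pci_dev *dev);

/* The PCI backend would just wrap the PCI core helpers: */
static int vfio_pci_reset_probe(struct vfio_device *vdev)
{
	return pci_probe_reset_function(vfio_device_to_pci_dev(vdev));
}

static int vfio_pci_reset(struct vfio_device *vdev)
{
	return pci_reset_function(vfio_device_to_pci_dev(vdev));
}
```

Common code would then set VFIO_DEVICE_FLAGS_RESET when reset_probe() returns 0, and the bus-agnostic VFIO_DEVICE_RESET ioctl would simply call vdev->ops->reset(), failing with something like -EINVAL when the op is missing.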
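[Editor's note] Similarly, for the stop/start-DMA point raised for devices the kernel cannot reset, the existing IOMMU API already allows something along these lines; the helper names are made up and the surrounding vfio state (which domain and device to use) is assumed:

```c
#include <linux/iommu.h>

/*
 * Temporarily cut off a device's DMA while the partition's memory is
 * recycled, then re-attach once the guest has quiesced the device.
 */
static void example_dma_disable(struct iommu_domain *domain, struct device *dev)
{
	/* Device is no longer translated by this domain's mappings. */
	iommu_detach_device(domain, dev);
}

static int example_dma_enable(struct iommu_domain *domain, struct device *dev)
{
	/* Re-attach; the domain's existing mappings apply to the device again. */
	return iommu_attach_device(domain, dev);
}
```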