Hi Alex,

On 07/02/18 17:57, Alex Williamson wrote:
> On Wed, 7 Feb 2018 16:46:19 +0100
> Auger Eric <eric.auger@xxxxxxxxxx> wrote:
> 
>> Hi Alex,
>>
>> On 07/02/18 01:08, Alex Williamson wrote:
>>> The ioeventfd here is actually irqfd handling of an ioeventfd such as
>>> supported in KVM. A user is able to pre-program a device write to
>>> occur when the eventfd triggers. This is yet another instance of
>>> eventfd-irqfd triggering between KVM and vfio. The impetus for this
>>> is high frequency writes to pages which are virtualized in QEMU.
>>> Enabling this near-direct write path for selected registers within
>>> the virtualized page can improve performance and reduce overhead.
>>> Specifically this is initially targeted at NVIDIA graphics cards where
>>> the driver issues a write to an MMIO register within a virtualized
>>> region in order to allow the MSI interrupt to re-trigger.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
>>
>> FYI, it no longer applies on master (conflict in
>> include/uapi/linux/vfio.h on the GFX stuff)
> 
> Sorry, I should have noted that this was against v4.15, I didn't want
> the churn of the merge window since I was benchmarking against this.
> Will update for non-RFC.
> 
> ...
>>> +long vfio_pci_ioeventfd(struct vfio_pci_device *vdev, loff_t offset,
>>> +			uint64_t data, int count, int fd)
>>> +{
>>> +	struct pci_dev *pdev = vdev->pdev;
>>> +	loff_t pos = offset & VFIO_PCI_OFFSET_MASK;
>>> +	int ret, bar = VFIO_PCI_OFFSET_TO_INDEX(offset);
>>> +	struct vfio_pci_ioeventfd *ioeventfd;
>>> +	int (*handler)(void *, void *);
>>> +	unsigned long val;
>>> +
>>> +	/* Only support ioeventfds into BARs */
>>> +	if (bar > VFIO_PCI_BAR5_REGION_INDEX)
>>> +		return -EINVAL;
>>> +
>>> +	if (pos + count > pci_resource_len(pdev, bar))
>>> +		return -EINVAL;
>>> +
>>> +	/* Disallow ioeventfds working around MSI-X table writes */
>>> +	if (bar == vdev->msix_bar &&
>>> +	    !(pos + count <= vdev->msix_offset ||
>>> +	      pos >= vdev->msix_offset + vdev->msix_size))
>>> +		return -EINVAL;
>>> +
>>> +	switch (count) {
>>> +	case 1:
>>> +		handler = &vfio_pci_ioeventfd_handler8;
>>> +		val = data;
>>> +		break;
>>> +	case 2:
>>> +		handler = &vfio_pci_ioeventfd_handler16;
>>> +		val = le16_to_cpu(data);
>>> +		break;
>>> +	case 4:
>>> +		handler = &vfio_pci_ioeventfd_handler32;
>>> +		val = le32_to_cpu(data);
>>> +		break;
>>> +#ifdef iowrite64
>>> +	case 8:
>>> +		handler = &vfio_pci_ioeventfd_handler64;
>>> +		val = le64_to_cpu(data);
>>> +		break;
>>> +#endif
>>> +	default:
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	ret = vfio_pci_setup_barmap(vdev, bar);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	mutex_lock(&vdev->ioeventfds_lock);
>>> +
>>> +	list_for_each_entry(ioeventfd, &vdev->ioeventfds_list, next) {
>>> +		if (ioeventfd->pos == pos && ioeventfd->bar == bar &&
>>> +		    ioeventfd->data == data && ioeventfd->count == count) {
>>> +			if (fd == -1) {
>>> +				vfio_virqfd_disable(&ioeventfd->virqfd);
>>> +				list_del(&ioeventfd->next);
>>> +				kfree(ioeventfd);
>>> +				ret = 0;
>>> +			} else
>>> +				ret = -EEXIST;
>>> +
>>> +			goto out_unlock;
>>> +		}
>>> +	}
>>> +
>>> +	if (fd < 0) {
>>> +		ret = -ENODEV;
>>> +		goto out_unlock;
>>> +	}
>>> +
>>> +	ioeventfd = kzalloc(sizeof(*ioeventfd), GFP_KERNEL);
>>> +	if (!ioeventfd) {
>>> +		ret = -ENOMEM;
>>> +		goto out_unlock;
>>> +	}
>>> +
>>> +	ioeventfd->pos = pos;
>>> +	ioeventfd->bar = bar;
>>> +	ioeventfd->data = data;
>>> +	ioeventfd->count = count;
>>> +
>>> +	ret = vfio_virqfd_enable(vdev->barmap[ioeventfd->bar] + ioeventfd->pos,
>>
>> nit: bar and pos could be used directly
> 
> Indeed, probably leftover from development. Fixed and re-wrapped the
> following lines.
> 
>>> +				 handler, NULL, (void *)val,
>>> +				 &ioeventfd->virqfd, fd);
>>> +	if (ret) {
>>> +		kfree(ioeventfd);
>>> +		goto out_unlock;
>>> +	}
>>> +
>>> +	list_add(&ioeventfd->next, &vdev->ioeventfds_list);
>>> +
>>> +out_unlock:
>>> +	mutex_unlock(&vdev->ioeventfds_lock);
>>> +
>>> +	return ret;
>>> +}
>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>> index e3301dbd27d4..07966a5f0832 100644
>>> --- a/include/uapi/linux/vfio.h
>>> +++ b/include/uapi/linux/vfio.h
>>> @@ -503,6 +503,30 @@ struct vfio_pci_hot_reset {
>>>  
>>>  #define VFIO_DEVICE_PCI_HOT_RESET	_IO(VFIO_TYPE, VFIO_BASE + 13)
>>>  
>>> +/**
>>> + * VFIO_DEVICE_IOEVENTFD - _IOW(VFIO_TYPE, VFIO_BASE + 14,
>>> + *                              struct vfio_device_ioeventfd)
>>> + *
>>> + * Perform a write to the device at the specified device fd offset, with
>>> + * the specified data and width when the provided eventfd is triggered.
>>
>> Don't you want to document the limitation to BARs only, excluding the
>> MSI-X table?
> 
> Sure, we could. This is also just an acceleration interface, it's not
> required for functionality, so a user can probe the capabilities by
> trying to enable an ioeventfd to test for support. I don't really want
> to add a flag to each region to identify support or create yet another
> sparse map identifying which sections of regions are supported. We
> could enable this for PCI config space, but I didn't think it really
> made sense (and I was too lazy). Real PCI config space (not MMIO
> mirrors of config space on GPUs) should never be a performance path,
> therefore I question whether there's any ROI for the code to support
> it. Device specific regions would need to be handled on a case by case
> basis, and I don't think we have any cases yet where it makes sense
> (IIRC the IGD regions are read-only). Of course ROMs are read-only, so
> it doesn't make sense to support them.
> 
> I also note that this patch of course only supports directly assigned
> vfio-pci devices, not vfio-platform and not mdev devices. Since the
> ioctl is intended to be device agnostic, maybe we could have a default
> handler that simply uses the device file write interface. At least one
> issue there is that those expect a user buffer. I'll look into how I
> might add the support more generically, if perhaps less optimized.

OK

> Does it seem like a sufficient user API to try an ioeventfd and be
> prepared for it to fail? Given that this is a new ioctl, users need to
> do that for backwards compatibility anyway.

Looks OK to me.

> Something I forgot to mention in my rush to send this out is that I
> debated whether to create a new ioctl or re-use the SET_IRQS ioctl. In
> the end, I couldn't resolve how to handle the index, start, and count
> fields, so I created this new ioctl. Would we create an ioeventfd
> index? We couldn't simply pass an array of ioeventfds since each needs
> to be associated with an offset, data, and size, so we'd need a new
> data type with a structure encompassing those fields. In the end it
> obviously didn't make sense to me to re-use SET_IRQS, but if others
> can come up with something that makes sense, I'm open to it. Thanks,

As far as I am concerned, I prefer this API rather than the SET_IRQS
one ;-)

Thanks

Eric

> 
> Alex
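
To make the probe-and-fall-back flow discussed above concrete, a minimal
userspace sketch might look like the following. It is hypothetical, not
part of the patch: the struct layout and the width flag are assumptions
pieced together from the quoted doc comment and the kernel handler's
(offset, data, count, fd) parameters, since the RFC's uapi additions are
not fully quoted in this thread.

/*
 * Hypothetical userspace sketch: probe VFIO_DEVICE_IOEVENTFD and fall
 * back to trapped writes if it fails.  On a patched kernel the struct
 * and ioctl number would come from <linux/vfio.h>; the definitions
 * below are assumed stand-ins for illustration only.
 */
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/eventfd.h>
#include <linux/vfio.h>			/* VFIO_TYPE, VFIO_BASE */

struct vfio_device_ioeventfd {		/* assumed layout */
	uint32_t argsz;
	uint32_t flags;			/* assumed: write width encoded here */
	uint64_t offset;		/* device fd offset of the register */
	uint64_t data;			/* value written when the eventfd fires */
	int32_t  fd;			/* eventfd to watch, or -1 to tear down */
};

#ifndef VFIO_DEVICE_IOEVENTFD		/* number from the quoted doc comment */
#define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 14)
#endif

#define IOEVENTFD_WIDTH_32		(1 << 2)	/* assumed flag value */

/*
 * Try to arm the fast path for one 4-byte register write; return the
 * eventfd on success, or -1 so the caller keeps the trapped-write path.
 */
static int try_ioeventfd(int device_fd, uint64_t offset, uint64_t data)
{
	struct vfio_device_ioeventfd ie = {
		.argsz  = sizeof(ie),
		.flags  = IOEVENTFD_WIDTH_32,
		.offset = offset,
		.data   = data,
	};
	int efd = eventfd(0, EFD_CLOEXEC);

	if (efd < 0)
		return -1;

	ie.fd = efd;
	if (ioctl(device_fd, VFIO_DEVICE_IOEVENTFD, &ie) < 0) {
		/* Old kernel, non-BAR offset, or MSI-X overlap: fall back. */
		close(efd);
		return -1;
	}

	return efd;	/* e.g. hand to KVM via KVM_IOEVENTFD */
}

On success the eventfd can be registered with KVM as an ioeventfd for
the guest's write to the virtualized page, so the trigger never exits
to userspace; on failure the caller simply keeps the existing trapped
write path, which is exactly the backwards-compatibility probing the
thread describes.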