Hi Alex,

On 07/02/18 17:57, Alex Williamson wrote:
> On Wed, 7 Feb 2018 16:46:19 +0100
> Auger Eric <eric.auger@xxxxxxxxxx> wrote:
> 
>> Hi Alex,
>>
>> On 07/02/18 01:08, Alex Williamson wrote:
>>> The ioeventfd here is actually irqfd handling of an ioeventfd such as
>>> supported in KVM. A user is able to pre-program a device write to
>>> occur when the eventfd triggers. This is yet another instance of
>>> eventfd-irqfd triggering between KVM and vfio. The impetus for this
>>> is high frequency writes to pages which are virtualized in QEMU.
>>> Enabling this near-direct write path for selected registers within
>>> the virtualized page can improve performance and reduce overhead.
>>> Specifically this is initially targeted at NVIDIA graphics cards where
>>> the driver issues a write to an MMIO register within a virtualized
>>> region in order to allow the MSI interrupt to re-trigger.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
>>
>> FYI, it no longer applies on master (conflict in
>> include/uapi/linux/vfio.h on the GFX stuff)
> 
> Sorry, I should have noted that this was against v4.15, I didn't want
> the churn of the merge window since I was benchmarking against this.
> Will update for non-RFC.
> 
> ...
>>> +long vfio_pci_ioeventfd(struct vfio_pci_device *vdev, loff_t offset,
>>> +			uint64_t data, int count, int fd)
>>> +{
>>> +	struct pci_dev *pdev = vdev->pdev;
>>> +	loff_t pos = offset & VFIO_PCI_OFFSET_MASK;
>>> +	int ret, bar = VFIO_PCI_OFFSET_TO_INDEX(offset);
>>> +	struct vfio_pci_ioeventfd *ioeventfd;
>>> +	int (*handler)(void *, void *);
>>> +	unsigned long val;
>>> +
>>> +	/* Only support ioeventfds into BARs */
>>> +	if (bar > VFIO_PCI_BAR5_REGION_INDEX)
>>> +		return -EINVAL;
>>> +
>>> +	if (pos + count > pci_resource_len(pdev, bar))
>>> +		return -EINVAL;
>>> +
>>> +	/* Disallow ioeventfds working around MSI-X table writes */
>>> +	if (bar == vdev->msix_bar &&
>>> +	    !(pos + count <= vdev->msix_offset ||
>>> +	      pos >= vdev->msix_offset + vdev->msix_size))
>>> +		return -EINVAL;
>>> +
>>> +	switch (count) {
>>> +	case 1:
>>> +		handler = &vfio_pci_ioeventfd_handler8;
>>> +		val = data;
>>> +		break;
>>> +	case 2:
>>> +		handler = &vfio_pci_ioeventfd_handler16;
>>> +		val = le16_to_cpu(data);
>>> +		break;
>>> +	case 4:
>>> +		handler = &vfio_pci_ioeventfd_handler32;
>>> +		val = le32_to_cpu(data);
>>> +		break;
>>> +#ifdef iowrite64
>>> +	case 8:
>>> +		handler = &vfio_pci_ioeventfd_handler64;
>>> +		val = le64_to_cpu(data);
>>> +		break;
>>> +#endif
>>> +	default:
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	ret = vfio_pci_setup_barmap(vdev, bar);
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	mutex_lock(&vdev->ioeventfds_lock);
>>> +
>>> +	list_for_each_entry(ioeventfd, &vdev->ioeventfds_list, next) {
>>> +		if (ioeventfd->pos == pos && ioeventfd->bar == bar &&
>>> +		    ioeventfd->data == data && ioeventfd->count == count) {
>>> +			if (fd == -1) {
>>> +				vfio_virqfd_disable(&ioeventfd->virqfd);
>>> +				list_del(&ioeventfd->next);
>>> +				kfree(ioeventfd);
>>> +				ret = 0;
>>> +			} else
>>> +				ret = -EEXIST;
>>> +
>>> +			goto out_unlock;
>>> +		}
>>> +	}
>>> +
>>> +	if (fd < 0) {
>>> +		ret = -ENODEV;
>>> +		goto out_unlock;
>>> +	}
>>> +
>>> +	ioeventfd = kzalloc(sizeof(*ioeventfd), GFP_KERNEL);
>>> +	if (!ioeventfd) {
>>> +		ret = -ENOMEM;
>>> +		goto out_unlock;
>>> +	}
>>> +
>>> +	ioeventfd->pos = pos;
>>> +	ioeventfd->bar = bar;
>>> +	ioeventfd->data = data;
>>> +	ioeventfd->count = count;
>>> +
>>> +	ret = vfio_virqfd_enable(vdev->barmap[ioeventfd->bar] + ioeventfd->pos,
>>
>> nit: bar and pos could be used directly
> 
> Indeed, probably leftover from development. Fixed and re-wrapped the
> following lines.
> 
>>> +				 handler, NULL, (void *)val,
>>> +				 &ioeventfd->virqfd, fd);
>>> +	if (ret) {
>>> +		kfree(ioeventfd);
>>> +		goto out_unlock;
>>> +	}
>>> +
>>> +	list_add(&ioeventfd->next, &vdev->ioeventfds_list);
>>> +
>>> +out_unlock:
>>> +	mutex_unlock(&vdev->ioeventfds_lock);
>>> +
>>> +	return ret;
>>> +}
>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>> index e3301dbd27d4..07966a5f0832 100644
>>> --- a/include/uapi/linux/vfio.h
>>> +++ b/include/uapi/linux/vfio.h
>>> @@ -503,6 +503,30 @@ struct vfio_pci_hot_reset {
>>>  
>>>  #define VFIO_DEVICE_PCI_HOT_RESET	_IO(VFIO_TYPE, VFIO_BASE + 13)
>>>  
>>> +/**
>>> + * VFIO_DEVICE_IOEVENTFD - _IOW(VFIO_TYPE, VFIO_BASE + 14,
>>> + *                              struct vfio_device_ioeventfd)
>>> + *
>>> + * Perform a write to the device at the specified device fd offset, with
>>> + * the specified data and width when the provided eventfd is triggered.
>>
>> Don't you want to document the limitation to BARs only, excluding the
>> MSI-X table?
> 
> Sure, we could. This is also just an acceleration interface, it's not
> required for functionality, so a user can probe the capabilities by
> trying to enable an ioeventfd to test for support. I don't really want
> to add a flag to each region to identify support or create yet another
> sparse map identifying which sections of regions are supported. We
> could enable this for PCI config space, but I didn't think it really
> made sense (and I was too lazy). Real PCI config space (not MMIO
> mirrors of config space on GPUs) should never be a performance path,
> therefore I question whether there's any ROI for the code to support
> it. Device specific regions would need to be handled on a case by case
> basis, and I don't think we have any cases yet where it makes sense
> (IIRC the IGD regions are read-only). Of course ROMs are read-only, so
> it doesn't make sense to support them.
> 
> I also note that this patch of course only supports directly assigned
> vfio-pci devices, not vfio-platform and not mdev devices. Since the
> ioctl is intended to be device agnostic, maybe we could have a default
> handler that simply uses the device file write interface. At least one
> issue there is that those expect a user buffer. I'll look into how I
> might add the support more generically, if perhaps less optimized.

OK

> Does it seem like a sufficient user API to try an ioeventfd and be
> prepared for it to fail? Given that this is a new ioctl, users need to
> do that for backwards compatibility anyway.

Looks OK to me.

> Something I forgot to mention in my rush to send this out is that I
> debated whether to create a new ioctl or re-use the SET_IRQS ioctl. In
> the end, I couldn't resolve how to handle the index, start, and count
> fields, so I created this new ioctl. Would we create an ioeventfd
> index? We couldn't simply pass an array of ioeventfds since each needs
> to be associated with an offset, data, and size, so we'd need a new
> data type with a structure encompassing those fields. In the end it
> obviously didn't make sense to me to re-use SET_IRQS, but if others
> can come up with something that makes sense, I'm open to it. Thanks,

As far as I am concerned, I prefer this API rather than the SET_IRQS
one ;-)

Thanks

Eric

> 
> Alex
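
To make the probe-and-fall-back flow discussed above concrete, a minimal
userspace sketch might look like the following. It is hypothetical, not
part of the patch: the struct layout and the width flag are assumptions
pieced together from the quoted doc comment and the kernel handler's
(offset, data, count, fd) parameters, since the RFC's uapi additions are
not fully quoted in this thread.

/*
 * Hypothetical userspace sketch: probe VFIO_DEVICE_IOEVENTFD and fall
 * back to trapped writes if it fails.  On a patched kernel the struct
 * and ioctl number would come from <linux/vfio.h>; the definitions
 * below are assumed stand-ins for illustration only.
 */
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/eventfd.h>
#include <linux/vfio.h>			/* VFIO_TYPE, VFIO_BASE */

struct vfio_device_ioeventfd {		/* assumed layout */
	uint32_t argsz;
	uint32_t flags;			/* assumed: write width encoded here */
	uint64_t offset;		/* device fd offset of the register */
	uint64_t data;			/* value written when the eventfd fires */
	int32_t  fd;			/* eventfd to watch, or -1 to tear down */
};

#ifndef VFIO_DEVICE_IOEVENTFD		/* number from the quoted doc comment */
#define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 14)
#endif

#define IOEVENTFD_WIDTH_32		(1 << 2)	/* assumed flag value */

/*
 * Try to arm the fast path for one 4-byte register write; return the
 * eventfd on success, or -1 so the caller keeps the trapped-write path.
 */
static int try_ioeventfd(int device_fd, uint64_t offset, uint64_t data)
{
	struct vfio_device_ioeventfd ie = {
		.argsz  = sizeof(ie),
		.flags  = IOEVENTFD_WIDTH_32,
		.offset = offset,
		.data   = data,
	};
	int efd = eventfd(0, EFD_CLOEXEC);

	if (efd < 0)
		return -1;

	ie.fd = efd;
	if (ioctl(device_fd, VFIO_DEVICE_IOEVENTFD, &ie) < 0) {
		/* Old kernel, non-BAR offset, or MSI-X overlap: fall back. */
		close(efd);
		return -1;
	}

	return efd;	/* e.g. hand to KVM via KVM_IOEVENTFD */
}

On success the eventfd can be registered with KVM as an ioeventfd for
the guest's write to the virtualized page, so the trigger never exits
to userspace; on failure the caller simply keeps the existing trapped
write path, which is exactly the backwards-compatibility probing the
thread describes.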