On Fri, Nov 11, 2011 at 03:10:56PM -0700, Alex Williamson wrote: > > Thanks Konrad! Comments inline. > > On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote: > > On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote: > > > VFIO provides a secure, IOMMU based interface for user space > > > drivers, including device assignment to virtual machines. > > > This provides the base management of IOMMU groups, devices, > > > and IOMMU objects. See Documentation/vfio.txt included in > > > this patch for user and kernel API description. > > > > > > Note, this implements the new API discussed at KVM Forum > > > 2011, as represented by the drvier version 0.2. It's hoped > > > that this provides a modular enough interface to support PCI > > > and non-PCI userspace drivers across various architectures > > > and IOMMU implementations. > > > > > > Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx> > > > --- > > > > > > Fingers crossed, this is the last RFC for VFIO, but we need > > > the iommu group support before this can go upstream > > > (http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html), > > > hoping this helps push that along. > > > > > > Since the last posting, this version completely modularizes > > > the device backends and better defines the APIs between the > > > core VFIO code and the device backends. I expect that we > > > might also adopt a modular IOMMU interface as iommu_ops learns > > > about different types of hardware. Also many, many cleanups. > > > Check the complete git history for details: > > > > > > git://github.com/awilliam/linux-vfio.git vfio-ng > > > > > > (matching qemu tree: git://github.com/awilliam/qemu-vfio.git) > > > > > > This version, along with the supporting VFIO PCI backend can > > > be found here: > > > > > > git://github.com/awilliam/linux-vfio.git vfio-next-20111103 > > > > > > I've held off on implementing a kernel->user signaling > > > mechanism for now since the previous netlink version produced > > > too many gag reflexes. It's easy enough to set a bit in the > > > group flags too indicate such support in the future, so I > > > think we can move ahead without it. > > > > > > Appreciate any feedback or suggestions. Thanks, > > > > > > Alex > > > > > > Documentation/ioctl/ioctl-number.txt | 1 > > > Documentation/vfio.txt | 304 +++++++++ > > > MAINTAINERS | 8 > > > drivers/Kconfig | 2 > > > drivers/Makefile | 1 > > > drivers/vfio/Kconfig | 8 > > > drivers/vfio/Makefile | 3 > > > drivers/vfio/vfio_iommu.c | 530 ++++++++++++++++ > > > drivers/vfio/vfio_main.c | 1151 ++++++++++++++++++++++++++++++++++ > > > drivers/vfio/vfio_private.h | 34 + > > > include/linux/vfio.h | 155 +++++ > > > 11 files changed, 2197 insertions(+), 0 deletions(-) > > > create mode 100644 Documentation/vfio.txt > > > create mode 100644 drivers/vfio/Kconfig > > > create mode 100644 drivers/vfio/Makefile > > > create mode 100644 drivers/vfio/vfio_iommu.c > > > create mode 100644 drivers/vfio/vfio_main.c > > > create mode 100644 drivers/vfio/vfio_private.h > > > create mode 100644 include/linux/vfio.h > > > > > > diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt > > > index 54078ed..59d01e4 100644 > > > --- a/Documentation/ioctl/ioctl-number.txt > > > +++ b/Documentation/ioctl/ioctl-number.txt > > > @@ -88,6 +88,7 @@ Code Seq#(hex) Include File Comments > > > and kernel/power/user.c > > > '8' all SNP8023 advanced NIC card > > > <mailto:mcr@xxxxxxxxxxx> > > > +';' 64-76 linux/vfio.h > > > '@' 00-0F linux/radeonfb.h conflict! > > > '@' 00-0F drivers/video/aty/aty128fb.c conflict! > > > 'A' 00-1F linux/apm_bios.h conflict! > > > diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt > > > new file mode 100644 > > > index 0000000..5866896 > > > --- /dev/null > > > +++ b/Documentation/vfio.txt > > > @@ -0,0 +1,304 @@ > > > +VFIO - "Virtual Function I/O"[1] > > > +------------------------------------------------------------------------------- > > > +Many modern system now provide DMA and interrupt remapping facilities > > > +to help ensure I/O devices behave within the boundaries they've been > > > +allotted. This includes x86 hardware with AMD-Vi and Intel VT-d as > > > +well as POWER systems with Partitionable Endpoints (PEs) and even > > > +embedded powerpc systems (technology name unknown). The VFIO driver > > > +is an IOMMU/device agnostic framework for exposing direct device > > > +access to userspace, in a secure, IOMMU protected environment. In > > > +other words, this allows safe, non-privileged, userspace drivers. > > > + > > > +Why do we want that? Virtual machines often make use of direct device > > > +access ("device assignment") when configured for the highest possible > > > +I/O performance. From a device and host perspective, this simply turns > > > +the VM into a userspace driver, with the benefits of significantly > > > +reduced latency, higher bandwidth, and direct use of bare-metal device > > > +drivers[2]. > > > > Are there any constraints of running a 32-bit userspace with > > a 64-bit kernel and with 32-bit user space drivers? > > Shouldn't be. I'll need to do some testing on that, but it was working > on the previous generation of vfio. <nods> ok .. snip.. > > > +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64) > > > + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0) > > > +#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map) > > > +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map) > > > > Coherency support is not going to be addressed right? What about sync? > > Say you need to sync CPU to Device address? > > Do we need to expose that to userspace or should the underlying > iommu_ops take care of it? That I am not sure of. I know that the kernel drivers (especially network ones) are riddled with: pci_dma_sync_single_for_cpu(tp->pdev, dma_addr, len, PCI_DMA_FROMDEVICE); skb_copy_from_linear_data(skb, copy_skb->data, len); pci_dma_sync_single_for_device(tp->pdev, dma_addr, len, PCI_DMA_FROMDEVICE); But I think that has come from the fact that the devices are 32-bit so they could not do DMA above 4GB. Hence the bounce buffer usage and the proliferation of pci_dma_sync.. calls to copy the contents to a bounce buffer if neccessary. But IOMMUs seem to deal with devices that can map the full gamma of memory so they are not constrained to that 32-bit or 36-bit, but rather they do the mapping in hardware if neccessary. So I think I just answered the question - which is: No. .. snip.. > > > + __u64 vaddr; /* process virtual addr */ > > > + __u64 dmaaddr; /* desired and/or returned dma address */ > > > + __u64 size; /* size in bytes */ > > > + __u64 flags; > > > +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */ > > > +}; > > > + > > > +Current users of VFIO use relatively static DMA mappings, not requiring > > > +high frequency turnover. As new users are added, it's expected that the > > > > Is there a limit to how many DMA mappings can be created? > > Not that I'm aware of for the current AMD-Vi/VT-d implementations. I > suppose iommu_ops would return -ENOSPC if it hit a limit. I added the Not -ENOMEM? Either way, might want to mention that in this nice document. > VFIO_IOMMU_FLAGS_MAP_ANY flag above to try to identify that kind of > restriction. .. snip.. > > > +The GET_NUM_REGIONS ioctl tells us how many regions the device supports: > > > + > > > +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int) > > > > Don't want __u32? > > It could be, not sure if it buys us anything maybe even restricts us. > We likely don't need 2^32 regions (famous last words?), so we could > later define <0 to something? OK. > > > > + > > > +Regions are described by a struct vfio_region_info, which is retrieved by > > > +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to > > > +the desired region (0 based index). Note that devices may implement zero > > > > > +sized regions (vfio-pci does this to provide a 1:1 BAR to region index > > > +mapping). > > > > Huh? > > PCI has the following static mapping: > > enum { > VFIO_PCI_BAR0_REGION_INDEX, > VFIO_PCI_BAR1_REGION_INDEX, > VFIO_PCI_BAR2_REGION_INDEX, > VFIO_PCI_BAR3_REGION_INDEX, > VFIO_PCI_BAR4_REGION_INDEX, > VFIO_PCI_BAR5_REGION_INDEX, > VFIO_PCI_ROM_REGION_INDEX, > VFIO_PCI_CONFIG_REGION_INDEX, > VFIO_PCI_NUM_REGIONS > }; > > So 8 regions are always reported regardless of whether the device > implements all the BARs and the ROM. Then we have a fixed bar:index > mapping so we don't have to create a region_info field to describe the > bar number for the index. OK. Is that a problem if the real device actually has a zero sized BAR? Or is zero sized BAR in PCI spec equal to "disabled, not in use" ? Just wondering whether (-1ULL) should be used instead? (Which seems the case in QEMU code). > > > > + > > > +struct vfio_region_info { > > > + __u32 len; /* length of structure */ > > > + __u32 index; /* region number */ > > > + __u64 size; /* size in bytes of region */ > > > + __u64 offset; /* start offset of region */ > > > + __u64 flags; > > > +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0) > > > +#define VFIO_REGION_INFO_FLAG_RO (1 << 1) > > > +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2) > > > > What is FLAG_MMAP? Does it mean: 1) it can be mmaped, or 2) it is mmaped? > > Supports mmap > > > FLAG_RO is pretty obvious - presumarily this is for firmware regions and such. > > And PHYS_VALID is if the region is disabled for some reasons? If so > > would the name FLAG_DISABLED be better? > > No, POWER guys have some need to report the host physical address of the > region, so the flag indicates whether the below field is present and > valid. I'll clarify these in the docs. Thanks. .. snip.. > > > +struct vfio_irq_info { > > > + __u32 len; /* length of structure */ > > > + __u32 index; /* IRQ number */ > > > + __u32 count; /* number of individual IRQs */ > > > + __u64 flags; > > > +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0) > > > +}; > > > + > > > +Again, zero count entries are allowed (vfio-pci uses a static interrupt > > > +type to index mapping). > > > > I am not really sure what that means. > > This is so PCI can expose: > > enum { > VFIO_PCI_INTX_IRQ_INDEX, > VFIO_PCI_MSI_IRQ_INDEX, > VFIO_PCI_MSIX_IRQ_INDEX, > VFIO_PCI_NUM_IRQS > }; > > So like regions it always exposes 3 IRQ indexes where count=0 if the > device doesn't actually support that type of interrupt. I just want to > spell out that bus drivers have this kind of flexibility. I think you should change the comment that says 'IRQ number', as the first thing that comes in my mind is 'GSI' or MSI/MSI-x vector. Perhaps '/* index to be used with return value from GET_NUM_IRQS ioctl. Order of structures can be unsorted. */ > > > > + > > > +Information about each index can be retrieved using the GET_IRQ_INFO > > > +ioctl, used much like GET_REGION_INFO. > > > + > > > +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info) > > > + > > > +Individual indexes can describe single or sets of IRQs. This provides the > > > +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface. > > > + > > > +All VFIO interrupts are signaled to userspace via eventfds. Integer arrays, > > > +as shown below, are used to pass the IRQ info index, the number of eventfds, > > > +and each eventfd to be signaled. Using a count of 0 disables the interrupt. > > > + > > > +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */ > > > > Are eventfds u64 or u32? > > int, they're just file descriptors > > > Why not just define a structure? > > struct vfio_irq_eventfds { > > __u32 index; > > __u32 count; > > __u64 eventfds[0] > > }; > > We could do that if preferred. Hmm, are we then going to need > size/flags? Sure. > > > How do you get an eventfd to feed in here? > > eventfd(2), in qemu event_notifier_init() -> event_notifier_get_fd() > > > > +#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IOW(';', 113, int) > > > > u32? > > Not here, it's an fd, so should be an int. > > > > + > > > +When a level triggered interrupt is signaled, the interrupt is masked > > > +on the host. This prevents an unresponsive userspace driver from > > > +continuing to interrupt the host system. After servicing the interrupt, > > > +UNMASK_IRQ is used to allow the interrupt to retrigger. Note that level > > > +triggered interrupts implicitly have a count of 1 per index. > > > > So they are enabled automatically? Meaning you don't even hav to do > > SET_IRQ_EVENTFDS b/c the count is set to 1? > > I suppose that should be "no more than 1 per index" (ie. PCI would > report a count of 0 for VFIO_PCI_INTX_IRQ_INDEX if the device doesn't > support INTx). I think you might be confusing VFIO_DEVICE_GET_IRQ_INFO > which tells how many are available with VFIO_DEVICE_SET_IRQ_EVENTFDS > which does the enabling/disabling. All interrupts are disabled by > default because userspace needs to give us a way to signal them via > eventfds. It will be device dependent whether multiple index can be > enabled simultaneously. Hmm, is that another flag on the irq_info > struct or do we expect drivers to implicitly have that kind of > knowledge? Right, that was what I was wondering. Not sure how the PowerPC world works with this. > > > > + > > > +/* Unmask IRQ index, arg[0] = index */ > > > +#define VFIO_DEVICE_UNMASK_IRQ _IOW(';', 114, int) > > > > So this is for MSI as well? So if I've an index = 1, with count = 4, > > and doing unmaks IRQ will chip enable all the MSI event at once? > > No, this is only for re-enabling level triggered interrupts as discussed > above. Edge triggered interrupts like MSI don't need an unmask... we > may want to do something to accelerate the MSI-X table access for > masking specific interrupts, but I figured that would need to be PCI > aware since those are PCI features, and would therefore be some future > extension of the PCI bus driver and exposed via VFIO_DEVICE_GET_FLAGS. OK. > > > I guess there is not much point in enabling/disabling selective MSI > > IRQs.. > > Some older OSes are said to make extensive use of masking for MSI, so we > probably want this at some point. I'm assuming future PCI extension for > now. > > > > + > > > +Level triggered interrupts can also be unmasked using an irqfd. Use > > > > irqfd or eventfd? > > irqfd is an eventfd in reverse. eventfd = kernel signals userspace via > an fd, irqfd = userspace signals kernel via an fd. Ah neat. > > > > +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this. > > > > So only level triggered? Hmm, how do I know whether the device is > > level or edge? Or is that edge (MSI) can also be unmaked using the > > eventfs > > Yes, only for level. Isn't a device going to know what type of > interrupt it uses? MSI masking is PCI specific, not handled by this. I certainly hope it knows, but you know buggy drivers do exist. What would be the return value if somebody tried to unmask an edge one? Should that be documented here? -ENOSPEC? > > > > + > > > +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */ > > > +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(';', 115, int) > > > + > > > +When supported, as indicated by the device flags, reset the device. > > > + > > > +#define VFIO_DEVICE_RESET _IO(';', 116) > > > > Does it disable the 'count'? Err, does it disable the IRQ on the > > device after this and one should call VFIO_DEVICE_SET_IRQ_EVENTFDS > > to set new eventfds? Or does it re-use the eventfds and the device > > is enabled after this? > > It doesn't affect the interrupt programming. Should it? I would hope not, but I am trying to think of ways one could screw this up. Perhaps just saying that - "No need to call VFIO_DEVICE_SET_IRQ_EVENTFDS as the kernel (and the device) will retain the interrupt.". .. snip.. > > I am not really sure what this section purpose is? Could this be part > > of the header file or the code? It does not look to be part of the > > ioctl API? > > We've passed into the "VFIO bus driver API" section of the document, to > explain the interaction between vfio-core and vfio bus drivers. Perhaps a different file? .. large snip .. > > > + > > > + mutex_lock(&iommu->dgate); > > > + list_for_each_safe(pos, pos2, &iommu->dm_list) { > > > + mlp = list_entry(pos, struct dma_map_page, list); > > > + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr); > > > > Uh, so if it did not get put_page() we would try to still delete it? > > Couldn't that lead to corruption as the 'mlp' is returned to the poll? > > > > Ah wait, the put_page is on the DMA page, so it is OK to > > delete the tracking structure. It will be just a leaked page. > > Assume you're referencing this chunk: > > vfio_dma_unmap > __vfio_dma_unmap > ... > pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT; > if (pfn) { > iommu_unmap(iommu->domain, iova, 0); > unlocked += put_pfn(pfn, rdwr); > } > > So we skip things that aren't mapped in the iommu, but anything not > mapped should have already been put (failed vfio_dma_map). If it is > mapped, we put it if we originally got it via get_user_pages_fast. > unlocked would only not get incremented here if it was an mmap'd page > (such as the mmap of an mmio space of another vfio device), via the code > in vaddr_get_pfn (stolen from KVM). Yup. Sounds right. .. snip.. > > > +module_param(allow_unsafe_intrs, int, 0); > > > > S_IRUGO ? > > I actually intended that to be S_IRUGO | S_IWUSR just like the kvm > parameter so it can be toggled runtime. OK. > > > > +MODULE_PARM_DESC(allow_unsafe_intrs, > > > + "Allow use of IOMMUs which do not support interrupt remapping"); > > > + > > > +static struct vfio { > > > + dev_t devt; > > > + struct cdev cdev; > > > + struct list_head group_list; > > > + struct mutex lock; > > > + struct kref kref; > > > + struct class *class; > > > + struct idr idr; > > > + wait_queue_head_t release_q; > > > +} vfio; > > > > You probably want to move this below the 'vfio_group' > > as vfio contains the vfio_group. > > Only via the group_list. Are you suggesting for readability or to avoid > forward declarations (which we don't need between these two with current > ordering). Just for readability. > > > > + > > > +static const struct file_operations vfio_group_fops; > > > +extern const struct file_operations vfio_iommu_fops; > > > + > > > +struct vfio_group { > > > + dev_t devt; > > > + unsigned int groupid; > > > + struct bus_type *bus; > > > + struct vfio_iommu *iommu; > > > + struct list_head device_list; > > > + struct list_head iommu_next; > > > + struct list_head group_next; > > > + int refcnt; > > > +}; > > > + > > > +struct vfio_device { > > > + struct device *dev; > > > + const struct vfio_device_ops *ops; > > > + struct vfio_iommu *iommu; > > > + struct vfio_group *group; > > > + struct list_head device_next; > > > + bool attached; > > > + int refcnt; > > > + void *device_data; > > > +}; > > > > And perhaps move this above vfio_group. As vfio_group > > contains a list of these structures? > > These are inter-linked, so chicken and egg. The current ordering is > more function based than definition based. struct vfio is the highest > level object, groups are next, iommus and devices are next, but we need > to share iommus with the other file, so that lands in the header. Ah, OK. > > > > + > > > +/* > > > + * Helper functions called under vfio.lock > > > + */ > > > + > > > +/* Return true if any devices within a group are opened */ > > > +static bool __vfio_group_devs_inuse(struct vfio_group *group) > > > +{ > > > + struct list_head *pos; > > > + > > > + list_for_each(pos, &group->device_list) { > > > + struct vfio_device *device; > > > + > > > + device = list_entry(pos, struct vfio_device, device_next); > > > + if (device->refcnt) > > > + return true; > > > + } > > > + return false; > > > +} > > > + > > > +/* Return true if any of the groups attached to an iommu are opened. > > > + * We can only tear apart merged groups when nothing is left open. */ > > > +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu) > > > +{ > > > + struct list_head *pos; > > > + > > > + list_for_each(pos, &iommu->group_list) { > > > + struct vfio_group *group; > > > + > > > + group = list_entry(pos, struct vfio_group, iommu_next); > > > + if (group->refcnt) > > > + return true; > > > + } > > > + return false; > > > +} > > > + > > > +/* An iommu is "in use" if it has a file descriptor open or if any of > > > + * the groups assigned to the iommu have devices open. */ > > > +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu) > > > +{ > > > + struct list_head *pos; > > > + > > > + if (iommu->refcnt) > > > + return true; > > > + > > > + list_for_each(pos, &iommu->group_list) { > > > + struct vfio_group *group; > > > + > > > + group = list_entry(pos, struct vfio_group, iommu_next); > > > + > > > + if (__vfio_group_devs_inuse(group)) > > > + return true; > > > + } > > > + return false; > > > +} > > > + > > > +static void __vfio_group_set_iommu(struct vfio_group *group, > > > + struct vfio_iommu *iommu) > > > +{ > > > + struct list_head *pos; > > > + > > > + if (group->iommu) > > > + list_del(&group->iommu_next); > > > + if (iommu) > > > + list_add(&group->iommu_next, &iommu->group_list); > > > + > > > + group->iommu = iommu; > > > + > > > + list_for_each(pos, &group->device_list) { > > > + struct vfio_device *device; > > > + > > > + device = list_entry(pos, struct vfio_device, device_next); > > > + device->iommu = iommu; > > > + } > > > +} > > > + > > > +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu, > > > + struct vfio_device *device) > > > +{ > > > + BUG_ON(!iommu->domain && device->attached); > > > > Whoa. Heavy hammer there. > > > > Perhaps WARN_ON as you do check it later on. > > I think it's warranted, internal consistency is broken if we have a > device that thinks it's attached to an iommu domain that doesn't exist. > It should, of course, never happen and this isn't a performance path. > > > > + > > > + if (!iommu->domain || !device->attached) > > > + return; Well, the deal is that you BUG_ON earlier, but you check for it here. But the BUG_ON will stop execution , so the check 'if ..' is actually not needed. > > > + > > > + iommu_detach_device(iommu->domain, device->dev); > > > + device->attached = false; > > > +} > > > + > > > +static void __vfio_iommu_detach_group(struct vfio_iommu *iommu, > > > + struct vfio_group *group) > > > +{ > > > + struct list_head *pos; > > > + > > > + list_for_each(pos, &group->device_list) { > > > + struct vfio_device *device; > > > + > > > + device = list_entry(pos, struct vfio_device, device_next); > > > + __vfio_iommu_detach_dev(iommu, device); > > > + } > > > +} > > > + > > > +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu, > > > + struct vfio_device *device) > > > +{ > > > + int ret; > > > + > > > + BUG_ON(device->attached); > > > > How about: > > > > WARN_ON(device->attached, "The engineer who wrote the user-space device driver is trying to register > > the device again! Tell him/her to stop please.\n"); > > I would almost demote this one to a WARN_ON, but userspace isn't in > control of attaching and detaching devices from the iommu. That's a > side effect of getting the iommu or device file descriptor. So again, > this is an internal consistency check and it should never happen, > regardless of userspace. > Ok, then you might want to expand it to BUG_ON(!device || device->attached); In case something has gone horribly wrong. .. snip.. > > > + group->devt = MKDEV(MAJOR(vfio.devt), minor); > > > + device_create(vfio.class, NULL, group->devt, > > > + group, "%u", groupid); > > > + > > > + group->bus = dev->bus; > > > > > > Oh, so that is how the IOMMU iommu_ops get copied! You might > > want to mention that - I was not sure where the 'handoff' is > > was done to insert a device so that it can do iommu_ops properly. > > > > Ok, so the time when a device is detected whether it can do > > IOMMU is when we try to open it - as that is when iommu_domain_alloc > > is called which can return NULL if the iommu_ops is not set. > > > > So what about devices that don't have an iommu_ops? Say they > > are using SWIOTLB? (like the AMD-Vi sometimes does if the > > device is not on its list). > > > > Can we use iommu_present? > > I'm not sure I'm following your revelation ;) Take a look at the I am trying to figure out who sets the iommu_ops call on the devices. > pointer to iommu_device_group I pasted above, or these: > > https://github.com/awilliam/linux-vfio/commit/37dd08c90d149caaed7779d4f38850a8f7ed0fa5 > https://github.com/awilliam/linux-vfio/commit/63ca8543533d8130db23d7949133e548c3891c97 > https://github.com/awilliam/linux-vfio/commit/8d7d70eb8e714fbf8710848a06f8cab0c741631e > > That call includes an iommu_present() check, so if there's no iommu or > the iommu can't provide a groupid, the device is skipped over from vfio > (can't be used). > > So the ordering is: > > - bus driver registers device > - if it has an iommu group, add it to the vfio device/group tracking > > - group gets opened > - user gets iommu or device fd results in iommu_domain_alloc > > Devices without iommu_ops don't get to play in the vfio world. Right, and I think the answer of which devices get iommu_ops is done via bus_set_iommu. (Thinking in long-term of what would be required to make this work with Xen and it sounds like I will need to implement a Xen IOMMU driver) .. snip.. > > > > So where is the vfio-pci? Is that a seperate posting? > > You can find it in the tree pointed to in the patch description: > > https://github.com/awilliam/linux-vfio/commit/534725d327e2b7791a229ce72d2ae8a62ee0a4e5 Thanks. > > I was hoping to get some consensus around the new core before spending > too much time polishing up the bus driver. Thanks for the review, it's > very much appreciated! Sure thing. > > Alex > -- To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html