On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote: > VFIO provides a secure, IOMMU based interface for user space > drivers, including device assignment to virtual machines. > This provides the base management of IOMMU groups, devices, > and IOMMU objects. See Documentation/vfio.txt included in > this patch for user and kernel API description. > > Note, this implements the new API discussed at KVM Forum > 2011, as represented by the drvier version 0.2. It's hoped > that this provides a modular enough interface to support PCI > and non-PCI userspace drivers across various architectures > and IOMMU implementations. > > Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx> > --- > > Fingers crossed, this is the last RFC for VFIO, but we need > the iommu group support before this can go upstream > (http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html), > hoping this helps push that along. > > Since the last posting, this version completely modularizes > the device backends and better defines the APIs between the > core VFIO code and the device backends. I expect that we > might also adopt a modular IOMMU interface as iommu_ops learns > about different types of hardware. Also many, many cleanups. > Check the complete git history for details: > > git://github.com/awilliam/linux-vfio.git vfio-ng > > (matching qemu tree: git://github.com/awilliam/qemu-vfio.git) > > This version, along with the supporting VFIO PCI backend can > be found here: > > git://github.com/awilliam/linux-vfio.git vfio-next-20111103 > > I've held off on implementing a kernel->user signaling > mechanism for now since the previous netlink version produced > too many gag reflexes. It's easy enough to set a bit in the > group flags too indicate such support in the future, so I > think we can move ahead without it. > > Appreciate any feedback or suggestions. Thanks, > > Alex > > Documentation/ioctl/ioctl-number.txt | 1 > Documentation/vfio.txt | 304 +++++++++ > MAINTAINERS | 8 > drivers/Kconfig | 2 > drivers/Makefile | 1 > drivers/vfio/Kconfig | 8 > drivers/vfio/Makefile | 3 > drivers/vfio/vfio_iommu.c | 530 ++++++++++++++++ > drivers/vfio/vfio_main.c | 1151 ++++++++++++++++++++++++++++++++++ > drivers/vfio/vfio_private.h | 34 + > include/linux/vfio.h | 155 +++++ > 11 files changed, 2197 insertions(+), 0 deletions(-) > create mode 100644 Documentation/vfio.txt > create mode 100644 drivers/vfio/Kconfig > create mode 100644 drivers/vfio/Makefile > create mode 100644 drivers/vfio/vfio_iommu.c > create mode 100644 drivers/vfio/vfio_main.c > create mode 100644 drivers/vfio/vfio_private.h > create mode 100644 include/linux/vfio.h > > diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt > index 54078ed..59d01e4 100644 > --- a/Documentation/ioctl/ioctl-number.txt > +++ b/Documentation/ioctl/ioctl-number.txt > @@ -88,6 +88,7 @@ Code Seq#(hex) Include File Comments > and kernel/power/user.c > '8' all SNP8023 advanced NIC card > <mailto:mcr@xxxxxxxxxxx> > +';' 64-76 linux/vfio.h > '@' 00-0F linux/radeonfb.h conflict! > '@' 00-0F drivers/video/aty/aty128fb.c conflict! > 'A' 00-1F linux/apm_bios.h conflict! 
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt > new file mode 100644 > index 0000000..5866896 > --- /dev/null > +++ b/Documentation/vfio.txt > @@ -0,0 +1,304 @@ > +VFIO - "Virtual Function I/O"[1] > +------------------------------------------------------------------------------- > +Many modern system now provide DMA and interrupt remapping facilities > +to help ensure I/O devices behave within the boundaries they've been > +allotted. This includes x86 hardware with AMD-Vi and Intel VT-d as > +well as POWER systems with Partitionable Endpoints (PEs) and even > +embedded powerpc systems (technology name unknown). The VFIO driver > +is an IOMMU/device agnostic framework for exposing direct device > +access to userspace, in a secure, IOMMU protected environment. In > +other words, this allows safe, non-privileged, userspace drivers. > + > +Why do we want that? Virtual machines often make use of direct device > +access ("device assignment") when configured for the highest possible > +I/O performance. From a device and host perspective, this simply turns > +the VM into a userspace driver, with the benefits of significantly > +reduced latency, higher bandwidth, and direct use of bare-metal device > +drivers[2]. Are there any constraints of running a 32-bit userspace with a 64-bit kernel and with 32-bit user space drivers? > + > +Some applications, particularly in the high performance computing > +field, also benefit from low-overhead, direct device access from > +userspace. Examples include network adapters (often non-TCP/IP based) > +and compute accelerators. Previous to VFIO, these drivers needed to > +go through the full development cycle to become proper upstream driver, > +be maintained out of tree, or make use of the UIO framework, which > +has no notion of IOMMU protection, limited interrupt support, and > +requires root privileges to access things like PCI configuration space. > + > +The VFIO driver framework intends to unify these, replacing both the > +KVM PCI specific device assignment currently used as well as provide > +a more secure, more featureful userspace driver environment than UIO. > + > +Groups, Devices, IOMMUs, oh my <chuckles> oh my, eh? > +------------------------------------------------------------------------------- > + > +A fundamental component of VFIO is the notion of IOMMU groups. IOMMUs > +can't always distinguish transactions from each individual device in > +the system. Sometimes this is because of the IOMMU design, such as with > +PEs, other times it's caused by the I/O topology, for instance a > +PCIe-to-PCI bridge masking all devices behind it. We call the sets of > +devices created by these restictions IOMMU groups (or just "groups" for > +this document). > + > +The IOMMU cannot distiguish transactions between the individual devices > +within the group, therefore the group is the basic unit of ownership for > +a userspace process. Because of this, groups are also the primary > +interface to both devices and IOMMU domains in VFIO. > + > +The VFIO representation of groups is created as devices are added into > +the framework by a VFIO bus driver. The vfio-pci module is an example > +of a bus driver. This module registers devices along with a set of bus > +specific callbacks with the VFIO core. These callbacks provide the > +interfaces later used for device access. As each new group is created, > +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP > +character device. 
> + > +In addition to the device enumeration and callbacks, the VFIO bus driver > +also provides a traditional device driver and is able to bind to devices > +on it's bus. When a device is bound to the bus driver it's available to > +VFIO. When all the devices within a group are bound to their bus drivers, > +the group becomes "viable" and a user with sufficient access to the VFIO > +group chardev can obtain exclusive access to the set of group devices. > + > +As documented in linux/vfio.h, several ioctls are provided on the > +group chardev: > + > +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64) > + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0) > + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1) > +#define VFIO_GROUP_MERGE _IOW(';', 101, int) > +#define VFIO_GROUP_UNMERGE _IOW(';', 102, int) > +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103) > +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *) > + > +The last two ioctls return new file descriptors for accessing > +individual devices within the group and programming the IOMMU. Each of > +these new file descriptors provide their own set of file interfaces. > +These ioctls will fail if any of the devices within the group are not > +bound to their VFIO bus driver. Additionally, when either of these > +interfaces are used, the group is then bound to the struct_mm of the > +caller. The GET_FLAGS ioctl can be used to view the state of the group. > + > +When either the GET_IOMMU_FD or GET_DEVICE_FD ioctls are invoked, a > +new IOMMU domain is created and all of the devices in the group are > +attached to it. This is the only way to ensure full IOMMU isolation > +of the group, but potentially wastes resources and cycles if the user > +intends to manage multiple groups with the same set of IOMMU mappings. > +VFIO therefore provides a group MERGE and UNMERGE interface, which > +allows multiple groups to share an IOMMU domain. Not all IOMMUs allow > +arbitrary groups to be merged, so the user should assume merging is > +opportunistic. A new group, with no open device or IOMMU file > +descriptors, can be merged into an existing, in-use, group using the > +MERGE ioctl. A merged group can be unmerged using the UNMERGE ioctl > +once all of the device file descriptors for the group being merged > +"out" are closed. > + > +When groups are merged, the GET_IOMMU_FD and GET_DEVICE_FD ioctls are > +essentially fungible between group file descriptors (ie. if device A > +is in group X, and X is merged with Y, a file descriptor for A can be > +retrieved using GET_DEVICE_FD on Y. Likewise, GET_IOMMU_FD returns a > +file descriptor referencing the same internal IOMMU object from either > +X or Y). Merged groups can be dissolved either explictly with UNMERGE > +or automatically when ALL file descriptors for the merged group are > +closed (all IOMMUs, all devices, all groups). > + > +The IOMMU file descriptor provides this set of ioctls: > + > +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64) > + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0) > +#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map) > +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map) Coherency support is not going to be addressed right? What about sync? Say you need to sync CPU to Device address? > + > +The GET_FLAGS ioctl returns basic information about the IOMMU domain. > +We currently only support IOMMU domains that are able to map any > +virtual address to any IOVA. This is indicated by the MAP_ANY flag. 
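Just to check my understanding of the flow above, here is a minimal userspace sketch (not part of the patch; the group number and device name are made up, and I'm assuming the linux/vfio.h definitions quoted above are available to userspace):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int main(void)
{
	uint64_t flags;
	int group, iommu, device;

	group = open("/dev/vfio/26", O_RDWR);	/* group number is illustrative */
	if (group < 0)
		return 1;

	if (ioctl(group, VFIO_GROUP_GET_FLAGS, &flags) < 0)
		return 1;

	if (!(flags & VFIO_GROUP_FLAGS_VIABLE)) {
		/* some device in the group isn't bound to its vfio bus driver yet */
		fprintf(stderr, "group not viable\n");
		return 1;
	}

	/* either of these binds the group to this process's mm */
	iommu = ioctl(group, VFIO_GROUP_GET_IOMMU_FD);
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
	if (iommu < 0 || device < 0)
		return 1;

	return 0;
}

If that's roughly right, something like it in the documentation would help first-time users of the group chardev.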
> + > +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping > +and unmapping IOVAs to process virtual addresses: > + > +struct vfio_dma_map { > + __u64 len; /* length of structure */ What is the purpose of the 'len' field? Is it to guard against future version changes? > + __u64 vaddr; /* process virtual addr */ > + __u64 dmaaddr; /* desired and/or returned dma address */ > + __u64 size; /* size in bytes */ > + __u64 flags; > +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */ > +}; > + > +Current users of VFIO use relatively static DMA mappings, not requiring > +high frequency turnover. As new users are added, it's expected that the Is there a limit to how many DMA mappings can be created? > +IOMMU file descriptor will evolve to support new mapping interfaces, this > +will be reflected in the flags and may present new ioctls and file > +interfaces. > + > +The device GET_FLAGS ioctl is intended to return basic device type and > +indicate support for optional capabilities. Flags currently include whether > +the device is PCI or described by Device Tree, and whether the RESET ioctl > +is supported: And reset in terms of PCIe spec is the FLR? > + > +#define VFIO_DEVICE_GET_FLAGS _IOR(';', 108, __u64) > + #define VFIO_DEVICE_FLAGS_PCI (1 << 0) > + #define VFIO_DEVICE_FLAGS_DT (1 << 1) > + #define VFIO_DEVICE_FLAGS_RESET (1 << 2) > + > +The MMIO and IOP resources used by a device are described by regions. IOP? > +The GET_NUM_REGIONS ioctl tells us how many regions the device supports: > + > +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int) Don't want __u32? > + > +Regions are described by a struct vfio_region_info, which is retrieved by > +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to > +the desired region (0 based index). Note that devices may implement zero > +sized regions (vfio-pci does this to provide a 1:1 BAR to region index > +mapping). Huh? > + > +struct vfio_region_info { > + __u32 len; /* length of structure */ > + __u32 index; /* region number */ > + __u64 size; /* size in bytes of region */ > + __u64 offset; /* start offset of region */ > + __u64 flags; > +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0) > +#define VFIO_REGION_INFO_FLAG_RO (1 << 1) > +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2) What is FLAG_MMAP? Does it mean: 1) it can be mmaped, or 2) it is mmaped? FLAG_RO is pretty obvious - presumarily this is for firmware regions and such. And PHYS_VALID is if the region is disabled for some reasons? If so would the name FLAG_DISABLED be better? > + __u64 phys; /* physical address of region */ > +}; > + > +#define VFIO_DEVICE_GET_REGION_INFO _IOWR(';', 110, struct vfio_region_info) > + > +The offset indicates the offset into the device file descriptor which > +accesses the given range (for read/write/mmap/seek). Flags indicate the > +available access types and validity of optional fields. For instance > +the phys field may only be valid for certain devices types. > + > +Interrupts are described using a similar interface. GET_NUM_IRQS > +reports the number or IRQ indexes for the device. > + > +#define VFIO_DEVICE_GET_NUM_IRQS _IOR(';', 111, int) _u32? > + > +struct vfio_irq_info { > + __u32 len; /* length of structure */ > + __u32 index; /* IRQ number */ > + __u32 count; /* number of individual IRQs */ > + __u64 flags; > +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0) > +}; > + > +Again, zero count entries are allowed (vfio-pci uses a static interrupt > +type to index mapping). I am not really sure what that means. 
> + > +Information about each index can be retrieved using the GET_IRQ_INFO > +ioctl, used much like GET_REGION_INFO. > + > +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info) > + > +Individual indexes can describe single or sets of IRQs. This provides the > +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface. > + > +All VFIO interrupts are signaled to userspace via eventfds. Integer arrays, > +as shown below, are used to pass the IRQ info index, the number of eventfds, > +and each eventfd to be signaled. Using a count of 0 disables the interrupt. > + > +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */ Are eventfds u64 or u32? Why not just define a structure? struct vfio_irq_eventfds { __u32 index; __u32 count; __u64 eventfds[0] }; How do you get an eventfd to feed in here? > +#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IOW(';', 113, int) u32? > + > +When a level triggered interrupt is signaled, the interrupt is masked > +on the host. This prevents an unresponsive userspace driver from > +continuing to interrupt the host system. After servicing the interrupt, > +UNMASK_IRQ is used to allow the interrupt to retrigger. Note that level > +triggered interrupts implicitly have a count of 1 per index. So they are enabled automatically? Meaning you don't even hav to do SET_IRQ_EVENTFDS b/c the count is set to 1? > + > +/* Unmask IRQ index, arg[0] = index */ > +#define VFIO_DEVICE_UNMASK_IRQ _IOW(';', 114, int) So this is for MSI as well? So if I've an index = 1, with count = 4, and doing unmaks IRQ will chip enable all the MSI event at once? I guess there is not much point in enabling/disabling selective MSI IRQs.. > + > +Level triggered interrupts can also be unmasked using an irqfd. Use irqfd or eventfd? > +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this. So only level triggered? Hmm, how do I know whether the device is level or edge? Or is that edge (MSI) can also be unmaked using the eventfs > + > +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */ > +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(';', 115, int) > + > +When supported, as indicated by the device flags, reset the device. > + > +#define VFIO_DEVICE_RESET _IO(';', 116) Does it disable the 'count'? Err, does it disable the IRQ on the device after this and one should call VFIO_DEVICE_SET_IRQ_EVENTFDS to set new eventfds? Or does it re-use the eventfds and the device is enabled after this? > + > +Device tree devices also invlude ioctls for further defining the include > +device tree properties of the device: > + > +struct vfio_dtpath { > + __u32 len; /* length of structure */ > + __u32 index; 0 based I presume? > + __u64 flags; > +#define VFIO_DTPATH_FLAGS_REGION (1 << 0) What is region in this context?? Or would this make much more sense if I knew what Device Tree actually is. > +#define VFIO_DTPATH_FLAGS_IRQ (1 << 1) > + char *path; Ah, now I see why you want 'len' here.. But I am still at loss why you want that with the other structures. > +}; > +#define VFIO_DEVICE_GET_DTPATH _IOWR(';', 117, struct vfio_dtpath) > + > +struct vfio_dtindex { > + __u32 len; /* length of structure */ > + __u32 index; > + __u32 prop_type; Is that an enum type? Is this definied somewhere? > + __u32 prop_index; What is the purpose of this field? 
> + __u64 flags; > +#define VFIO_DTINDEX_FLAGS_REGION (1 << 0) > +#define VFIO_DTINDEX_FLAGS_IRQ (1 << 1) > +}; > +#define VFIO_DEVICE_GET_DTINDEX _IOWR(';', 118, struct vfio_dtindex) > + > + > +VFIO bus driver API > +------------------------------------------------------------------------------- > + > +Bus drivers, such as PCI, have three jobs: > + 1) Add/remove devices from vfio > + 2) Provide vfio_device_ops for device access > + 3) Device binding and unbinding suspend/resume? > + > +When initialized, the bus driver should enumerate the devices on it's > +bus and call vfio_group_add_dev() for each device. If the bus supports > +hotplug, notifiers should be enabled to track devices being added and > +removed. vfio_group_del_dev() removes a previously added device from > +vfio. > + > +Adding a device registers a vfio_device_ops function pointer structure > +for the device: Huh? So this gets created for _every_ 'struct device' that is added the VFIO bus? Is this structure exposed? Or is this an internal one? > + > +struct vfio_device_ops { > + bool (*match)(struct device *, char *); > + int (*get)(void *); > + void (*put)(void *); > + ssize_t (*read)(void *, char __user *, > + size_t, loff_t *); > + ssize_t (*write)(void *, const char __user *, > + size_t, loff_t *); > + long (*ioctl)(void *, unsigned int, unsigned long); > + int (*mmap)(void *, struct vm_area_struct *); > +}; > + > +When a device is bound to the bus driver, the bus driver indicates this > +to vfio using the vfio_bind_dev() interface. The device_data parameter Might want to paste the function decleration for it.. b/c I am not sure where the 'device_data' parameter is on the argument list. > +is a pointer to an opaque data structure for use only by the bus driver. > +The get, put, read, write, ioctl, and mmap vfio_device_ops all pass > +this data structure back to the bus driver. When a device is unbound Oh, so it is on the 'void *'. > +from the bus driver, the vfio_unbind_dev() interface signals this to > +vfio. This function returns the pointer to the device_data structure That function > +registered for the device. I am not really sure what this section purpose is? Could this be part of the header file or the code? It does not look to be part of the ioctl API? > + > +As noted previously, a group contains one or more devices, so > +GROUP_GET_DEVICE_FD needs to identify the specific device being requested. > +The vfio_device_ops.match callback is used to allow bus drivers to determine > +the match. For drivers like vfio-pci, it's a simple match to dev_name(), > +which is unique in the system due to the PCI bus topology, other bus drivers > +may need to include parent devices to create a unique match, so this is > +left as a bus driver interface. > + > +------------------------------------------------------------------------------- > + > +[1] VFIO was originally an acronym for "Virtual Function I/O" in it's > +initial implementation by Tom Lyon while as Cisco. We've since outgrown > +the acronym, but it's catchy. > + > +[2] As always there are trade-offs to virtual machine device > +assignment that are beyond the scope of VFIO. It's expected that > +future IOMMU technologies will reduce some, but maybe not all, of > +these trade-offs. 
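On the interrupt ioctls earlier in the document, this is the usage I'm picturing -- again only a sketch, assuming the plain int-array argument described above and using index 0 for a level triggered (INTx-style) interrupt:

#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int enable_irq_index0(int device)
{
	int args[3];
	int efd = eventfd(0, 0);

	if (efd < 0)
		return -1;

	args[0] = 0;	/* IRQ index */
	args[1] = 1;	/* one eventfd; 0 would disable the interrupt */
	args[2] = efd;	/* fd signaled when the interrupt fires */

	if (ioctl(device, VFIO_DEVICE_SET_IRQ_EVENTFDS, args) < 0)
		return -1;

	/* for level triggered interrupts, re-arm after servicing: */
	/* ioctl(device, VFIO_DEVICE_UNMASK_IRQ, &args[0]); */
	return efd;
}

And for the bus driver API section, this is the shape I think a backend ends up with -- the my_* names are invented, error handling is omitted, and only the match callback is filled in:

#include <linux/device.h>
#include <linux/string.h>
#include <linux/vfio.h>

static bool my_match(struct device *dev, char *buf)
{
	return !strcmp(dev_name(dev), buf);	/* unique for PCI */
}

static const struct vfio_device_ops my_vfio_ops = {
	.match	= my_match,
	/* .get/.put/.read/.write/.ioctl/.mmap provide the device access paths */
};

/* on device discovery (or hotplug add) */
static int my_add(struct device *dev)
{
	return vfio_group_add_dev(dev, &my_vfio_ops);
}

/* from the bus driver's probe(), once it owns the device */
static int my_probe(struct device *dev, void *my_private)
{
	return vfio_bind_dev(dev, my_private);	/* group may now become viable */
}

If that's right, a skeleton like this in the doc (or a dummy backend under Documentation/) would help people writing new bus drivers.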
> diff --git a/MAINTAINERS b/MAINTAINERS > index f05f5f6..4bd5aa0 100644 > --- a/MAINTAINERS > +++ b/MAINTAINERS > @@ -7106,6 +7106,14 @@ S: Maintained > F: Documentation/filesystems/vfat.txt > F: fs/fat/ > > +VFIO DRIVER > +M: Alex Williamson <alex.williamson@xxxxxxxxxx> > +L: kvm@xxxxxxxxxxxxxxx No vfio mailing list? Or a vfio-mailing list? > +S: Maintained > +F: Documentation/vfio.txt > +F: drivers/vfio/ > +F: include/linux/vfio.h > + > VIDEOBUF2 FRAMEWORK > M: Pawel Osciak <pawel@xxxxxxxxxx> > M: Marek Szyprowski <m.szyprowski@xxxxxxxxxxx> > diff --git a/drivers/Kconfig b/drivers/Kconfig > index b5e6f24..e15578b 100644 > --- a/drivers/Kconfig > +++ b/drivers/Kconfig > @@ -112,6 +112,8 @@ source "drivers/auxdisplay/Kconfig" > > source "drivers/uio/Kconfig" > > +source "drivers/vfio/Kconfig" > + > source "drivers/vlynq/Kconfig" > > source "drivers/virtio/Kconfig" > diff --git a/drivers/Makefile b/drivers/Makefile > index 1b31421..5f138b5 100644 > --- a/drivers/Makefile > +++ b/drivers/Makefile > @@ -58,6 +58,7 @@ obj-$(CONFIG_ATM) += atm/ > obj-$(CONFIG_FUSION) += message/ > obj-y += firewire/ > obj-$(CONFIG_UIO) += uio/ > +obj-$(CONFIG_VFIO) += vfio/ > obj-y += cdrom/ > obj-y += auxdisplay/ > obj-$(CONFIG_PCCARD) += pcmcia/ > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig > new file mode 100644 > index 0000000..9acb1e7 > --- /dev/null > +++ b/drivers/vfio/Kconfig > @@ -0,0 +1,8 @@ > +menuconfig VFIO > + tristate "VFIO Non-Privileged userspace driver framework" > + depends on IOMMU_API > + help > + VFIO provides a framework for secure userspace device drivers. > + See Documentation/vfio.txt for more details. > + > + If you don't know what to do here, say N. > diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile > new file mode 100644 > index 0000000..088faf1 > --- /dev/null > +++ b/drivers/vfio/Makefile > @@ -0,0 +1,3 @@ > +vfio-y := vfio_main.o vfio_iommu.o > + > +obj-$(CONFIG_VFIO) := vfio.o > diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c > new file mode 100644 > index 0000000..029dae3 > --- /dev/null > +++ b/drivers/vfio/vfio_iommu.c > @@ -0,0 +1,530 @@ > +/* > + * VFIO: IOMMU DMA mapping support > + * > + * Copyright (C) 2011 Red Hat, Inc. All rights reserved. > + * Author: Alex Williamson <alex.williamson@xxxxxxxxxx> > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2 as > + * published by the Free Software Foundation. > + * > + * Derived from original vfio: > + * Copyright 2010 Cisco Systems, Inc. All rights reserved. > + * Author: Tom Lyon, pugs@xxxxxxxxx > + */ > + > +#include <linux/compat.h> > +#include <linux/device.h> > +#include <linux/fs.h> > +#include <linux/iommu.h> > +#include <linux/module.h> > +#include <linux/mm.h> > +#include <linux/sched.h> > +#include <linux/slab.h> > +#include <linux/uaccess.h> > +#include <linux/vfio.h> > +#include <linux/workqueue.h> > + > +#include "vfio_private.h" > + > +struct dma_map_page { > + struct list_head list; > + dma_addr_t daddr; > + unsigned long vaddr; > + int npage; > + int rdwr; rdwr? Is this a flag thing? Could it be made in an enum? 
> +}; > + > +/* > + * This code handles mapping and unmapping of user data buffers > + * into DMA'ble space using the IOMMU > + */ > + > +#define NPAGE_TO_SIZE(npage) ((size_t)(npage) << PAGE_SHIFT) > + > +struct vwork { > + struct mm_struct *mm; > + int npage; > + struct work_struct work; > +}; > + > +/* delayed decrement for locked_vm */ > +static void vfio_lock_acct_bg(struct work_struct *work) > +{ > + struct vwork *vwork = container_of(work, struct vwork, work); > + struct mm_struct *mm; > + > + mm = vwork->mm; > + down_write(&mm->mmap_sem); > + mm->locked_vm += vwork->npage; > + up_write(&mm->mmap_sem); > + mmput(mm); /* unref mm */ > + kfree(vwork); > +} > + > +static void vfio_lock_acct(int npage) > +{ > + struct vwork *vwork; > + struct mm_struct *mm; > + > + if (!current->mm) { > + /* process exited */ > + return; > + } > + if (down_write_trylock(¤t->mm->mmap_sem)) { > + current->mm->locked_vm += npage; > + up_write(¤t->mm->mmap_sem); > + return; > + } > + /* > + * Couldn't get mmap_sem lock, so must setup to decrement > + * mm->locked_vm later. If locked_vm were atomic, we wouldn't > + * need this silliness > + */ > + vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL); > + if (!vwork) > + return; > + mm = get_task_mm(current); /* take ref mm */ > + if (!mm) { > + kfree(vwork); > + return; > + } > + INIT_WORK(&vwork->work, vfio_lock_acct_bg); > + vwork->mm = mm; > + vwork->npage = npage; > + schedule_work(&vwork->work); > +} > + > +/* Some mappings aren't backed by a struct page, for example an mmap'd > + * MMIO range for our own or another device. These use a different > + * pfn conversion and shouldn't be tracked as locked pages. */ > +static int is_invalid_reserved_pfn(unsigned long pfn) static bool > +{ > + if (pfn_valid(pfn)) { > + int reserved; > + struct page *tail = pfn_to_page(pfn); > + struct page *head = compound_trans_head(tail); > + reserved = PageReserved(head); bool reserved = PageReserved(head); > + if (head != tail) { > + /* "head" is not a dangling pointer > + * (compound_trans_head takes care of that) > + * but the hugepage may have been split > + * from under us (and we may not hold a > + * reference count on the head page so it can > + * be reused before we run PageReferenced), so > + * we've to check PageTail before returning > + * what we just read. > + */ > + smp_rmb(); > + if (PageTail(tail)) > + return reserved; > + } > + return PageReserved(tail); > + } > + > + return true; > +} > + > +static int put_pfn(unsigned long pfn, int rdwr) > +{ > + if (!is_invalid_reserved_pfn(pfn)) { > + struct page *page = pfn_to_page(pfn); > + if (rdwr) > + SetPageDirty(page); > + put_page(page); > + return 1; > + } > + return 0; > +} > + > +/* Unmap DMA region */ > +/* dgate must be held */ dgate? > +static int __vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova, > + int npage, int rdwr) > +{ > + int i, unlocked = 0; > + > + for (i = 0; i < npage; i++, iova += PAGE_SIZE) { > + unsigned long pfn; > + > + pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT; > + if (pfn) { > + iommu_unmap(iommu->domain, iova, 0); What is the '0' for? 
Perhaps a comment: /* We only do zero order */ > + unlocked += put_pfn(pfn, rdwr); > + } > + } > + return unlocked; > +} > + > +static void vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova, > + unsigned long npage, int rdwr) > +{ > + int unlocked; > + > + unlocked = __vfio_dma_unmap(iommu, iova, npage, rdwr); > + vfio_lock_acct(-unlocked); > +} > + > +/* Unmap ALL DMA regions */ > +void vfio_iommu_unmapall(struct vfio_iommu *iommu) > +{ > + struct list_head *pos, *pos2; pos2 should probably be just called 'tmp' > + struct dma_map_page *mlp; What does 'mlp' stand for? mlp -> dma_page ? > + > + mutex_lock(&iommu->dgate); > + list_for_each_safe(pos, pos2, &iommu->dm_list) { > + mlp = list_entry(pos, struct dma_map_page, list); > + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr); Uh, so if it did not get put_page() we would try to still delete it? Couldn't that lead to corruption as the 'mlp' is returned to the poll? Ah wait, the put_page is on the DMA page, so it is OK to delete the tracking structure. It will be just a leaked page. > + list_del(&mlp->list); > + kfree(mlp); > + } > + mutex_unlock(&iommu->dgate); > +} > + > +static int vaddr_get_pfn(unsigned long vaddr, int rdwr, unsigned long *pfn) > +{ > + struct page *page[1]; > + struct vm_area_struct *vma; > + int ret = -EFAULT; > + > + if (get_user_pages_fast(vaddr, 1, rdwr, page) == 1) { > + *pfn = page_to_pfn(page[0]); > + return 0; > + } > + > + down_read(¤t->mm->mmap_sem); > + > + vma = find_vma_intersection(current->mm, vaddr, vaddr + 1); > + > + if (vma && vma->vm_flags & VM_PFNMAP) { > + *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; > + if (is_invalid_reserved_pfn(*pfn)) > + ret = 0; Did you mean to break here? > + } > + > + up_read(¤t->mm->mmap_sem); > + > + return ret; > +} > + > +/* Map DMA region */ > +/* dgate must be held */ > +static int vfio_dma_map(struct vfio_iommu *iommu, unsigned long iova, > + unsigned long vaddr, int npage, int rdwr) > +{ > + unsigned long start = iova; > + int i, ret, locked = 0, prot = IOMMU_READ; > + > + /* Verify pages are not already mapped */ I think a 'that' is missing above. > + for (i = 0; i < npage; i++, iova += PAGE_SIZE) > + if (iommu_iova_to_phys(iommu->domain, iova)) > + return -EBUSY; > + > + iova = start; > + > + if (rdwr) > + prot |= IOMMU_WRITE; > + if (iommu->cache) > + prot |= IOMMU_CACHE; > + > + for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) { > + unsigned long pfn = 0; > + > + ret = vaddr_get_pfn(vaddr, rdwr, &pfn); > + if (ret) { > + __vfio_dma_unmap(iommu, start, i, rdwr); > + return ret; > + } > + > + /* Only add actual locked pages to accounting */ > + if (!is_invalid_reserved_pfn(pfn)) > + locked++; > + > + ret = iommu_map(iommu->domain, iova, > + (phys_addr_t)pfn << PAGE_SHIFT, 0, prot); Put a comment by the 0 saying /* order 0 pages only! */ > + if (ret) { > + /* Back out mappings on error */ > + put_pfn(pfn, rdwr); > + __vfio_dma_unmap(iommu, start, i, rdwr); > + return ret; > + } > + } > + vfio_lock_acct(locked); > + return 0; > +} > + > +static inline int ranges_overlap(unsigned long start1, size_t size1, Perhaps a bool? 
> + unsigned long start2, size_t size2) > +{ > + return !(start1 + size1 <= start2 || start2 + size2 <= start1); > +} > + > +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu, > + dma_addr_t start, size_t size) > +{ > + struct list_head *pos; > + struct dma_map_page *mlp; > + > + list_for_each(pos, &iommu->dm_list) { > + mlp = list_entry(pos, struct dma_map_page, list); > + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage), > + start, size)) > + return mlp; > + } > + return NULL; > +} > + > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start, > + size_t size, struct dma_map_page *mlp) > +{ > + struct dma_map_page *split; > + int npage_lo, npage_hi; > + > + /* Existing dma region is completely covered, unmap all */ > + if (start <= mlp->daddr && > + start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) { > + vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr); > + list_del(&mlp->list); > + npage_lo = mlp->npage; > + kfree(mlp); > + return npage_lo; > + } > + > + /* Overlap low address of existing range */ > + if (start <= mlp->daddr) { > + size_t overlap; > + > + overlap = start + size - mlp->daddr; > + npage_lo = overlap >> PAGE_SHIFT; > + npage_hi = mlp->npage - npage_lo; > + > + vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr); > + mlp->daddr += overlap; > + mlp->vaddr += overlap; > + mlp->npage -= npage_lo; > + return npage_lo; > + } > + > + /* Overlap high address of existing range */ > + if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) { > + size_t overlap; > + > + overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start; > + npage_hi = overlap >> PAGE_SHIFT; > + npage_lo = mlp->npage - npage_hi; > + > + vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr); > + mlp->npage -= npage_hi; > + return npage_hi; > + } > + > + /* Split existing */ > + npage_lo = (start - mlp->daddr) >> PAGE_SHIFT; > + npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo; > + > + split = kzalloc(sizeof *split, GFP_KERNEL); > + if (!split) > + return -ENOMEM; > + > + vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr); > + > + mlp->npage = npage_lo; > + > + split->npage = npage_hi; > + split->daddr = start + size; > + split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size; > + split->rdwr = mlp->rdwr; > + list_add(&split->list, &iommu->dm_list); > + return size >> PAGE_SHIFT; > +} > + > +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp) > +{ > + int ret = 0; > + size_t npage = dmp->size >> PAGE_SHIFT; > + struct list_head *pos, *n; > + > + if (dmp->dmaaddr & ~PAGE_MASK) > + return -EINVAL; > + if (dmp->size & ~PAGE_MASK) > + return -EINVAL; > + > + mutex_lock(&iommu->dgate); > + > + list_for_each_safe(pos, n, &iommu->dm_list) { > + struct dma_map_page *mlp; > + > + mlp = list_entry(pos, struct dma_map_page, list); > + if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage), > + dmp->dmaaddr, dmp->size)) { > + ret = vfio_remove_dma_overlap(iommu, dmp->dmaaddr, > + dmp->size, mlp); > + if (ret > 0) > + npage -= NPAGE_TO_SIZE(ret); > + if (ret < 0 || npage == 0) > + break; > + } > + } > + mutex_unlock(&iommu->dgate); > + return ret > 0 ? 
0 : ret; > +} > + > +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp) > +{ > + int npage; > + struct dma_map_page *mlp, *mmlp = NULL; > + dma_addr_t daddr = dmp->dmaaddr; > + unsigned long locked, lock_limit, vaddr = dmp->vaddr; > + size_t size = dmp->size; > + int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE; > + > + if (vaddr & (PAGE_SIZE-1)) > + return -EINVAL; > + if (daddr & (PAGE_SIZE-1)) > + return -EINVAL; > + if (size & (PAGE_SIZE-1)) > + return -EINVAL; > + > + npage = size >> PAGE_SHIFT; > + if (!npage) > + return -EINVAL; > + > + if (!iommu) > + return -EINVAL; > + > + mutex_lock(&iommu->dgate); > + > + if (vfio_find_dma(iommu, daddr, size)) { > + ret = -EBUSY; > + goto out_lock; > + } > + > + /* account for locked pages */ > + locked = current->mm->locked_vm + npage; > + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) { > + printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n", > + __func__, rlimit(RLIMIT_MEMLOCK)); > + ret = -ENOMEM; > + goto out_lock; > + } > + > + ret = vfio_dma_map(iommu, daddr, vaddr, npage, rdwr); > + if (ret) > + goto out_lock; > + > + /* Check if we abut a region below */ > + if (daddr) { > + mlp = vfio_find_dma(iommu, daddr - 1, 1); > + if (mlp && mlp->rdwr == rdwr && > + mlp->vaddr + NPAGE_TO_SIZE(mlp->npage) == vaddr) { > + > + mlp->npage += npage; > + daddr = mlp->daddr; > + vaddr = mlp->vaddr; > + npage = mlp->npage; > + size = NPAGE_TO_SIZE(npage); > + > + mmlp = mlp; > + } > + } > + > + if (daddr + size) { > + mlp = vfio_find_dma(iommu, daddr + size, 1); > + if (mlp && mlp->rdwr == rdwr && mlp->vaddr == vaddr + size) { > + > + mlp->npage += npage; > + mlp->daddr = daddr; > + mlp->vaddr = vaddr; > + > + /* If merged above and below, remove previously > + * merged entry. New entry covers it. */ > + if (mmlp) { > + list_del(&mmlp->list); > + kfree(mmlp); > + } > + mmlp = mlp; > + } > + } > + > + if (!mmlp) { > + mlp = kzalloc(sizeof *mlp, GFP_KERNEL); > + if (!mlp) { > + ret = -ENOMEM; > + vfio_dma_unmap(iommu, daddr, npage, rdwr); > + goto out_lock; > + } > + > + mlp->npage = npage; > + mlp->daddr = daddr; > + mlp->vaddr = vaddr; > + mlp->rdwr = rdwr; > + list_add(&mlp->list, &iommu->dm_list); > + } > + > +out_lock: > + mutex_unlock(&iommu->dgate); > + return ret; > +} > + > +static int vfio_iommu_release(struct inode *inode, struct file *filep) > +{ > + struct vfio_iommu *iommu = filep->private_data; > + > + vfio_release_iommu(iommu); > + return 0; > +} > + > +static long vfio_iommu_unl_ioctl(struct file *filep, > + unsigned int cmd, unsigned long arg) > +{ > + struct vfio_iommu *iommu = filep->private_data; > + int ret = -ENOSYS; > + > + if (cmd == VFIO_IOMMU_GET_FLAGS) { Something is weird with the tabbing here.. 
> + u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY; > + > + ret = put_user(flags, (u64 __user *)arg); > + > + } else if (cmd == VFIO_IOMMU_MAP_DMA) { > + struct vfio_dma_map dm; > + > + if (copy_from_user(&dm, (void __user *)arg, sizeof dm)) > + return -EFAULT; > + > + ret = vfio_dma_map_dm(iommu, &dm); > + > + if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm)) > + ret = -EFAULT; > + > + } else if (cmd == VFIO_IOMMU_UNMAP_DMA) { > + struct vfio_dma_map dm; > + > + if (copy_from_user(&dm, (void __user *)arg, sizeof dm)) > + return -EFAULT; > + > + ret = vfio_dma_unmap_dm(iommu, &dm); > + > + if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm)) > + ret = -EFAULT; > + } > + return ret; > +} > + > +#ifdef CONFIG_COMPAT > +static long vfio_iommu_compat_ioctl(struct file *filep, > + unsigned int cmd, unsigned long arg) > +{ > + arg = (unsigned long)compat_ptr(arg); > + return vfio_iommu_unl_ioctl(filep, cmd, arg); > +} > +#endif /* CONFIG_COMPAT */ > + > +const struct file_operations vfio_iommu_fops = { > + .owner = THIS_MODULE, > + .release = vfio_iommu_release, > + .unlocked_ioctl = vfio_iommu_unl_ioctl, > +#ifdef CONFIG_COMPAT > + .compat_ioctl = vfio_iommu_compat_ioctl, > +#endif > +}; > diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c > new file mode 100644 > index 0000000..6169356 > --- /dev/null > +++ b/drivers/vfio/vfio_main.c > @@ -0,0 +1,1151 @@ > +/* > + * VFIO framework > + * > + * Copyright (C) 2011 Red Hat, Inc. All rights reserved. > + * Author: Alex Williamson <alex.williamson@xxxxxxxxxx> > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2 as > + * published by the Free Software Foundation. > + * > + * Derived from original vfio: > + * Copyright 2010 Cisco Systems, Inc. All rights reserved. > + * Author: Tom Lyon, pugs@xxxxxxxxx > + */ > + > +#include <linux/cdev.h> > +#include <linux/compat.h> > +#include <linux/device.h> > +#include <linux/file.h> > +#include <linux/anon_inodes.h> > +#include <linux/fs.h> > +#include <linux/idr.h> > +#include <linux/iommu.h> > +#include <linux/mm.h> > +#include <linux/module.h> > +#include <linux/slab.h> > +#include <linux/string.h> > +#include <linux/uaccess.h> > +#include <linux/vfio.h> > +#include <linux/wait.h> > + > +#include "vfio_private.h" > + > +#define DRIVER_VERSION "0.2" > +#define DRIVER_AUTHOR "Alex Williamson <alex.williamson@xxxxxxxxxx>" > +#define DRIVER_DESC "VFIO - User Level meta-driver" > + > +static int allow_unsafe_intrs; __read_mostly > +module_param(allow_unsafe_intrs, int, 0); S_IRUGO ? > +MODULE_PARM_DESC(allow_unsafe_intrs, > + "Allow use of IOMMUs which do not support interrupt remapping"); > + > +static struct vfio { > + dev_t devt; > + struct cdev cdev; > + struct list_head group_list; > + struct mutex lock; > + struct kref kref; > + struct class *class; > + struct idr idr; > + wait_queue_head_t release_q; > +} vfio; You probably want to move this below the 'vfio_group' as vfio contains the vfio_group. 
> + > +static const struct file_operations vfio_group_fops; > +extern const struct file_operations vfio_iommu_fops; > + > +struct vfio_group { > + dev_t devt; > + unsigned int groupid; > + struct bus_type *bus; > + struct vfio_iommu *iommu; > + struct list_head device_list; > + struct list_head iommu_next; > + struct list_head group_next; > + int refcnt; > +}; > + > +struct vfio_device { > + struct device *dev; > + const struct vfio_device_ops *ops; > + struct vfio_iommu *iommu; > + struct vfio_group *group; > + struct list_head device_next; > + bool attached; > + int refcnt; > + void *device_data; > +}; And perhaps move this above vfio_group. As vfio_group contains a list of these structures? > + > +/* > + * Helper functions called under vfio.lock > + */ > + > +/* Return true if any devices within a group are opened */ > +static bool __vfio_group_devs_inuse(struct vfio_group *group) > +{ > + struct list_head *pos; > + > + list_for_each(pos, &group->device_list) { > + struct vfio_device *device; > + > + device = list_entry(pos, struct vfio_device, device_next); > + if (device->refcnt) > + return true; > + } > + return false; > +} > + > +/* Return true if any of the groups attached to an iommu are opened. > + * We can only tear apart merged groups when nothing is left open. */ > +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu) > +{ > + struct list_head *pos; > + > + list_for_each(pos, &iommu->group_list) { > + struct vfio_group *group; > + > + group = list_entry(pos, struct vfio_group, iommu_next); > + if (group->refcnt) > + return true; > + } > + return false; > +} > + > +/* An iommu is "in use" if it has a file descriptor open or if any of > + * the groups assigned to the iommu have devices open. */ > +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu) > +{ > + struct list_head *pos; > + > + if (iommu->refcnt) > + return true; > + > + list_for_each(pos, &iommu->group_list) { > + struct vfio_group *group; > + > + group = list_entry(pos, struct vfio_group, iommu_next); > + > + if (__vfio_group_devs_inuse(group)) > + return true; > + } > + return false; > +} > + > +static void __vfio_group_set_iommu(struct vfio_group *group, > + struct vfio_iommu *iommu) > +{ > + struct list_head *pos; > + > + if (group->iommu) > + list_del(&group->iommu_next); > + if (iommu) > + list_add(&group->iommu_next, &iommu->group_list); > + > + group->iommu = iommu; > + > + list_for_each(pos, &group->device_list) { > + struct vfio_device *device; > + > + device = list_entry(pos, struct vfio_device, device_next); > + device->iommu = iommu; > + } > +} > + > +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu, > + struct vfio_device *device) > +{ > + BUG_ON(!iommu->domain && device->attached); Whoa. Heavy hammer there. Perhaps WARN_ON as you do check it later on. 
> + > + if (!iommu->domain || !device->attached) > + return; > + > + iommu_detach_device(iommu->domain, device->dev); > + device->attached = false; > +} > + > +static void __vfio_iommu_detach_group(struct vfio_iommu *iommu, > + struct vfio_group *group) > +{ > + struct list_head *pos; > + > + list_for_each(pos, &group->device_list) { > + struct vfio_device *device; > + > + device = list_entry(pos, struct vfio_device, device_next); > + __vfio_iommu_detach_dev(iommu, device); > + } > +} > + > +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu, > + struct vfio_device *device) > +{ > + int ret; > + > + BUG_ON(device->attached); How about: WARN_ON(device->attached, "The engineer who wrote the user-space device driver is trying to register the device again! Tell him/her to stop please.\n"); > + > + if (!iommu || !iommu->domain) > + return -EINVAL; > + > + ret = iommu_attach_device(iommu->domain, device->dev); > + if (!ret) > + device->attached = true; > + > + return ret; > +} > + > +static int __vfio_iommu_attach_group(struct vfio_iommu *iommu, > + struct vfio_group *group) > +{ > + struct list_head *pos; > + > + list_for_each(pos, &group->device_list) { > + struct vfio_device *device; > + int ret; > + > + device = list_entry(pos, struct vfio_device, device_next); > + ret = __vfio_iommu_attach_dev(iommu, device); > + if (ret) { > + __vfio_iommu_detach_group(iommu, group); > + return ret; > + } > + } > + return 0; > +} > + > +/* The iommu is viable, ie. ready to be configured, when all the devices > + * for all the groups attached to the iommu are bound to their vfio device > + * drivers (ex. vfio-pci). This sets the device_data private data pointer. */ > +static bool __vfio_iommu_viable(struct vfio_iommu *iommu) > +{ > + struct list_head *gpos, *dpos; > + > + list_for_each(gpos, &iommu->group_list) { > + struct vfio_group *group; > + group = list_entry(gpos, struct vfio_group, iommu_next); > + > + list_for_each(dpos, &group->device_list) { > + struct vfio_device *device; > + device = list_entry(dpos, > + struct vfio_device, device_next); > + > + if (!device->device_data) > + return false; > + } > + } > + return true; > +} > + > +static void __vfio_close_iommu(struct vfio_iommu *iommu) > +{ > + struct list_head *pos; > + > + if (!iommu->domain) > + return; > + > + list_for_each(pos, &iommu->group_list) { > + struct vfio_group *group; > + group = list_entry(pos, struct vfio_group, iommu_next); > + > + __vfio_iommu_detach_group(iommu, group); > + } > + > + vfio_iommu_unmapall(iommu); > + > + iommu_domain_free(iommu->domain); > + iommu->domain = NULL; > + iommu->mm = NULL; > +} > + > +/* Open the IOMMU. This gates all access to the iommu or device file > + * descriptors and sets current->mm as the exclusive user. */ > +static int __vfio_open_iommu(struct vfio_iommu *iommu) > +{ > + struct list_head *pos; > + int ret; > + > + if (!__vfio_iommu_viable(iommu)) > + return -EBUSY; > + > + if (iommu->domain) > + return -EINVAL; > + > + iommu->domain = iommu_domain_alloc(iommu->bus); > + if (!iommu->domain) > + return -EFAULT; ENOMEM? 
> + > + list_for_each(pos, &iommu->group_list) { > + struct vfio_group *group; > + group = list_entry(pos, struct vfio_group, iommu_next); > + > + ret = __vfio_iommu_attach_group(iommu, group); > + if (ret) { > + __vfio_close_iommu(iommu); > + return ret; > + } > + } > + > + if (!allow_unsafe_intrs && > + !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) { > + __vfio_close_iommu(iommu); > + return -EFAULT; > + } > + > + iommu->cache = (iommu_domain_has_cap(iommu->domain, > + IOMMU_CAP_CACHE_COHERENCY) != 0); > + iommu->mm = current->mm; > + > + return 0; > +} > + > +/* Actively try to tear down the iommu and merged groups. If there are no > + * open iommu or device fds, we close the iommu. If we close the iommu and > + * there are also no open group fds, we can futher dissolve the group to > + * iommu association and free the iommu data structure. */ > +static int __vfio_try_dissolve_iommu(struct vfio_iommu *iommu) > +{ > + > + if (__vfio_iommu_inuse(iommu)) > + return -EBUSY; > + > + __vfio_close_iommu(iommu); > + > + if (!__vfio_iommu_groups_inuse(iommu)) { > + struct list_head *pos, *ppos; > + > + list_for_each_safe(pos, ppos, &iommu->group_list) { > + struct vfio_group *group; > + > + group = list_entry(pos, struct vfio_group, iommu_next); > + __vfio_group_set_iommu(group, NULL); > + } > + > + > + kfree(iommu); > + } > + > + return 0; > +} > + > +static struct vfio_device *__vfio_lookup_dev(struct device *dev) > +{ > + struct list_head *gpos; > + unsigned int groupid; > + > + if (iommu_device_group(dev, &groupid)) Hmm, where is this defined? v3.2-rc1 does not seem to have it? > + return NULL; > + > + list_for_each(gpos, &vfio.group_list) { > + struct vfio_group *group; > + struct list_head *dpos; > + > + group = list_entry(gpos, struct vfio_group, group_next); > + > + if (group->groupid != groupid) > + continue; > + > + list_for_each(dpos, &group->device_list) { > + struct vfio_device *device; > + > + device = list_entry(dpos, > + struct vfio_device, device_next); > + > + if (device->dev == dev) > + return device; > + } > + } > + return NULL; > +} > + > +/* All release paths simply decrement the refcnt, attempt to teardown > + * the iommu and merged groups, and wakeup anything that might be > + * waiting if we successfully dissolve anything. 
*/ > +static int vfio_do_release(int *refcnt, struct vfio_iommu *iommu) > +{ > + bool wake; > + > + mutex_lock(&vfio.lock); > + > + (*refcnt)--; > + wake = (__vfio_try_dissolve_iommu(iommu) == 0); > + > + mutex_unlock(&vfio.lock); > + > + if (wake) > + wake_up(&vfio.release_q); > + > + return 0; > +} > + > +/* > + * Device fops - passthrough to vfio device driver w/ device_data > + */ > +static int vfio_device_release(struct inode *inode, struct file *filep) > +{ > + struct vfio_device *device = filep->private_data; > + > + vfio_do_release(&device->refcnt, device->iommu); > + > + device->ops->put(device->device_data); > + > + return 0; > +} > + > +static long vfio_device_unl_ioctl(struct file *filep, > + unsigned int cmd, unsigned long arg) > +{ > + struct vfio_device *device = filep->private_data; > + > + return device->ops->ioctl(device->device_data, cmd, arg); > +} > + > +static ssize_t vfio_device_read(struct file *filep, char __user *buf, > + size_t count, loff_t *ppos) > +{ > + struct vfio_device *device = filep->private_data; > + > + return device->ops->read(device->device_data, buf, count, ppos); > +} > + > +static ssize_t vfio_device_write(struct file *filep, const char __user *buf, > + size_t count, loff_t *ppos) > +{ > + struct vfio_device *device = filep->private_data; > + > + return device->ops->write(device->device_data, buf, count, ppos); > +} > + > +static int vfio_device_mmap(struct file *filep, struct vm_area_struct *vma) > +{ > + struct vfio_device *device = filep->private_data; > + > + return device->ops->mmap(device->device_data, vma); > +} > + > +#ifdef CONFIG_COMPAT > +static long vfio_device_compat_ioctl(struct file *filep, > + unsigned int cmd, unsigned long arg) > +{ > + arg = (unsigned long)compat_ptr(arg); > + return vfio_device_unl_ioctl(filep, cmd, arg); > +} > +#endif /* CONFIG_COMPAT */ > + > +const struct file_operations vfio_device_fops = { > + .owner = THIS_MODULE, > + .release = vfio_device_release, > + .read = vfio_device_read, > + .write = vfio_device_write, > + .unlocked_ioctl = vfio_device_unl_ioctl, > +#ifdef CONFIG_COMPAT > + .compat_ioctl = vfio_device_compat_ioctl, > +#endif > + .mmap = vfio_device_mmap, > +}; > + > +/* > + * Group fops > + */ > +static int vfio_group_open(struct inode *inode, struct file *filep) > +{ > + struct vfio_group *group; > + int ret = 0; > + > + mutex_lock(&vfio.lock); > + > + group = idr_find(&vfio.idr, iminor(inode)); > + > + if (!group) { > + ret = -ENODEV; > + goto out; > + } > + > + filep->private_data = group; > + > + if (!group->iommu) { > + struct vfio_iommu *iommu; > + > + iommu = kzalloc(sizeof(*iommu), GFP_KERNEL); > + if (!iommu) { > + ret = -ENOMEM; > + goto out; > + } > + INIT_LIST_HEAD(&iommu->group_list); > + INIT_LIST_HEAD(&iommu->dm_list); > + mutex_init(&iommu->dgate); > + iommu->bus = group->bus; > + __vfio_group_set_iommu(group, iommu); > + } > + group->refcnt++; > + > +out: > + mutex_unlock(&vfio.lock); > + > + return ret; > +} > + > +static int vfio_group_release(struct inode *inode, struct file *filep) > +{ > + struct vfio_group *group = filep->private_data; > + > + return vfio_do_release(&group->refcnt, group->iommu); > +} > + > +/* Attempt to merge the group pointed to by fd into group. The merge-ee > + * group must not have an iommu or any devices open because we cannot > + * maintain that context across the merge. The merge-er group can be > + * in use. 
*/ > +static int vfio_group_merge(struct vfio_group *group, int fd) > +{ > + struct vfio_group *new; > + struct vfio_iommu *old_iommu; > + struct file *file; > + int ret = 0; > + bool opened = false; > + > + mutex_lock(&vfio.lock); > + > + file = fget(fd); > + if (!file) { > + ret = -EBADF; > + goto out_noput; > + } > + > + /* Sanity check, is this really our fd? */ > + if (file->f_op != &vfio_group_fops) { > + ret = -EINVAL; > + goto out; > + } > + > + new = file->private_data; > + > + if (!new || new == group || !new->iommu || > + new->iommu->domain || new->bus != group->bus) { > + ret = -EINVAL; > + goto out; > + } > + > + /* We need to attach all the devices to each domain separately > + * in order to validate that the capabilities match for both. */ > + ret = __vfio_open_iommu(new->iommu); > + if (ret) > + goto out; > + > + if (!group->iommu->domain) { > + ret = __vfio_open_iommu(group->iommu); > + if (ret) > + goto out; > + opened = true; > + } > + > + /* If cache coherency doesn't match we'd potentialy need to > + * remap existing iommu mappings in the merge-er domain. > + * Poor return to bother trying to allow this currently. */ > + if (iommu_domain_has_cap(group->iommu->domain, > + IOMMU_CAP_CACHE_COHERENCY) != > + iommu_domain_has_cap(new->iommu->domain, > + IOMMU_CAP_CACHE_COHERENCY)) { > + __vfio_close_iommu(new->iommu); > + if (opened) > + __vfio_close_iommu(group->iommu); > + ret = -EINVAL; > + goto out; > + } > + > + /* Close the iommu for the merge-ee and attach all its devices > + * to the merge-er iommu. */ > + __vfio_close_iommu(new->iommu); > + > + ret = __vfio_iommu_attach_group(group->iommu, new); > + if (ret) > + goto out; > + > + /* set_iommu unlinks new from the iommu, so save a pointer to it */ > + old_iommu = new->iommu; > + __vfio_group_set_iommu(new, group->iommu); > + kfree(old_iommu); > + > +out: > + fput(file); > +out_noput: > + mutex_unlock(&vfio.lock); > + return ret; > +} > + > +/* Unmerge the group pointed to by fd from group. */ > +static int vfio_group_unmerge(struct vfio_group *group, int fd) > +{ > + struct vfio_group *new; > + struct vfio_iommu *new_iommu; > + struct file *file; > + int ret = 0; > + > + /* Since the merge-out group is already opened, it needs to > + * have an iommu struct associated with it. */ > + new_iommu = kzalloc(sizeof(*new_iommu), GFP_KERNEL); > + if (!new_iommu) > + return -ENOMEM; > + > + INIT_LIST_HEAD(&new_iommu->group_list); > + INIT_LIST_HEAD(&new_iommu->dm_list); > + mutex_init(&new_iommu->dgate); > + new_iommu->bus = group->bus; > + > + mutex_lock(&vfio.lock); > + > + file = fget(fd); > + if (!file) { > + ret = -EBADF; > + goto out_noput; > + } > + > + /* Sanity check, is this really our fd? */ > + if (file->f_op != &vfio_group_fops) { > + ret = -EINVAL; > + goto out; > + } > + > + new = file->private_data; > + if (!new || new == group || new->iommu != group->iommu) { > + ret = -EINVAL; > + goto out; > + } > + > + /* We can't merge-out a group with devices still in use. */ > + if (__vfio_group_devs_inuse(new)) { > + ret = -EBUSY; > + goto out; > + } > + > + __vfio_iommu_detach_group(group->iommu, new); > + __vfio_group_set_iommu(new, new_iommu); > + > +out: > + fput(file); > +out_noput: > + if (ret) > + kfree(new_iommu); > + mutex_unlock(&vfio.lock); > + return ret; > +} > + > +/* Get a new iommu file descriptor. This will open the iommu, setting > + * the current->mm ownership if it's not already set. 
*/ > +static int vfio_group_get_iommu_fd(struct vfio_group *group) > +{ > + int ret = 0; > + > + mutex_lock(&vfio.lock); > + > + if (!group->iommu->domain) { > + ret = __vfio_open_iommu(group->iommu); > + if (ret) > + goto out; > + } > + > + ret = anon_inode_getfd("[vfio-iommu]", &vfio_iommu_fops, > + group->iommu, O_RDWR); > + if (ret < 0) > + goto out; > + > + group->iommu->refcnt++; > +out: > + mutex_unlock(&vfio.lock); > + return ret; > +} > + > +/* Get a new device file descriptor. This will open the iommu, setting > + * the current->mm ownership if it's not already set. It's difficult to > + * specify the requirements for matching a user supplied buffer to a > + * device, so we use a vfio driver callback to test for a match. For > + * PCI, dev_name(dev) is unique, but other drivers may require including > + * a parent device string. */ > +static int vfio_group_get_device_fd(struct vfio_group *group, char *buf) > +{ > + struct vfio_iommu *iommu = group->iommu; > + struct list_head *gpos; > + int ret = -ENODEV; > + > + mutex_lock(&vfio.lock); > + > + if (!iommu->domain) { > + ret = __vfio_open_iommu(iommu); > + if (ret) > + goto out; > + } > + > + list_for_each(gpos, &iommu->group_list) { > + struct list_head *dpos; > + > + group = list_entry(gpos, struct vfio_group, iommu_next); > + > + list_for_each(dpos, &group->device_list) { > + struct vfio_device *device; > + > + device = list_entry(dpos, > + struct vfio_device, device_next); > + > + if (device->ops->match(device->dev, buf)) { > + struct file *file; > + > + if (device->ops->get(device->device_data)) { > + ret = -EFAULT; > + goto out; > + } > + > + /* We can't use anon_inode_getfd(), like above > + * because we need to modify the f_mode flags > + * directly to allow more than just ioctls */ > + ret = get_unused_fd(); > + if (ret < 0) { > + device->ops->put(device->device_data); > + goto out; > + } > + > + file = anon_inode_getfile("[vfio-device]", > + &vfio_device_fops, > + device, O_RDWR); > + if (IS_ERR(file)) { > + put_unused_fd(ret); > + ret = PTR_ERR(file); > + device->ops->put(device->device_data); > + goto out; > + } > + > + /* Todo: add an anon_inode interface to do > + * this. Appears to be missing by lack of > + * need rather than explicitly prevented. > + * Now there's need. 
*/ > + file->f_mode |= (FMODE_LSEEK | > + FMODE_PREAD | > + FMODE_PWRITE); > + > + fd_install(ret, file); > + > + device->refcnt++; > + goto out; > + } > + } > + } > +out: > + mutex_unlock(&vfio.lock); > + return ret; > +} > + > +static long vfio_group_unl_ioctl(struct file *filep, > + unsigned int cmd, unsigned long arg) > +{ > + struct vfio_group *group = filep->private_data; > + > + if (cmd == VFIO_GROUP_GET_FLAGS) { > + u64 flags = 0; > + > + mutex_lock(&vfio.lock); > + if (__vfio_iommu_viable(group->iommu)) > + flags |= VFIO_GROUP_FLAGS_VIABLE; > + mutex_unlock(&vfio.lock); > + > + if (group->iommu->mm) > + flags |= VFIO_GROUP_FLAGS_MM_LOCKED; > + > + return put_user(flags, (u64 __user *)arg); > + } > + > + /* Below commands are restricted once the mm is set */ > + if (group->iommu->mm && group->iommu->mm != current->mm) > + return -EPERM; > + > + if (cmd == VFIO_GROUP_MERGE || cmd == VFIO_GROUP_UNMERGE) { > + int fd; > + > + if (get_user(fd, (int __user *)arg)) > + return -EFAULT; > + if (fd < 0) > + return -EINVAL; > + > + if (cmd == VFIO_GROUP_MERGE) > + return vfio_group_merge(group, fd); > + else > + return vfio_group_unmerge(group, fd); > + } else if (cmd == VFIO_GROUP_GET_IOMMU_FD) { > + return vfio_group_get_iommu_fd(group); > + } else if (cmd == VFIO_GROUP_GET_DEVICE_FD) { > + char *buf; > + int ret; > + > + buf = strndup_user((const char __user *)arg, PAGE_SIZE); > + if (IS_ERR(buf)) > + return PTR_ERR(buf); > + > + ret = vfio_group_get_device_fd(group, buf); > + kfree(buf); > + return ret; > + } > + > + return -ENOSYS; > +} > + > +#ifdef CONFIG_COMPAT > +static long vfio_group_compat_ioctl(struct file *filep, > + unsigned int cmd, unsigned long arg) > +{ > + arg = (unsigned long)compat_ptr(arg); > + return vfio_group_unl_ioctl(filep, cmd, arg); > +} > +#endif /* CONFIG_COMPAT */ > + > +static const struct file_operations vfio_group_fops = { > + .owner = THIS_MODULE, > + .open = vfio_group_open, > + .release = vfio_group_release, > + .unlocked_ioctl = vfio_group_unl_ioctl, > +#ifdef CONFIG_COMPAT > + .compat_ioctl = vfio_group_compat_ioctl, > +#endif > +}; > + > +/* iommu fd release hook */ > +int vfio_release_iommu(struct vfio_iommu *iommu) > +{ > + return vfio_do_release(&iommu->refcnt, iommu); > +} > + > +/* > + * VFIO driver API > + */ > + > +/* Add a new device to the vfio framework with associated vfio driver > + * callbacks. This is the entry point for vfio drivers to register devices. 
*/ > +int vfio_group_add_dev(struct device *dev, const struct vfio_device_ops *ops) > +{ > + struct list_head *pos; > + struct vfio_group *group = NULL; > + struct vfio_device *device = NULL; > + unsigned int groupid; > + int ret = 0; > + bool new_group = false; > + > + if (!ops) > + return -EINVAL; > + > + if (iommu_device_group(dev, &groupid)) > + return -ENODEV; > + > + mutex_lock(&vfio.lock); > + > + list_for_each(pos, &vfio.group_list) { > + group = list_entry(pos, struct vfio_group, group_next); > + if (group->groupid == groupid) > + break; > + group = NULL; > + } > + > + if (!group) { > + int minor; > + > + if (unlikely(idr_pre_get(&vfio.idr, GFP_KERNEL) == 0)) { > + ret = -ENOMEM; > + goto out; > + } > + > + group = kzalloc(sizeof(*group), GFP_KERNEL); > + if (!group) { > + ret = -ENOMEM; > + goto out; > + } > + > + group->groupid = groupid; > + INIT_LIST_HEAD(&group->device_list); > + > + ret = idr_get_new(&vfio.idr, group, &minor); > + if (ret == 0 && minor > MINORMASK) { > + idr_remove(&vfio.idr, minor); > + kfree(group); > + ret = -ENOSPC; > + goto out; > + } > + > + group->devt = MKDEV(MAJOR(vfio.devt), minor); > + device_create(vfio.class, NULL, group->devt, > + group, "%u", groupid); > + > + group->bus = dev->bus; Oh, so that is how the IOMMU iommu_ops get copied! You might want to mention that - I was not sure where the 'handoff' was done so that an inserted device ends up with working iommu_ops. Ok, so whether a device can actually do IOMMU is only detected when we try to open the group - that is when iommu_domain_alloc() is called, which can return NULL if iommu_ops is not set. So what about devices that don't have iommu_ops? Say they are using SWIOTLB (like the AMD-Vi sometimes does when the device is not on its list)? Can we use iommu_present()? Something along the lines of the sketch at the end of this mail. > + list_add(&group->group_next, &vfio.group_list); > + new_group = true; > + } else { > + if (group->bus != dev->bus) { > + printk(KERN_WARNING > + "Error: IOMMU group ID conflict. Group ID %u " > + "on both bus %s and %s\n", groupid, > + group->bus->name, dev->bus->name); > + ret = -EFAULT; > + goto out; > + } > + > + list_for_each(pos, &group->device_list) { > + device = list_entry(pos, > + struct vfio_device, device_next); > + if (device->dev == dev) > + break; > + device = NULL; > + } > + } > + > + if (!device) { > + if (__vfio_group_devs_inuse(group) || > + (group->iommu && group->iommu->refcnt)) { > + printk(KERN_WARNING > + "Adding device %s to group %u while group is already in use!!\n", > + dev_name(dev), group->groupid); > + /* XXX How to prevent other drivers from claiming?
*/ > + } > + > + device = kzalloc(sizeof(*device), GFP_KERNEL); > + if (!device) { > + /* If we just created this group, tear it down */ > + if (new_group) { > + list_del(&group->group_next); > + device_destroy(vfio.class, group->devt); > + idr_remove(&vfio.idr, MINOR(group->devt)); > + kfree(group); > + } > + ret = -ENOMEM; > + goto out; > + } > + > + list_add(&device->device_next, &group->device_list); > + device->dev = dev; > + device->ops = ops; > + device->iommu = group->iommu; /* NULL if new */ > + __vfio_iommu_attach_dev(group->iommu, device); > + } > +out: > + mutex_unlock(&vfio.lock); > + return ret; > +} > +EXPORT_SYMBOL_GPL(vfio_group_add_dev); > + > +/* Remove a device from the vfio framework */ > +void vfio_group_del_dev(struct device *dev) > +{ > + struct list_head *pos; > + struct vfio_group *group = NULL; > + struct vfio_device *device = NULL; > + unsigned int groupid; > + > + if (iommu_device_group(dev, &groupid)) > + return; > + > + mutex_lock(&vfio.lock); > + > + list_for_each(pos, &vfio.group_list) { > + group = list_entry(pos, struct vfio_group, group_next); > + if (group->groupid == groupid) > + break; > + group = NULL; > + } > + > + if (!group) > + goto out; > + > + list_for_each(pos, &group->device_list) { > + device = list_entry(pos, struct vfio_device, device_next); > + if (device->dev == dev) > + break; > + device = NULL; > + } > + > + if (!device) > + goto out; > + > + BUG_ON(device->refcnt); > + > + if (device->attached) > + __vfio_iommu_detach_dev(group->iommu, device); > + > + list_del(&device->device_next); > + kfree(device); > + > + /* If this was the only device in the group, remove the group. > + * Note that we intentionally unmerge empty groups here if the > + * group fd isn't opened. */ > + if (list_empty(&group->device_list) && group->refcnt == 0) { > + struct vfio_iommu *iommu = group->iommu; > + > + if (iommu) { > + __vfio_group_set_iommu(group, NULL); > + __vfio_try_dissolve_iommu(iommu); > + } > + > + device_destroy(vfio.class, group->devt); > + idr_remove(&vfio.idr, MINOR(group->devt)); > + list_del(&group->group_next); > + kfree(group); > + } > +out: > + mutex_unlock(&vfio.lock); > +} > +EXPORT_SYMBOL_GPL(vfio_group_del_dev); > + > +/* When a device is bound to a vfio device driver (ex. vfio-pci), this > + * entry point is used to mark the device usable (viable). The vfio > + * device driver associates a private device_data struct with the device > + * here, which will later be return for vfio_device_fops callbacks. */ > +int vfio_bind_dev(struct device *dev, void *device_data) > +{ > + struct vfio_device *device; > + int ret = -EINVAL; > + > + BUG_ON(!device_data); > + > + mutex_lock(&vfio.lock); > + > + device = __vfio_lookup_dev(dev); > + > + BUG_ON(!device); > + > + ret = dev_set_drvdata(dev, device); > + if (!ret) > + device->device_data = device_data; > + > + mutex_unlock(&vfio.lock); > + return ret; > +} > +EXPORT_SYMBOL_GPL(vfio_bind_dev); > + > +/* A device is only removeable if the iommu for the group is not in use. */ > +static bool vfio_device_removeable(struct vfio_device *device) > +{ > + bool ret = true; > + > + mutex_lock(&vfio.lock); > + > + if (device->iommu && __vfio_iommu_inuse(device->iommu)) > + ret = false; > + > + mutex_unlock(&vfio.lock); > + return ret; > +} > + > +/* Notify vfio that a device is being unbound from the vfio device driver > + * and return the device private device_data pointer. 
If the group is > + * in use, we need to block or take other measures to make it safe for > + * the device to be removed from the iommu. */ > +void *vfio_unbind_dev(struct device *dev) > +{ > + struct vfio_device *device = dev_get_drvdata(dev); > + void *device_data; > + > + BUG_ON(!device); > + > +again: > + if (!vfio_device_removeable(device)) { > + /* XXX signal for all devices in group to be removed or > + * resort to killing the process holding the device fds. > + * For now just block waiting for releases to wake us. */ > + wait_event(vfio.release_q, vfio_device_removeable(device)); > + } > + > + mutex_lock(&vfio.lock); > + > + /* Need to re-check that the device is still removeable under lock. */ > + if (device->iommu && __vfio_iommu_inuse(device->iommu)) { > + mutex_unlock(&vfio.lock); > + goto again; > + } > + > + device_data = device->device_data; > + > + device->device_data = NULL; > + dev_set_drvdata(dev, NULL); > + > + mutex_unlock(&vfio.lock); > + return device_data; > +} > +EXPORT_SYMBOL_GPL(vfio_unbind_dev); > + > +/* > + * Module/class support > + */ > +static void vfio_class_release(struct kref *kref) > +{ > + class_destroy(vfio.class); > + vfio.class = NULL; > +} > + > +static char *vfio_devnode(struct device *dev, mode_t *mode) > +{ > + return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev)); > +} > + > +static int __init vfio_init(void) > +{ > + int ret; > + > + idr_init(&vfio.idr); > + mutex_init(&vfio.lock); > + INIT_LIST_HEAD(&vfio.group_list); > + init_waitqueue_head(&vfio.release_q); > + > + kref_init(&vfio.kref); > + vfio.class = class_create(THIS_MODULE, "vfio"); > + if (IS_ERR(vfio.class)) { > + ret = PTR_ERR(vfio.class); > + goto err_class; > + } > + > + vfio.class->devnode = vfio_devnode; > + > + /* FIXME - how many minors to allocate... all of them! */ > + ret = alloc_chrdev_region(&vfio.devt, 0, MINORMASK, "vfio"); > + if (ret) > + goto err_chrdev; > + > + cdev_init(&vfio.cdev, &vfio_group_fops); > + ret = cdev_add(&vfio.cdev, vfio.devt, MINORMASK); > + if (ret) > + goto err_cdev; > + > + pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n"); > + > + return 0; > + > +err_cdev: > + unregister_chrdev_region(vfio.devt, MINORMASK); > +err_chrdev: > + kref_put(&vfio.kref, vfio_class_release); > +err_class: > + return ret; > +} > + > +static void __exit vfio_cleanup(void) > +{ > + struct list_head *gpos, *gppos; > + > + list_for_each_safe(gpos, gppos, &vfio.group_list) { > + struct vfio_group *group; > + struct list_head *dpos, *dppos; > + > + group = list_entry(gpos, struct vfio_group, group_next); > + > + list_for_each_safe(dpos, dppos, &group->device_list) { > + struct vfio_device *device; > + > + device = list_entry(dpos, > + struct vfio_device, device_next); > + vfio_group_del_dev(device->dev); > + } > + } > + > + idr_destroy(&vfio.idr); > + cdev_del(&vfio.cdev); > + unregister_chrdev_region(vfio.devt, MINORMASK); > + kref_put(&vfio.kref, vfio_class_release); > +} > + > +module_init(vfio_init); > +module_exit(vfio_cleanup); > + > +MODULE_VERSION(DRIVER_VERSION); > +MODULE_LICENSE("GPL v2"); > +MODULE_AUTHOR(DRIVER_AUTHOR); > +MODULE_DESCRIPTION(DRIVER_DESC); > diff --git a/drivers/vfio/vfio_private.h b/drivers/vfio/vfio_private.h > new file mode 100644 > index 0000000..350ad67 > --- /dev/null > +++ b/drivers/vfio/vfio_private.h > @@ -0,0 +1,34 @@ > +/* > + * Copyright (C) 2011 Red Hat, Inc. All rights reserved. 
> + * Author: Alex Williamson <alex.williamson@xxxxxxxxxx> > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License version 2 as > + * published by the Free Software Foundation. > + * > + * Derived from original vfio: > + * Copyright 2010 Cisco Systems, Inc. All rights reserved. > + * Author: Tom Lyon, pugs@xxxxxxxxx > + */ > + > +#include <linux/list.h> > +#include <linux/mutex.h> > + > +#ifndef VFIO_PRIVATE_H > +#define VFIO_PRIVATE_H > + > +struct vfio_iommu { > + struct iommu_domain *domain; > + struct bus_type *bus; > + struct mutex dgate; > + struct list_head dm_list; > + struct mm_struct *mm; > + struct list_head group_list; > + int refcnt; > + bool cache; > +}; > + > +extern int vfio_release_iommu(struct vfio_iommu *iommu); > +extern void vfio_iommu_unmapall(struct vfio_iommu *iommu); > + > +#endif /* VFIO_PRIVATE_H */ > diff --git a/include/linux/vfio.h b/include/linux/vfio.h > new file mode 100644 > index 0000000..4269b08 > --- /dev/null > +++ b/include/linux/vfio.h > @@ -0,0 +1,155 @@ > +/* > + * Copyright 2010 Cisco Systems, Inc. All rights reserved. > + * Author: Tom Lyon, pugs@xxxxxxxxx > + * > + * This program is free software; you may redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation; version 2 of the License. > + * > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE > + * SOFTWARE. > + * > + * Portions derived from drivers/uio/uio.c: > + * Copyright(C) 2005, Benedikt Spranger <b.spranger@xxxxxxxxxxxxx> > + * Copyright(C) 2005, Thomas Gleixner <tglx@xxxxxxxxxxxxx> > + * Copyright(C) 2006, Hans J. Koch <hjk@xxxxxxxxxxxxx> > + * Copyright(C) 2006, Greg Kroah-Hartman <greg@xxxxxxxxx> > + * > + * Portions derived from drivers/uio/uio_pci_generic.c: > + * Copyright (C) 2009 Red Hat, Inc. > + * Author: Michael S. Tsirkin <mst@xxxxxxxxxx> > + */ > +#include <linux/types.h> > + > +#ifndef VFIO_H > +#define VFIO_H > + > +#ifdef __KERNEL__ > + > +struct vfio_device_ops { > + bool (*match)(struct device *, char *); > + int (*get)(void *); > + void (*put)(void *); > + ssize_t (*read)(void *, char __user *, > + size_t, loff_t *); > + ssize_t (*write)(void *, const char __user *, > + size_t, loff_t *); > + long (*ioctl)(void *, unsigned int, unsigned long); > + int (*mmap)(void *, struct vm_area_struct *); > +}; > + > +extern int vfio_group_add_dev(struct device *device, > + const struct vfio_device_ops *ops); > +extern void vfio_group_del_dev(struct device *device); > +extern int vfio_bind_dev(struct device *device, void *device_data); > +extern void *vfio_unbind_dev(struct device *device); > + > +#endif /* __KERNEL__ */ > + > +/* > + * VFIO driver - allow mapping and use of certain devices > + * in unprivileged user processes. 
(If IOMMU is present) > + * Especially useful for Virtual Function parts of SR-IOV devices > + */ > + > + > +/* Kernel & User level defines for ioctls */ > + > +#define VFIO_GROUP_GET_FLAGS _IOR(';', 100, __u64) > + #define VFIO_GROUP_FLAGS_VIABLE (1 << 0) > + #define VFIO_GROUP_FLAGS_MM_LOCKED (1 << 1) > +#define VFIO_GROUP_MERGE _IOW(';', 101, int) > +#define VFIO_GROUP_UNMERGE _IOW(';', 102, int) > +#define VFIO_GROUP_GET_IOMMU_FD _IO(';', 103) > +#define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 104, char *) > + > +/* > + * Structure for DMA mapping of user buffers > + * vaddr, dmaaddr, and size must all be page aligned > + */ > +struct vfio_dma_map { > + __u64 len; /* length of structure */ > + __u64 vaddr; /* process virtual addr */ > + __u64 dmaaddr; /* desired and/or returned dma address */ > + __u64 size; /* size in bytes */ > + __u64 flags; > +#define VFIO_DMA_MAP_FLAG_WRITE (1 << 0) /* req writeable DMA mem */ > +}; > + > +#define VFIO_IOMMU_GET_FLAGS _IOR(';', 105, __u64) > + /* Does the IOMMU support mapping any IOVA to any virtual address? */ > + #define VFIO_IOMMU_FLAGS_MAP_ANY (1 << 0) > +#define VFIO_IOMMU_MAP_DMA _IOWR(';', 106, struct vfio_dma_map) > +#define VFIO_IOMMU_UNMAP_DMA _IOWR(';', 107, struct vfio_dma_map) > + > +#define VFIO_DEVICE_GET_FLAGS _IOR(';', 108, __u64) > + #define VFIO_DEVICE_FLAGS_PCI (1 << 0) > + #define VFIO_DEVICE_FLAGS_DT (1 << 1) > + #define VFIO_DEVICE_FLAGS_RESET (1 << 2) > +#define VFIO_DEVICE_GET_NUM_REGIONS _IOR(';', 109, int) > + > +struct vfio_region_info { > + __u32 len; /* length of structure */ > + __u32 index; /* region number */ > + __u64 size; /* size in bytes of region */ > + __u64 offset; /* start offset of region */ > + __u64 flags; > +#define VFIO_REGION_INFO_FLAG_MMAP (1 << 0) > +#define VFIO_REGION_INFO_FLAG_RO (1 << 1) > +#define VFIO_REGION_INFO_FLAG_PHYS_VALID (1 << 2) > + __u64 phys; /* physical address of region */ > +}; > + > +#define VFIO_DEVICE_GET_REGION_INFO _IOWR(';', 110, struct vfio_region_info) > + > +#define VFIO_DEVICE_GET_NUM_IRQS _IOR(';', 111, int) > + > +struct vfio_irq_info { > + __u32 len; /* length of structure */ > + __u32 index; /* IRQ number */ > + __u32 count; /* number of individual IRQs */ > + __u32 flags; > +#define VFIO_IRQ_INFO_FLAG_LEVEL (1 << 0) > +}; > + > +#define VFIO_DEVICE_GET_IRQ_INFO _IOWR(';', 112, struct vfio_irq_info) > + > +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */ > +#define VFIO_DEVICE_SET_IRQ_EVENTFDS _IOW(';', 113, int) > + > +/* Unmask IRQ index, arg[0] = index */ > +#define VFIO_DEVICE_UNMASK_IRQ _IOW(';', 114, int) > + > +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */ > +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD _IOW(';', 115, int) > + > +#define VFIO_DEVICE_RESET _IO(';', 116) > + > +struct vfio_dtpath { > + __u32 len; /* length of structure */ > + __u32 index; > + __u64 flags; > +#define VFIO_DTPATH_FLAGS_REGION (1 << 0) > +#define VFIO_DTPATH_FLAGS_IRQ (1 << 1) > + char *path; > +}; > +#define VFIO_DEVICE_GET_DTPATH _IOWR(';', 117, struct vfio_dtpath) > + > +struct vfio_dtindex { > + __u32 len; /* length of structure */ > + __u32 index; > + __u32 prop_type; > + __u32 prop_index; > + __u64 flags; > +#define VFIO_DTINDEX_FLAGS_REGION (1 << 0) > +#define VFIO_DTINDEX_FLAGS_IRQ (1 << 1) > +}; > +#define VFIO_DEVICE_GET_DTINDEX _IOWR(';', 118, struct vfio_dtindex) > + > +#endif /* VFIO_H */ So where is the vfio-pci? Is that a separate posting?
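
While I am at it, let me write down how I read the intended userspace flow from the ioctls above - just a rough, untested sketch so you can tell me if I misread the API. The group number, the PCI device name and the 1MB mapping are made up for illustration, and <linux/vfio.h> here means the header from this patch:

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main(void)
{
	int group, iommu, device;
	uint64_t flags = 0;
	struct vfio_dma_map map;
	void *buf;

	/* One char device per group, created by vfio_group_add_dev() */
	group = open("/dev/vfio/26", O_RDWR);
	if (group < 0)
		return 1;

	/* Group is only usable once all of its devices are bound to vfio */
	if (ioctl(group, VFIO_GROUP_GET_FLAGS, &flags) < 0 ||
	    !(flags & VFIO_GROUP_FLAGS_VIABLE))
		return 1;

	/* Opening the iommu fd ties the group to this mm */
	iommu = ioctl(group, VFIO_GROUP_GET_IOMMU_FD);
	if (iommu < 0)
		return 1;

	/* Map 1MB of our memory at IOVA 0 for device DMA */
	buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;
	memset(&map, 0, sizeof(map));
	map.len = sizeof(map);
	map.vaddr = (uintptr_t)buf;
	map.dmaaddr = 0;
	map.size = 1 << 20;
	map.flags = VFIO_DMA_MAP_FLAG_WRITE;
	if (ioctl(iommu, VFIO_IOMMU_MAP_DMA, &map) < 0)
		return 1;

	/* Device fds are looked up by a bus specific name, dev_name() for PCI */
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

	return device < 0;
}

If that sequence is roughly right, it might be worth spelling it out in Documentation/vfio.txt.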
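
And here, roughly, is the iommu_present() check I was asking about above - only a sketch, the helper name is made up and it leans on the proposed iommu_device_group() interface:

#include <linux/device.h>
#include <linux/iommu.h>

/* Hypothetical helper: refuse devices whose bus has no IOMMU backing
 * (i.e. ones that would end up on SWIOTLB) up front, instead of finding
 * out later when iommu_domain_alloc() returns NULL at group open time. */
static bool vfio_dev_iommu_capable(struct device *dev)
{
	unsigned int groupid;

	/* No iommu_ops registered for this bus -> no isolation possible */
	if (!iommu_present(dev->bus))
		return false;

	/* No group ID means the IOMMU can't tell us how to isolate it */
	return iommu_device_group(dev, &groupid) == 0;
}

vfio_group_add_dev() could call something like this before doing anything else and just return -ENODEV, which would make the failure mode explicit rather than deferred to group open time.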