Add idxd vfio mediated device theory of operation documentation. Provide description on mdev design, usage, and why vfio mdev was chosen. Reviewed-by: Ashok Raj <ashok.raj@xxxxxxxxx> Reviewed-by: Kevin Tian <kevin.tian@xxxxxxxxx> Signed-off-by: Dave Jiang <dave.jiang@xxxxxxxxx> --- Documentation/driver-api/vfio/mdev-idxd.rst | 397 +++++++++++++++++++++++++++ MAINTAINERS | 1 2 files changed, 398 insertions(+) create mode 100644 Documentation/driver-api/vfio/mdev-idxd.rst diff --git a/Documentation/driver-api/vfio/mdev-idxd.rst b/Documentation/driver-api/vfio/mdev-idxd.rst new file mode 100644 index 000000000000..9bf93eafc7c8 --- /dev/null +++ b/Documentation/driver-api/vfio/mdev-idxd.rst @@ -0,0 +1,397 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +IDXD Overview +============= +IDXD (Intel Data Accelerator Driver) is the driver for the Intel Data +Streaming Accelerator (DSA). Intel DSA is a high performance data copy +and transformation accelerator. In addition to data move operations, +the device also supports data fill, CRC generation, Data Integrity Field +(DIF), and memory compare and delta generation. Intel DSA supports +a variety of PCI-SIG defined capabilities such as Address Translation +Services (ATS), Process address Space ID (PASID), Page Request Interface +(PRI), Message Signalled Interrupts Extended (MSI-X), and Advanced Error +Reporting (AER). Some of those capabilities enable the device to support +Shared Virtual Memory (SVM), or also known as Shared Virtual Addressing +(SVA). Intel DSA also supports Intel Scalable I/O Virtualization (SIOV) +to improve scalability of device assignment. + + +The Intel DSA device contains the following basic components: +* Work queue (WQ) + + A WQ is an on device storage to queue descriptors to the + device. Requests are added to a WQ by using new CPU instructions + (MOVDIR64B and ENQCMD(S)) to write the memory mapped “portal” + associated with each WQ. + +* Engine + + Operation unit that pulls descriptors from WQs and processes them. + +* Group + + Abstract container to associate one or more engines with one or more WQs. + + +Two types of WQs are supported: +* Dedicated WQ (DWQ) + + A single client should owns this exclusively and can submit work + to it. The MOVDIR64B instruction is used to submit descriptors to + this type of WQ. The instruction is a posted write, therefore the + submitter must ensure not exceed the WQ length for submission. The + use of PASID is optional with DWQ. Multiple clients can submit to + a DWQ, but sychronization is required due to when the WQ is full, + the submission is silently dropped. + +* Shared WQ (SWQ) + + Multiple clients can submit work to this WQ. The submitter must use + ENQMCDS (from supervisor mode) or ENQCMD (from user mode). These + instructions will indicate via EFLAGS.ZF bit whether a submission + succeeds. The use of PASID is mandatory to identify the address space + of each client. + + +For more information about the new instructions [1][2]. + +The IDXD driver is broken down into following usages: +* In kernel interface through dmaengine subsystem API. +* Userspace DMA support through character device. mmap(2) is utilized + to map directly to mmio address (or portals) for descriptor submission. +* VFIO Mediated device (mdev) supporting device passthrough usages. This + is only for the mdev usage. + + +================================= +Assignable Device Interface (ADI) +================================= +The term ADI is used to represent the minimal unit of assignment for +Intel Scalable IOV device. Each ADI instance refers to the set of device +backend resources that are allocated, configured and organized as an +isolated unit. + +Intel DSA defines each WQ as an ADI. The MMIO registers of each work queue +are partitioned into two categories: +* MMIO registers accessed for data-path operations. +* MMIO registers accessed for control-path operations. + +Data-path MMIO registers of each WQ are contained within +one or more system page size aligned regions and can be mapped in the +CPU page table for direct access from the guest. Control-path MMIO +registers of all WQs are located together but segregated from data-path +MMIO regions. Therefore, guest updates to control-path registers must +be intercepted and then go through the host driver to be reflected in +the device. + +Data-path MMIO registers of DSA WQ are portals for submitting descriptors +to the device. There are four portals per WQ, each being 64 bytes +in size and located on a separate 4KB page in BAR2. Each portal has +different implications regarding interrupt message type (MSI vs. IMS) +and occupancy control (limited vs. unlimited). It is not necessary to +map all portals to the guest. + +Control-path MMIO registers of DSA WQ include global configurations +(shared by all WQs) and WQ-specific configurations. The owner +(e.g. the guest) of the WQ is expected to only change WQ-specific +configurations. Intel DSA spec introduces a “Configuration Support” +capability which, if cleared, indicates that some fields of WQ +configuration registers are read-only and the WQ configuration is +pre-configured by the host. + + +Interrupt Message Store (IMS) +----------------------------- +The ADI utilizes Interrupt Message Store (IMS), a device-specific MSI +implementation, instead of MSIX for interrupts for the guest. This +preserves MSIX for host usages and also allows a significantly larger +number of interrupt vectors for large number of guests usage. + +Intel DSA device implements IMS as on-device memory mapped unified +storage. Each interrupt message is stored as a DWORD size data payload +and a 64-bit address (same as MSI-X). Access to the IMS is through the +host idxd driver. + +The idxd driver makes use of the generic IMS irq chip and domain which +stores the interrupt messages in an array in device memory. Allocation and +freeing of interrupts happens via the generic msi_domain_alloc/free_irqs() +interface. Driver only needs to ensure the interrupt domain is stored in +the underlying device struct. + + +ADI Isolation +------------- +Operations or functioning of one ADI must not affect the functioning +of another ADI or the physical device. Upstream memory requests from +different ADIs are distinguished using a Process Address Space Identifier +(PASID). With the support of PASID-granular address translation in Intel +VT-d, the address space targeted by a request from ADI can be a Host +Virtual Address (HVA), Host I/O Virtual Address (HIOVA), Guest Physical +Address (GPA), Guest Virtual Address (GVA), Guest I/O Virtual Address +(GIOVA), etc. The PASID identity for an ADI is expected to be accessed +or modified by privileged software through the host driver. + +========================= +Virtual DSA (vDSA) Device +========================= +The DSA WQ itself is not a PCI device thus must be composed into a +virtual DSA device to the guest. + +The composition logic needs to handle four main requirements: +* Emulate PCI config space. +* Map data-path portals for direct access from the guest. +* Emulate control-path MMIO registers and selectively forward WQ + configuration requests through host driver to the device. +* Forward and emulate WQ interrupts to the guest. + +The composition logic tells the guest aspects of WQ which are configurable +through a combination of capability fields, e.g.: +* Configuration Support (if cleared, most aspects are not modifiable). +* WQ Mode Support (if cleared, cannot change between dedicated and + shared mode). +* Dedicated Mode Support. +* Shared Mode Support. +* ... + +The virtual capability fields are set according to the vDSA +type. Following is an example of vDSA types and related WQ configurability: +* Type ‘1dwq-v1’ + * One DSA gen1 dedicated WQ to this guest + * Guest cannot share the WQ between its clients (no guest SVA) + * Guest cannot change any WQ configuration + +Besides, the composition logic also needs to serve administrative commands +(thru virtual CMD register) through host driver, including: +* Drain/abort all descriptors submitted by this guest. +* Drain/abort descriptors associated with a PASID. +* Enable/disable/reset the WQ (when it’s not shared by multiple VMs). +* Request interrupt handle. + +With this design, vDSA emulation is **greatly simplified**. Most +registers are emulated in simple READ-ONLY flavor, and handling limited +configurability is required only for a few registers. + +=========================== +VFIO mdev vs. userspace DMA +=========================== +There are two avenues to support vDSA composition. +1. VFIO mediated device (mdev) +2. Userspace DMA through char device + +VFIO mdev provides a generic subdevice passthrough framework. Unified +uAPIs are used for both device and subdevice passthrough, thus any +userspace VMM which already supports VFIO device passthrough would +naturally support mdev/subdevice passthrough. The implication of VFIO +mdev is putting emulation of device interface in the kernel (part of +host driver) which must be carefully scrutinized. Fortunately, vDSA +composition includes only a small portion of emulation code, due to the +fact that most registers are simply READ-ONLY to the guest. The majority +logic of handling limited configurability and administrative commands +is anyway required to sit in the kernel, regardless of which kernel uAPI +is pursued. In this regard, VFIO mdev is a nice fit for vDSA composition. + +IDXD driver provides a char device interface for applications to +map the WQ portal and directly submit descriptors to do DMA. This +interface provides only data-path access to userspace and relies on +the host driver to handle control-path configurations. Expanding such +interface to support subdevice passthrough allows moving the emulation +code to userspace. However, quite some work is required to grow it from +an application-oriented interface into a passthrough-oriented interface: +new uAPIs to handle guest WQ configurability and administrative commands, +and new uAPIs to handle passthrough specific requirements (e.g. DMA map, +guest SVA, live migration, posted interrupt, etc.). And once it is done, +every userspace VMM has to explicitly bind to IDXD specific uAPI, even +though the real user is in the guest (instead of the VMM itself) in the +passthrough scenario. + +Although some generalization might be possible to reduce the work of +handling passthrough, we feel the difference between userspace DMA +and subdevice passthrough is distinct in IDXD. Therefore, we choose to +build vDSA composition on top of VFIO mdev framework and leave userspace +DMA intact after discussion at LPC 2020. + +============================= +Host Registration and Release +============================= + +Intel DSA reports support for Intel Scalable IOV via a PCI Express +Designated Vendor Specific Extended Capability (DVSEC). In addition, +PASID-granular address translation capability is required in the +IOMMU. During host initialization, the IDXD driver should check the +presence of both capabilities before calling mdev_register_device() +to register with the VFIO mdev framework and provide a set of ops +(struct mdev_parent_ops). The IOMMU capability is indicated by the +IOMMU_DEV_FEAT_AUX feature flag with iommu_dev_has_feature() and enabled +with iommu_dev_enable_feature(). + +On release, iommu_dev_disable_feature() is called after +mdev_unregister_device() to disable the IOMMU_DEV_FEAT_AUX flag that +the driver enabled during host initialization. + +The mdev_parent_ops data structure is filled out by the driver to provide +a number of ops called by VFIO mdev framework:: + + struct mdev_parent_ops { + .supported_type_groups + .create + .remove + .open + .release + .read + .write + .mmap + .ioctl + }; + +Supported_type_groups +--------------------- +At the moment only one vDSA type is supported. + +“1dwq-v1”: + Single dedicated WQ (DSA 1.0) with read-only configuration exposed to + the guest. On the guest kernel, a vDSA device shows up with a single + WQ that is pre-configured by the host. The configuration for the WQ + is entirely read-only and cannot be reconfigured. There is no support + of guest SVA on this WQ. + + The interrupt vector 0 is emulated by the host driver to support the admin + command completion and error reporting. A second interrupt vector is + bound to the IMS and used for I/O operation. In this implementation, + there are only two vectors being supported. + +create +------ +API function to create the mdev. mdev_set_iommu_device() is called to +associate the mdev device to the parent PCI device. This function is +where the driver sets up and initializes the resources to support a single +mdev device. This is triggered through sysfs to initiate the creation. + +remove +------ +API function that mirrors the create() function and releases all the +resources backing the mdev. This is also triggered through sysfs. + +open +---- +API function that is called down from VFIO userspace to indicate to the +driver that the upper layers are ready to claim and utilize the mdev. IMS +entries are allocated and setup here. + +release +------- +The mirror function to open that releases the mdev by VFIO userspace. + +read / write +------------ +This is where the Intel IDXD driver provides read/write emulation of +PCI config space and MMIO registers. These paths are the “slow” path +of the mediated device and emulation is used rather than direct access +to the hardware resources. Typically configuration and administrative +commands go through this path. This allows the mdev to show up as a +virtual PCI device on the guest kernel. + +The emulation of PCI config space is nothing special, which is simply +copied from kvmgt. In the future this part might be consolidated to +reduce duplication. + +Emulating MMIO reads are simply memory copies. There is no side-effect +to be emulated upon guest read. + +Emulating MMIO writes are required only for a few registers, due to +read-only configuration on the ‘1dwq-v1’ type. Majority of composition +logic is hooked in the CMD register for performing administrative commands +such as WQ drain, abort, enable, disable and reset operations. The rest of +the emulation is about handling errors (GENCTRL/SWERROR) and interrupts +(INTCAUSE/MSIXPERM) on the vDSA device. Future mdev types might allow +limited WQ configurability, which then requires additional emulation of +the WQCFG register. + +mmap +---- +This is the function that provides the setup to expose a portion of the +hardware, also known as portals, for direct access for “fast” path +operations through the mmap() syscall. A limited region of the hardware +is mapped to the guest for direct I/O submission. + +There are four portals per WQ: unlimited MSI-X, limited MSI-X, unlimited +IMS, limited IMS. Descriptors submitted to limited portals are subject +to threshold configuration limitations for shared WQs. The MSI-X portals +are used for host submissions, and the IMS portals are mapped to vm for +guest submission. + +ioctl +----- +This API function does several things +* Provides general device information to VFIO userspace. +* Provides device region information (PCI, mmio, etc). +* Get interrupts information +* Setup interrupts for the mediated device. +* Mdev device reset + +For the Intel idxd driver, Interrupt Message Store (IMS) vectors are being +used for mdev interrupts rather than MSIX vectors. IMS provides additional +interrupt vectors outside of PCI MSIX specification in order to support +significantly more vectors. The emulated interrupt (0) is connected through +kernel eventfd. When interrupt 0 needs to be asserted, the driver will +signal the eventfd to trigger vector 0 interrupt on the guest. +The IMS interrupts are setup via eventfd as well. However, it utilizes +irq bypass manager to directly inject the interrupt in the guest. + +To allocate IMS, we utilize the IMS array APIs. On host init, we need +to create the MSI domain:: + + struct ims_array_info ims_info; + struct device *dev = &pci_dev->dev; + + + /* assign the device IMS size */ + ims_info.max_slots = max_ims_size; + /* assign the MMIO base address for the IMS table */ + ims_info.slots = mmio_base + ims_offset; + /* assign the MSI domain to the device */ + dev->msi_domain = pci_ims_array_create_msi_irq_domain(pci_dev, &ims_info); + +When we are ready to allocate the interrupts:: + + struct device *dev = mdev_dev(mdev); + + irq_domain = pci_dev->dev.msi_domain; + /* the irqs are allocated against device of mdev */ + rc = msi_domain_alloc_irqs(irq_domain, dev, num_vecs); + + + /* we can retrieve the slot index from msi_entry */ + for_each_msi_entry(entry, dev) { + slot_index = entry->device_msi.hwirq; + irq = entry->irq; + } + + request_irq(irq, interrupt_handler_function, 0, “ims”, context); + + +The DSA device is structured such that MSI-X table entry 0 is used for +admin commands completion, error reporting, and other misc commands. The +remaining MSI-X table entries are used for WQ completion. For vm support, +the virtual device also presents a similar layout. Therefore, vector 0 +is emulated by the software. Additional vector(s) are associated with IMS. + +The index (slot) for the per device IMS entry is managed by the MSI +core. The index is the “interrupt handle” that the guest kernel +needs to program into a DMA descriptor. That interrupt handle tells the +hardware which IMS vector to trigger the interrupt on for the host. + +The virtual device presents an admin command called “request interrupt +handle” that is not supported by the physical device. On probe of +the DSA device on the guest kernel, the guest driver will issue the +“request interrupt handle” command in order to get the interrupt +handle for descriptor programming. The host driver will return the +assigned slot for the IMS entry table to the guest. + +========== +References +========== +[1] https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html +[2] https://software.intel.com/en-us/articles/intel-sdm +[3] https://software.intel.com/sites/default/files/managed/cc/0e/intel-scalable-io-virtualization-technical-specification.pdf +[4] https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification diff --git a/MAINTAINERS b/MAINTAINERS index c2114daa6bc7..ae34b0331eb4 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -8970,6 +8970,7 @@ INTEL IADX DRIVER M: Dave Jiang <dave.jiang@xxxxxxxxx> L: dmaengine@xxxxxxxxxxxxxxx S: Supported +F: Documentation/driver-api/vfio/mdev-idxd.rst F: drivers/dma/idxd/* F: include/uapi/linux/idxd.h