On Fri, Apr 07, 2017 at 08:17:47PM +0100, Jean-Philippe Brucker wrote:
> Here I propose a few ideas for extensions and optimizations. This is all
> very exploratory, feel free to correct mistakes and suggest more things.
>
>     I.   Linux host
>         1. vhost-iommu

A QEMU-based implementation would be a first step. It would allow
validating the claim that it's much simpler to support than e.g. VT-d.

>         2. VFIO nested translation
>     II.  Page table sharing
>         1. Sharing IOMMU page tables
>         2. Sharing MMU page tables (SVM)
>         3. Fault reporting
>         4. Host implementation with VFIO
>     III. Relaxed operations
>     IV.  Misc
>
>
>   I. Linux host
>   =============
>
>   1. vhost-iommu
>   --------------
>
> An advantage of virtualizing an IOMMU using virtio is that it allows to
> hoist a lot of the emulation code into the kernel using vhost, and avoid
> returning to userspace for each request. The mainline kernel already
> implements vhost-net, vhost-scsi and vhost-vsock, and a lot of core code
> could be reused.
>
> Introducing vhost in a simplified scenario 1 (removed guest userspace
> pass-through, irrelevant to this example) gives us the following:
>
>   MEM____pIOMMU________PCI device____________                    HARDWARE
>             |                                \
>   ----------|-------------+-------------+-----\--------------------------
>             |             :     KVM     :      \
>        pIOMMU drv         :             :       \                KERNEL
>             |             :             :     net drv
>           VFIO            :             :       /
>             |             :             :      /
>        vhost-iommu_________________________virtio-iommu-drv
>                           :             :
>   --------------------------------------+-------------------------------
>                  HOST                   :             GUEST
>
>
> Introducing vhost in scenario 2, userspace now only handles the device
> initialisation part, and most runtime communication is handled in kernel:
>
>   MEM__pIOMMU___PCI device                                     HARDWARE
>          |         |
>   -------|---------|------+-------------+-------------------------------
>          |         |      :     KVM     :
>     pIOMMU drv     |      :             :                         KERNEL
>              \__net drv   :             :
>                    |      :             :
>                   tap     :             :
>                    |      :             :
>               _vhost-net________________________virtio-net drv
>          (2) /            :             :           / (1a)
>             /             :             :          /
>    vhost-iommu________________________________virtio-iommu drv
>                           :             :
>                                                           (1b)
>   ------------------------+-------------+-------------------------------
>                  HOST                   :             GUEST
>
> (1) a. Guest virtio driver maps ring and buffers
>     b. Map requests are relayed to the host the same way.
> (2) To access any guest memory, vhost-net must query the IOMMU. We can
>     reuse the existing TLB protocol for this. TLB commands are written to
>     and read from the vhost-net fd.
>
> As defined in Linux/include/uapi/linux/vhost.h, the vhost msg structure
> has everything needed for map/unmap operations:
>
>     struct vhost_iotlb_msg {
>         __u64 iova;
>         __u64 size;
>         __u64 uaddr;
>         __u8 perm; /* R/W */
>         __u8 type;
>     #define VHOST_IOTLB_MISS
>     #define VHOST_IOTLB_UPDATE      /* MAP */
>     #define VHOST_IOTLB_INVALIDATE  /* UNMAP */
>     #define VHOST_IOTLB_ACCESS_FAIL
>     };
>
>     struct vhost_msg {
>         int type;
>         union {
>             struct vhost_iotlb_msg iotlb;
>             __u8 padding[64];
>         };
>     };
>
> The vhost-iommu device associates a virtual device ID to a TLB fd. We
> should be able to use the same commands for [vhost-net <-> virtio-iommu]
> and [virtio-net <-> vhost-iommu] communication. A virtio-net device
> would open a socketpair and hand one side to vhost-iommu.
>
> If vhost_msg is ever used for another purpose than TLB, we'll have some
> trouble, as there will be multiple clients that want to read/write the
> vhost fd. A multicast transport method will be needed. Until then, this
> can work.
>
> Details of operations would be:
>
> (1) Userspace sets up vhost-iommu as with other vhost devices, by using
>     standard vhost ioctls.
>     Userspace starts by describing the system topology via ioctl:
>
>         ioctl(iommu_fd, VHOST_IOMMU_ADD_DEVICE, struct
>               vhost_iommu_add_device)
>
>         #define VHOST_IOMMU_DEVICE_TYPE_VFIO
>         #define VHOST_IOMMU_DEVICE_TYPE_TLB
>
>         struct vhost_iommu_add_device {
>             __u8 type;
>             __u32 devid;
>             union {
>                 struct vhost_iommu_device_vfio {
>                     int vfio_group_fd;
>                 };
>                 struct vhost_iommu_device_tlb {
>                     int fd;
>                 };
>             };
>         };
>
> (2) VIRTIO_IOMMU_T_ATTACH(address space, devid)
>
> vhost-iommu creates an address space if necessary, finds the device along
> with the relevant operations. If type is VFIO, operations are done on a
> container, otherwise they are done on single devices.
>
> (3) VIRTIO_IOMMU_T_MAP(address space, virt, phys, size, flags)
>
> Turn phys into an hva using the vhost mem table.
>
> - If type is TLB, either preload with VHOST_IOTLB_UPDATE or store the
>   mapping locally and wait for the TLB to ask for it with a
>   VHOST_IOTLB_MISS.
> - If type is VFIO, turn it into a VFIO_IOMMU_MAP_DMA (might need to
>   introduce a shortcut in the external user API of VFIO).
>
> (4) VIRTIO_IOMMU_T_UNMAP(address space, virt, phys, size, flags)
>
> - If type is TLB, send a VHOST_IOTLB_INVALIDATE.
> - If type is VFIO, turn it into VFIO_IOMMU_UNMAP_DMA.
>
> (5) VIRTIO_IOMMU_T_DETACH(address space, devid)
>
> Undo whatever was done in (2).
>
>
>   2. VFIO nested translation
>   --------------------------
>
> For my current kvmtool implementation, I am putting each VFIO group in a
> different container during initialization. We cannot detach a group from a
> container at runtime without first resetting all devices in that group. So
> the best way to provide dynamic address spaces right now is one container
> per group. The drawback is that we need to maintain multiple sets of page
> tables even if the guest wants to put all devices in the same address
> space.
> Another disadvantage is when implementing bypass mode, we need to
> map the whole address space at the beginning, then unmap everything on
> attach. Adding nested support would be a nice way to provide dynamic
> address spaces while keeping groups tied to a container at all times.
>
> A physical IOMMU may offer nested translation. In this case, address
> spaces are managed by two page directories instead of one. A guest-
> virtual address is translated into a guest-physical one using what we'll
> call here "stage-1" (s1) page tables, and the guest-physical address is
> translated into a host-physical one using "stage-2" (s2) page tables.
>
>            s1      s2
>        GVA --> GPA --> HPA
>
> There isn't a lot of support in Linux for nesting IOMMU page directories
> at the moment (though SVM support is coming, see II). VFIO does have a
> "nesting" IOMMU type, which doesn't mean much at the moment. The ARM SMMU
> code uses this to decide whether to manage the container with s2 page
> tables instead of s1, but even then we still only have a single stage and
> it is assumed that IOVA=GPA.
>
> Another model that would help with dynamically changing address spaces is
> nesting VFIO containers:
>
>                            Parent  <---------- map/unmap
>                           container
>                          /   |     \
>                         /   group   \
>                      Child          Child  <--- map/unmap
>                    container      container
>                     |   |             |
>                  group group        group
>
> At the beginning all groups are attached to the parent container, and
> there is no child container. Doing map/unmap on the parent container maps
> stage-2 page tables (map GPA -> HVA and pin the page -> HPA). User should
> be able to choose whether they want all devices attached to this container
> to be able to access GPAs (bypass mode, as it currently is) or simply
> block all DMA (in which case there is no need to pin pages here).
>
> At some point the guest wants to create an address space and attaches
> children to it. Using an ioctl (to be defined), we can derive a child
> container from the parent container, and move groups from parent to child.
>
> This returns a child fd. When the guest maps something in this new address
> space, we can do a map ioctl on the child container, which maps stage-1
> page tables (map GVA -> GPA).
>
> A page table walk may access multiple levels of tables (pgd, p4d, pud,
> pmd, pt). With nested translation, each access to a table during the
> stage-1 walk requires a stage-2 walk. This makes a full translation costly
> so it is preferable to use a single stage of translation when possible.
> Folding two stages into one is simple with a single container, as shown in
> the kvmtool example. The host keeps track of GPA->HVA mappings, so it can
> fold the full GVA->HVA mapping before sending the VFIO request. With
> nested containers however, the IOMMU driver would have to do the folding
> work itself. Keeping a copy of stage-2 mapping created on the parent
> container, it would fold them into the actual stage-2 page tables when
> receiving a map request on the child container (note that software folding
> is not possible when stage-1 pgd is managed by the guest, as described in
> next section).
>
> I don't know if nested VFIO containers are a desirable feature at all. I
> find the concept cute on paper, and it would make it easier for userspace
> to juggle with address spaces, but it might require some invasive changes
> in VFIO, and people have been able to use the current API for IOMMU
> virtualization so far.
>
>
>   II. Page table sharing
>   ======================
>
>   1. Sharing IOMMU page tables
>   ----------------------------
>
> VIRTIO_IOMMU_F_PT_SHARING
>
> This is independent of the nested mode described in I.2, but relies on a
> similar feature in the physical IOMMU: having two stages of page tables,
> one for the host and one for the guest.
>
> When this is supported, the guest can manage its own s1 page directory, to
> avoid sending MAP/UNMAP requests.
> Feature VIRTIO_IOMMU_F_PT_SHARING allows
> a driver to give a page directory pointer (pgd) to the host and send
> invalidations when removing or changing a mapping. In this mode, three
> requests are used: probe, attach and invalidate. An address space cannot
> be using the MAP/UNMAP interface and PT_SHARING at the same time.
>
> Device and driver first need to negotiate which page table format they
> will be using. This depends on the physical IOMMU, so the request contains
> a negotiation part to probe the device capabilities.
>
> (1) Driver attaches devices to address spaces as usual, but a flag
>     VIRTIO_IOMMU_ATTACH_F_PRIVATE (working title) tells the device not to
>     create page tables for use with the MAP/UNMAP API. The driver intends
>     to manage the address space itself.
>
> (2) Driver sends a PROBE_TABLE request. It sets len > 0 with the size of
>     pg_format array.
>
>     VIRTIO_IOMMU_T_PROBE_TABLE
>
>     struct virtio_iommu_req_probe_table {
>         le32 address_space;
>         le32 flags;
>         le32 len;
>
>         le32 nr_contexts;
>         struct {
>             le32 model;
>             u8 format[64];
>         } pg_format[len];
>     };
>
> Introducing a probe request is more flexible than advertising those
> features in virtio config, because capabilities are dynamic, and depend on
> which devices are attached to an address space. Within a single address
> space, devices may support different numbers of contexts (PASIDs), and
> some may not support recoverable faults.
>
> (3) Device responds success with all page table formats implemented by the
>     physical IOMMU in pg_format. 'model' 0 is invalid, so driver can
>     initialize the array to 0 and deduce from there which entries have
>     been filled by the device.
>
> Using a probe method seems preferable over trying to attach every possible
> format until one sticks.
> For instance, with an ARM guest running on an x86
> host, PROBE_TABLE would return the Intel IOMMU page table format, and the
> guest could use that page table code to handle its mappings, hidden behind
> the IOMMU API. This requires that the page-table code is reasonably
> abstracted from the architecture, as is done with drivers/iommu/io-pgtable
> (an x86 guest could use any format implemented by io-pgtable for example.)
>
> (4) If the driver is able to use this format, it sends the ATTACH_TABLE
>     request.
>
>     VIRTIO_IOMMU_T_ATTACH_TABLE
>
>     struct virtio_iommu_req_attach_table {
>         le32 address_space;
>         le32 flags;
>         le64 table;
>
>         le32 nr_contexts;
>         /* Page-table format description */
>
>         le32 model;
>         u8 config[64];
>     };
>
>
>     'table' is a pointer to the page directory. 'nr_contexts' isn't used
>     here.
>
>     For both ATTACH and PROBE, 'flags' are the following (and will be
>     explained later):
>
>     VIRTIO_IOMMU_ATTACH_TABLE_F_INDIRECT (1 << 0)
>     VIRTIO_IOMMU_ATTACH_TABLE_F_NATIVE   (1 << 1)
>     VIRTIO_IOMMU_ATTACH_TABLE_F_FAULT    (1 << 2)
>
> Now 'model' is a bit tricky. We need to specify all possible page table
> formats and their parameters. I'm not well-versed in x86, s390 or other
> IOMMUs, so I'll just focus on the ARM world for this example. We basically
> have two page table models, with a multitude of configuration bits:
>
> * ARM LPAE
> * ARM short descriptor
>
> We could define a high-level identifier per page-table model, such as:
>
>     #define PG_TABLE_ARM 0x1
>     #define PG_TABLE_X86 0x2
>     ...
>
> And each model would define its own structure. On ARM 'format' could be a
> simple u32 defining a variant, LPAE 32/64 or short descriptor. It could
> also contain additional capabilities. Then depending on the variant,
> 'config' would be:
>
>     struct pg_config_v7s {
>         le32 tcr;
>         le32 prrr;
>         le32 nmrr;
>         le32 asid;
>     };
>
>     struct pg_config_lpae {
>         le64 tcr;
>         le64 mair;
>         le32 asid;
>
>         /* And maybe TTB1?
>          */
>     };
>
>     struct pg_config_arm {
>         le32 variant;
>         union ...;
>     };
>
> I am really uneasy with describing all those nasty architectural details
> in the virtio-iommu specification. We certainly won't start describing the
> content bit-by-bit of tcr or mair here, but just declaring these fields
> might be sufficient.
>
> (5) Once the table is attached, the driver can simply write the page
>     tables and expect the physical IOMMU to observe the mappings without
>     any additional request. When changing or removing a mapping, however,
>     the driver must send an invalidate request.
>
>     VIRTIO_IOMMU_T_INVALIDATE
>
>     struct virtio_iommu_req_invalidate {
>         le32 address_space;
>         le32 context;
>         le32 flags;
>         le64 virt_addr;
>         le64 range_size;
>
>         u8 opaque[64];
>     };
>
>     'flags' may be:
>
>     VIRTIO_IOMMU_INVALIDATE_T_VADDR: invalidate a single VA range
>       from 'context' (context is 0 when !F_INDIRECT).
>
>     And with context tables only (explained below):
>
>     VIRTIO_IOMMU_INVALIDATE_T_SINGLE: invalidate all mappings from
>       'context' (context is 0 when !F_INDIRECT). virt_addr and range_size
>       are ignored.
>
>     VIRTIO_IOMMU_INVALIDATE_T_TABLE: with F_INDIRECT, invalidate entries
>       in the table that changed. Device reads the table again, compares it
>       to previous values, and invalidates all mappings for contexts that
>       changed. context, virt_addr and range_size are ignored.
>
> IOMMUs may offer hints and quirks in their invalidation packets. The
> opaque structure in invalidate would allow to transport those. This
> depends on the page table format and as with architectural page-table
> definitions, I really don't want to have those details in the spec itself.
>
>
>   2. Sharing MMU page tables
>   --------------------------
>
> The guest can share process page-tables with the physical IOMMU. To do
> that, it sends PROBE_TABLE with (F_INDIRECT | F_NATIVE | F_FAULT).
> The
> page table format is implicit, so the pg_format array can be empty (unless
> the guest wants to query some specific property, e.g. number of levels
> supported by the pIOMMU?). If the host answers with success, guest can
> send its MMU page table details with ATTACH_TABLE and (F_NATIVE |
> F_INDIRECT | F_FAULT) flags.
>
> F_FAULT means that the host communicates page requests from device to the
> guest, and the guest can handle them by mapping virtual address in the
> fault to pages. It is only available with VIRTIO_IOMMU_F_FAULT_QUEUE (see
> below.)
>
> F_NATIVE means that the pIOMMU pgtable format is the same as guest MMU
> pgtable format.
>
> F_INDIRECT means that 'table' pointer is a context table, instead of a
> page directory. Each slot in the context table points to a page directory:
>
>                        64               2 1 0
>           table ----> +---------------------+
>                       |       pgd       |0|1|<--- context 0
>                       |       ---       |0|0|<--- context 1
>                       |       pgd       |0|1|
>                       |       ---       |0|0|
>                       |       ---       |0|0|
>                       +---------------------+
>                                          | \___Entry is valid
>                                          |______reserved
>
> Question: do we want per-context page table format, or can it stay global
> for the whole indirect table?
>
> Having a context table allows to provide multiple address spaces for a
> single device. In the simplest form, without F_INDIRECT we have a single
> address space per device, but some devices may implement more, for
> instance devices with the PCI PASID extension.
>
> A slot's position in the context table gives an ID, between 0 and
> nr_contexts. The guest can use this ID to have the device target a
> specific address space with DMA. The mechanism to do that is
> device-specific. For a PCI device, the ID is a PASID, and PCI doesn't
> define a specific way of using them for DMA, it's the device driver's
> concern.
>
>
>   3. Fault reporting
>   ------------------
>
> VIRTIO_IOMMU_F_EVENT_QUEUE
>
> With this feature, an event virtqueue (1) is available.
> For now it will
> only be used for fault handling, but I'm calling it eventq so that other
> asynchronous features can piggy-back on it. Device may report faults and
> page requests by sending buffers via the used ring.
>
>     #define VIRTIO_IOMMU_T_FAULT 0x05
>
>     struct virtio_iommu_evt_fault {
>         struct virtio_iommu_evt_head {
>             u8 type;
>             u8 reserved[3];
>         };
>
>         u32 address_space;
>         u32 context;
>
>         u64 vaddr;
>         u32 flags; /* Access details: R/W/X */
>
>         /* In the reply: */
>         u32 reply; /* Fault handled, or failure */
>         u64 paddr;
>     };
>
> Driver must send the reply via the request queue, with the fault status
> in 'reply', and the mapped page in 'paddr' on success.
>
> Existing fault handling interfaces such as PRI have a tag (PRG) allowing
> to identify a page request (or group thereof) when sending a reply. I
> wonder if this would be useful to us, but it seems like the
> (address_space, context, vaddr) tuple is sufficient to identify a page
> fault, provided the device doesn't send duplicate faults. Duplicate faults
> could be required if they have a side effect, for instance implementing a
> poor man's doorbell. If this is desirable, we could add a fault_id field.
>
>
>   4. Host implementation with VFIO
>   --------------------------------
>
> The VFIO interface for sharing page tables is being worked on at the
> moment by Intel. Other virtual IOMMU implementations will most likely let
> guest manage full context tables (PASID tables) themselves, giving the
> context table pointer to the pIOMMU via a VFIO ioctl.
>
> For the architecture-agnostic virtio-iommu however, we shouldn't have to
> implement all possible formats of context table (they are at least
> different between ARM SMMU and Intel IOMMU, and will certainly be extended
> in future physical IOMMU architectures.) In addition, most users might
> only care about having one page directory per device, as SVM is a luxury
> at the moment and few devices support it.
> For these reasons, we should
> allow to pass single page directories via VFIO, using very similar
> structures as described above, whilst reusing the VFIO channel developed
> for Intel vIOMMU.
>
>     * VFIO_SVM_INFO: probe page table formats
>     * VFIO_SVM_BIND: set pgd and arch-specific configuration
>
> There is an inconvenience with letting the pIOMMU driver manage the guest's
> context table. During a page table walk, the pIOMMU translates the context
> table pointer using the stage-2 page tables. The context table must
> therefore be mapped in guest-physical space by the pIOMMU driver. One
> solution is to let the pIOMMU driver reserve some GPA space upfront using
> the iommu and sysfs resv API [1]. The host would then carve that region
> out of the guest-physical space using a firmware mechanism (for example DT
> reserved-memory node).
>
>
>   III. Relaxed operations
>   =======================
>
> VIRTIO_IOMMU_F_RELAXED
>
> Adding an IOMMU dramatically reduces performance of a device, because
> map/unmap operations are costly and produce a lot of TLB traffic. For
> significant performance improvements, device might allow the driver to
> sacrifice safety for speed. In this mode, the driver does not need to send
> UNMAP requests. The semantics of MAP change and are more complex to
> implement. Given a MAP([start:end] -> phys, flags) request:
>
> (1) If [start:end] isn't mapped, request succeeds as usual.
> (2) If [start:end] overlaps an existing mapping [old_start:old_end], we
>     unmap [max(start, old_start):min(end, old_end)] and replace it with
>     [start:end].
> (3) If [start:end] overlaps an existing mapping that matches the new map
>     request exactly (same flags, same phys address), the old mapping is
>     kept.
>
> This squashing could be performed by the guest. The driver can catch unmap
> requests from the DMA layer, and only relay map requests for (1) and (2).
> A MAP request is therefore able to split and partially override an
> existing mapping, which isn't allowed in non-relaxed mode. UNMAP requests
> are unnecessary, but are now allowed to split or carve holes in mappings.
>
> In this model, a MAP request may take longer, but we may have a net gain
> by removing a lot of redundant requests. Squashing series of map/unmap
> performed by the guest for the same mapping improves temporal reuse of
> IOVA mappings, which I can observe by simply dumping IOMMU activity of a
> virtio device. It reduces the number of TLB invalidations to the strict
> minimum while keeping correctness of DMA operations (provided the device
> obeys its driver). There is a good read on the subject of optimistic
> teardown in paper [2].
>
> This model is completely unsafe. A stale DMA transaction might access a
> page long after the device driver in the guest unmapped it and
> decommissioned the page. The DMA transaction might hit into a completely
> different part of the system that is now reusing the page. Existing
> relaxed implementations attempt to mitigate the risk by setting a timeout
> on the teardown. Unmap requests from device drivers are not discarded
> entirely, but buffered and sent at a later time. Paper [2] reports good
> results with a 10ms delay.
>
> We could add a way for device and driver to negotiate a vulnerability
> window to mitigate the risk of DMA attacks. Driver might not accept a
> window at all, since it requires more infrastructure to keep delayed
> mappings. In my opinion, it should be made clear that regardless of the
> duration of this window, any driver accepting F_RELAXED feature makes the
> guest completely vulnerable, and the choice boils down to either isolation
> or speed, not a bit of both.
>
>
>   IV. Misc
>   ========
>
> I think we have enough to go on for a while.
> To improve MAP throughput, I
> considered adding a MAP_SG request depending on a feature bit, with
> variable size:
>
>     struct virtio_iommu_req_map_sg {
>         struct virtio_iommu_req_head;
>         u32 address_space;
>         u32 nr_elems;
>         u64 virt_addr;
>         u64 size;
>         u64 phys_addr[nr_elems];
>     };
>
> Would create the following mappings:
>
>     virt_addr            -> phys_addr[0]
>     virt_addr + size     -> phys_addr[1]
>     virt_addr + 2 * size -> phys_addr[2]
>     ...
>
> This would avoid the overhead of multiple map commands. We could try to
> find a more cunning format to compress virtually-contiguous mappings with
> different (phys, size) pairs as well. But Linux drivers rarely prefer
> map_sg() functions over regular map(), so I don't know if the whole map_sg
> feature is worth the effort. All we would gain is a few bytes anyway.
>
> My current map_sg implementation in the virtio-iommu driver adds a batch
> of map requests to the queue and kicks the host once. That might be enough
> of an optimization.
>
>
> Another invasive optimization would be adding grouped requests. By adding
> two flags in the header, L and G, we can group sequences of requests
> together, and have one status at the end, either 0 if all requests in the
> group succeeded, or the status of the first request that failed. This is
> all in-order. Requests in a group follow each other, there is no sequence
> identifier.
>
>                            ___ L: request is last in the group
>                           /  _ G: request is part of a group
>                          |  /
>                          v v
>      31                  9 8 7      0
>     +--------------------------------+ <------- RO descriptor
>     |        res0       |0|1|  type  |
>     +--------------------------------+
>     |            payload             |
>     +--------------------------------+
>     |        res0       |0|1|  type  |
>     +--------------------------------+
>     |            payload             |
>     +--------------------------------+
>     |        res0       |0|1|  type  |
>     +--------------------------------+
>     |            payload             |
>     +--------------------------------+
>     |        res0       |1|1|  type  |
>     +--------------------------------+
>     |            payload             |
>     +--------------------------------+ <------- WO descriptor
>     |          res0         | status |
>     +--------------------------------+
>
> This adds some complexity on the device, since it must unroll whatever was
> done by successful requests in a group as soon as one fails, and reject
> all subsequent ones. A group of requests is an atomic operation. As with
> map_sg, this change mostly allows to save space and virtio descriptors.
>
>
> [1] https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-iommu_groups
> [2] vIOMMU: Efficient IOMMU Emulation
>     N. Amit, M. Ben-Yehuda, D. Tsafrir, A. Schuster
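
To make the grouped-request semantics above a bit more concrete, here is a
rough C sketch of how a device might process one group: apply requests in
order, stop at the first failure, undo whatever had succeeded, and report a
single status. The bit positions (type in bits 7:0, G in bit 8, L in bit 9)
are read off the diagram. Every name below (REQ_F_GROUP, REQ_F_LAST,
req_head, process_group, the demo_* callbacks) is made up for illustration
and is not part of the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Header layout as drawn in the diagram; the macro names are hypothetical. */
#define REQ_TYPE_MASK 0xffu
#define REQ_F_GROUP   (1u << 8) /* G: request is part of a group */
#define REQ_F_LAST    (1u << 9) /* L: request is last in the group */

/* Hypothetical helper: build a 32-bit request header word. */
static uint32_t req_head(uint8_t type, int in_group, int is_last)
{
	return (uint32_t)type |
	       (in_group ? REQ_F_GROUP : 0u) |
	       (is_last ? REQ_F_LAST : 0u);
}

/* Stand-ins for the device's real request handlers (assumed, not specified). */
typedef int (*req_apply_fn)(uint32_t head, void *ctx);
typedef void (*req_undo_fn)(uint32_t head, void *ctx);

/*
 * Process the group starting at heads[0]. A header without G is treated as
 * a group of one. If no header has L set, the group runs to the end of the
 * array. Returns 0 if every request succeeded, otherwise the status of the
 * first failing request, after undoing the requests that had succeeded.
 * *consumed receives the number of headers that belong to the group.
 */
static int process_group(const uint32_t *heads, size_t n,
			 req_apply_fn apply, req_undo_fn undo,
			 void *ctx, size_t *consumed)
{
	size_t i, len = 0;
	int status = 0;

	while (len < n) {
		uint32_t h = heads[len++];

		if (!(h & REQ_F_GROUP) || (h & REQ_F_LAST))
			break;
	}
	*consumed = len;

	for (i = 0; i < len; i++) {
		status = apply(heads[i], ctx);
		if (status) {
			/* Roll back in reverse order; later requests are rejected. */
			while (i-- > 0)
				undo(heads[i], ctx);
			break;
		}
	}
	return status;
}

/* Toy callbacks for experimenting: type 0xff always fails with status 1. */
struct demo_state {
	int applied;
};

static int demo_apply(uint32_t head, void *ctx)
{
	struct demo_state *s = ctx;

	if ((head & REQ_TYPE_MASK) == 0xff)
		return 1;
	s->applied++;
	return 0;
}

static void demo_undo(uint32_t head, void *ctx)
{
	struct demo_state *s = ctx;

	(void)head;
	s->applied--;
}
```

A device loop would call process_group() repeatedly, advancing by *consumed
each time, and write the returned status into the WO descriptor that follows
the group.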