Presented herewith is a series that extends IOMMUFD to support IOMMU
hardware dirty bits in the IOPTEs. Today, AMD Milan (which has been out
for a year now) supports it, while ARM SMMUv3.2+ and Intel VT-d rev3.x
are expected to eventually come along. The intended use-case is to
support Live Migration with SR-IOV, with IOMMUs that support it. Yishai
Hadas will soon be submitting an RFC that covers the PCI device dirty
tracker via vfio.

At a quick glance, IOMMUFD lets the userspace VMM create an IOAS with a
set of IOVA ranges mapped to some physical memory, composing an IO
pagetable. This is then attached to a particular device, consequently
creating the protection domain that shares a common IO page table
representing the endpoint DMA-addressable guest address space.
(Hopefully I am not twisting the terminology here.) The resultant object
is a hw_pagetable object which represents the iommu_domain object that
will be directly manipulated. For more background on IOMMUFD have a look
at these two series[0][1] on the kernel and qemu consumption
respectively.

The IOMMUFD UAPI, kAPI and the iommu core kAPI are then extended to
provide:

1) Enabling or disabling dirty tracking on the iommu_domain. This is
modeled after the most common case of changing hardware protection
domain control bits, plus the ARM-specific case of having to enable the
per-PTE DBM control bit. The 'real' tracking of whether dirty tracking
is enabled or not is stored in the vendor IOMMU, hence no new fields are
added to iommufd pagetable structures.

2) Reading the I/O PTEs and marshalling their dirtiness into a bitmap.
The bitmap thus describes the IOVAs that got written by the device.
While performing the marshalling, vendors also need to clear the dirty
bits from the IOPTEs and allow the kAPI caller to batch the much-needed
IOTLB flush. There is no copy of bitmaps to userspace-backed memory;
everything is zerocopy based. So far this is a test-and-clear kind of
interface, given that the IOPT walk is going to be expensive. It
occurred to me to separate the readout of dirty bits from the clearing
of dirty bits in the IOPTEs, but I haven't opted for that, given that it
would mean two lengthy IOPTE walks and felt counter-performant.

3) Unmapping an IOVA range while returning its dirty bits prior to
unmap. This case is specific to the non-nested vIOMMU case where an
erroneous guest (or device) DMAs to an address that is being unmapped at
the same time.

[See the general remarks at the end too, specifically the one regarding
probing dirty tracking via a dedicated iommufd cap ioctl.]

The series is organized as follows:

* Patches 1-3: Take care of the iommu domain operations to be added, and
extend the iommufd io-pagetable to set/clear dirty tracking, as well as
to read the dirty bits from the vendor pagetables. The idea is to keep
IOMMU vendors away from any notion of how bitmaps are stored or
propagated back to the caller, while allowing control/batching over the
IOTLB flush. So there's a data structure and a helper that only tell the
upper layer that an IOVA range got dirty. IOMMUFD carries the logic to
pin pages, walk the bitmap user memory, and kmap it as needed; the IOMMU
vendor implementation just works against a 'dirty bitmap state' and
records an IOVA as dirty. A rough sketch of the shape of these ops is
shown below.

* Patches 4-5: Add the new unmap domain op that returns whether the IOVA
got dirtied. I separated this from the rest of the set, as I am still
questioning the need for this API and whether this race fundamentally
needs to be handled.
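To make the above concrete, the shape of the ops being added is roughly
as follows; the dirty-bitmap structure/helper names and the exact
signatures in this sketch are approximations of mine, see patches 1-5
for the actual definitions:

/*
 * Illustrative sketch only -- approximate names and signatures, see
 * patches 1-5 for the real definitions.
 */
struct iommu_dirty_bitmap;	/* opaque to drivers; iommufd owns storage */

/* Driver-side helper: report that [iova, iova + length) got written. */
void iommu_dirty_bitmap_record(struct iommu_dirty_bitmap *dirty,
			       unsigned long iova, unsigned long length);

struct iommu_domain_ops {
	/* ... existing ops ... */

	/* 1) Toggle dirty tracking in the protection domain / IOPTEs */
	int (*set_dirty_tracking)(struct iommu_domain *domain, bool enable);

	/*
	 * 2) Walk the IOPTEs in [iova, iova + size), record the dirty
	 * ones into @dirty and clear their dirty bit, leaving the IOTLB
	 * flush to be batched by the caller.
	 */
	int (*read_and_clear_dirty)(struct iommu_domain *domain,
				    unsigned long iova, size_t size,
				    struct iommu_dirty_bitmap *dirty);

	/* 3) Unmap variant that also reports which IOVAs got dirtied */
	size_t (*unmap_read_dirty)(struct iommu_domain *domain,
				   unsigned long iova, size_t size,
				   struct iommu_iotlb_gather *iotlb_gather,
				   struct iommu_dirty_bitmap *dirty);
};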
I guess the thinking is that live migration should be guest-foolproof,
but it is unclear how often the race happens in practice to deem this a
necessary unmap variant. Perhaps it might be enough to fetch the dirty
bits prior to the unmap? Feedback appreciated.

* Patches 6-8: Add the UAPIs for IOMMUFD, vfio-compat and the selftests.
We should discuss whether to include the vfio-compat part or not, given
that vfio-type1-iommu perpetually dirties any IOVA while here I am
replacing that with the IOMMU hw support. I haven't implemented the
perpetual dirtying given its lack of usefulness over an IOMMU-backed
implementation (or so I think). The selftests mainly test the principal
workflow; more corner cases still need to be added.

Note: given that there's no capability reporting for new APIs, page
sizes, etc., a userspace app using the IOMMUFD native API will get
-EOPNOTSUPP when dirty tracking is not supported by the IOMMU hardware.

For completeness, and most importantly to make sure the new IOMMU core
ops capture the hardware blocks, all the IOMMUs that will eventually get
IOMMU A/D support were implemented. So the next half of the series
presents *proof of concept* implementations for these IOMMUs:

* Patches 9-11: AMD IOMMU implementation, particularly for those with
HDSup support. Tested with a Qemu amd-iommu with HDSup emulated, and
also on an AMD Milan server IOMMU.

* Patches 12-17: Adapt the past series from Keqian Zhu[2], reworked to
do the dynamic set/clear of dirty tracking and to implicitly clear the
dirty bits on readout. Given the lack of hardware and the difficulty of
getting this into an emulated SMMUv3 (given the dependency on the PE
HTTU and BBML2, IIUC), this is only compile-tested. Hopefully I am not
getting the attribution wrong.

* Patches 18-19: Intel IOMMU rev3.x implementation. Tested with a
Qemu-based intel-iommu with SSADS/SLADS emulation support.

To help testing/prototyping, qemu iommu emulation bits were written to
increase coverage of this code and hopefully make this more broadly
available to fellow contributors/devs. A separate series is submitted
right after this one covering the Qemu IOMMUFD extensions for dirty
tracking, alongside its x86 iommus emulation A/D bits. Meanwhile it's
also on github (https://github.com/jpemartins/qemu/commits/iommufd).

Remarks / Observations:

* There's no capabilities API in IOMMUFD, and in this RFC each vendor
checks what is supported inside each of the newly added ops. Initially I
was thinking of having a HWPT_GET_DIRTY to probe how dirty tracking is
supported (rather than bailing out with -EOPNOTSUPP), as well as a
get_dirty_tracking iommu-core API. On the UAPI, perhaps it might be
better to have a single API for capabilities in general (similar to
KVM), at its simplest a subop where the necessary info is conveyed on a
per-subop basis?

* The UAPI/kAPI could be generalized over the next iteration to also
cover the Access bit (or Intel's Extended Access bit that tracks non-CPU
usage). It wasn't done, as I was not aware of a use-case. I am wondering
if the access bits could be used to do some form of zero-page detection
(to send only the pages that got touched), although dirty bits could be
used just the same way. Happy to adjust for RFCv2. The algorithms, the
IOPTE walk and the marshalling into bitmaps, as well as the necessary
IOTLB flush batching, are all the same. The focus is on the dirty bit
given that the dirty-IOVA feedback is used to select the pages that need
to be transferred to the destination while migration is happening.
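To illustrate that the walk is the same regardless of which PTE bit is
tracked, a vendor implementation of read_and_clear_dirty() conceptually
boils down to something like the sketch below, where
iopte_test_and_clear_dirty() is a made-up accessor standing in for the
per-vendor page-table code in patches 9-19:

/*
 * Conceptual sketch of a vendor read_and_clear_dirty() walk.
 * iopte_test_and_clear_dirty() is a made-up accessor that tests and
 * clears the dirty bit of the IOPTE mapping @iova and returns the page
 * size that IOPTE covers via @pgsize.
 */
static int example_read_and_clear_dirty(struct iommu_domain *domain,
					unsigned long iova, size_t size,
					struct iommu_dirty_bitmap *dirty)
{
	unsigned long end = iova + size;
	size_t pgsize;

	while (iova < end) {
		/*
		 * Test-and-clear in one pass; no IOTLB flush here, the
		 * caller batches the flush once the whole walk is done.
		 */
		if (iopte_test_and_clear_dirty(domain, iova, &pgsize))
			iommu_dirty_bitmap_record(dirty, iova, pgsize);

		iova += pgsize;
	}

	return 0;
}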
Sidebar: Sadly, there are a lot fewer clever tricks possible here
(compared to CPU/KVM) without having the PCI device cooperate (like
userfaultfd, wrprotect, etc., as those would turn into nefarious IOMMU
perm faults, with devices taking DMA target aborts).

If folks think the UAPI/iommu-kAPI should be agnostic to any PTE A/D
bits, we can instead have the ioctls be named after HWPT_SET_TRACKING()
and add another argument which asks which bits to enable tracking of
(IOMMUFD_ACCESS/IOMMUFD_DIRTY/IOMMUFD_ACCESS_NONCPU). Likewise for the
read_and_clear() counterpart, as all PTE bits follow the same logic as
dirty. Happy to readjust if folks think it is worthwhile.

* IOMMU Nesting /shouldn't/ matter in this work, as it is expected that
we only care about the first stage of IOMMU pagetables for hypervisors,
i.e. tracking dirty GPAs (and not caring about dirty GIOVAs).

* Dirty bit tracking alone is not enough. Large IO pages tend to be the
norm when DMA-mapping large ranges of IOVA space, when really the VMM
wants the smallest granularity possible to track (i.e. host base pages).
A separate bit of work will need to take care of demoting IOPTE page
sizes at guest runtime to increase/decrease the dirty tracking
granularity, likely in the form of an IOAS demote/promote of the page
size within a previously mapped IOVA range.

Feedback is very much appreciated!

[0] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@xxxxxxxxxx/
[1] https://lore.kernel.org/kvm/20220414104710.28534-1-yi.l.liu@xxxxxxxxx/
[2] https://lore.kernel.org/linux-arm-kernel/20210413085457.25400-1-zhukeqian1@xxxxxxxxxx/

	Joao

TODOs:
* More selftests for large/small IOPTE sizes;
* Better vIOMMU+VFIO testing (AMD doesn't support it);
* Performance efficiency of GET_DIRTY_IOVA in various workloads;
* Testing with a live migratable VF;

Jean-Philippe Brucker (1):
  iommu/arm-smmu-v3: Add feature detection for HTTU

Joao Martins (16):
  iommu: Add iommu_domain ops for dirty tracking
  iommufd: Dirty tracking for io_pagetable
  iommufd: Dirty tracking data support
  iommu: Add an unmap API that returns dirtied IOPTEs
  iommufd: Add a dirty bitmap to iopt_unmap_iova()
  iommufd: Dirty tracking IOCTLs for the hw_pagetable
  iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
  iommufd: Add a test for dirty tracking ioctls
  iommu/amd: Access/Dirty bit support in IOPTEs
  iommu/amd: Add unmap_read_dirty() support
  iommu/amd: Print access/dirty bits if supported
  iommu/arm-smmu-v3: Add read_and_clear_dirty() support
  iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
  iommu/arm-smmu-v3: Add unmap_read_dirty() support
  iommu/intel: Access/Dirty bit support for SL domains
  iommu/intel: Add unmap_read_dirty() support

Kunkun Jiang (2):
  iommu/arm-smmu-v3: Add feature detection for BBML
  iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping

 drivers/iommu/amd/amd_iommu.h               |   1 +
 drivers/iommu/amd/amd_iommu_types.h         |  11 +
 drivers/iommu/amd/init.c                    |  12 +-
 drivers/iommu/amd/io_pgtable.c              | 100 +++++++-
 drivers/iommu/amd/iommu.c                   |  99 ++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 135 +++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  14 ++
 drivers/iommu/intel/iommu.c                 | 152 +++++++++++-
 drivers/iommu/intel/pasid.c                 |  76 ++++++
 drivers/iommu/intel/pasid.h                 |   7 +
 drivers/iommu/io-pgtable-arm.c              | 232 ++++++++++++++++--
 drivers/iommu/iommu.c                       |  71 +++++-
 drivers/iommu/iommufd/hw_pagetable.c        |  79 ++++++
 drivers/iommu/iommufd/io_pagetable.c        | 253 +++++++++++++++++++-
 drivers/iommu/iommufd/io_pagetable.h        |   3 +-
 drivers/iommu/iommufd/ioas.c                |  35 ++-
 drivers/iommu/iommufd/iommufd_private.h     |  59 ++++-
 drivers/iommu/iommufd/iommufd_test.h        |   9 +
 drivers/iommu/iommufd/main.c                |   9 +
 drivers/iommu/iommufd/pages.c               |  79 +++++-
 drivers/iommu/iommufd/selftest.c            | 137 ++++++++++-
 drivers/iommu/iommufd/vfio_compat.c         | 221 ++++++++++++++++-
 include/linux/intel-iommu.h                 |  30 +++
 include/linux/io-pgtable.h                  |  20 ++
 include/linux/iommu.h                       |  64 +++++
 include/uapi/linux/iommufd.h                |  78 ++++++
 tools/testing/selftests/iommu/Makefile      |   1 +
 tools/testing/selftests/iommu/iommufd.c     | 135 +++++++++++
 28 files changed, 2047 insertions(+), 75 deletions(-)

-- 
2.17.2