> From: Joao Martins <joao.m.martins@xxxxxxxxxx>
> Sent: Friday, April 29, 2022 5:09 AM
>
> Presented herewith is a series that extends IOMMUFD to have IOMMU
> hardware support for the dirty bit in the IOPTEs.
>
> Today, AMD Milan (which has been out for a year now) supports it while
> ARM SMMUv3.2+ alongside VT-D rev3.x are expected to eventually come
> along. The intended use-case is to support Live Migration with SR-IOV,
> with IOMMUs that support it.

This should not be restricted to SR-IOV.

> Yishai Hadas will soon be submitting an RFC that covers the PCI device
> dirty tracker via vfio.
>
> At a quick glance, IOMMUFD lets the userspace VMM create the IOAS with
> a set of IOVA ranges mapped to some physical memory composing an IO
> pagetable. This is then attached to a particular device, consequently
> creating the protection domain to share a common IO page table
> representing the endpoint DMA-addressable guest address space.
> (Hopefully I am not twisting the terminology here.)

Just remove VMM/guest/... since iommufd is not specific to
virtualization.

> The resultant object is a hw_pagetable object which represents the
> iommu_domain object that will be directly manipulated. For more
> background on IOMMUFD have a look at these two series[0][1] on the
> kernel and qemu consumption respectively. The IOMMUFD UAPI, kAPI and
> the iommu core kAPI are then extended to provide:
>
> 1) Enabling or disabling dirty tracking on the iommu_domain. Modeled
> as the most common case of changing hardware protection domain control
> bits, and the ARM-specific case of having to enable the per-PTE DBM
> control bit.

I didn't get what 'most common case' here tries to explain.

> The 'real' tracking of whether dirty tracking is enabled or not is
> stored in the vendor IOMMU, hence no new fields are added to iommufd
> pagetable structures.
>
> 2) Read the I/O PTEs and marshal their dirtiness into a bitmap. The
> bitmap thus describes the IOVAs that got written by the device. While
> performing the marshalling, vendors also need to clear the dirty bits
> from the IOPTEs and allow the kAPI caller to batch the much-needed
> IOTLB flush.

s/vendors/iommu drivers/

> There's no copy of bitmaps to userspace backed memory, all is zerocopy
> based. So far this is a test-and-clear kind of interface given that
> the IOPT walk is going to be expensive. It occurred to me to separate
> the readout of dirty, and the clearing of dirty from IOPTEs. I haven't
> opted for that one, given that it would mean two lengthy IOPTE walks
> and felt counter-performant.

Me too. That doesn't feel like a performant way.

> 3) Unmapping an IOVA range while returning its dirty bit prior to
> unmap. This case is specific to the non-nested vIOMMU case where an
> erroneous guest (or device) is DMAing to an address being unmapped at
> the same time.

An erroneous attempt like above cannot anticipate which DMAs can
succeed in that window, thus the end behavior is undefined. For an
undefined behavior nothing will be broken by losing some bits dirtied
in the window between reading back dirty bits of the range and actually
calling unmap. From the guest p.o.v. all those are black-box hardware
logic to serve a virtual iotlb invalidation request which just cannot
be completed in one cycle.

Hence in reality probably this is not required except to meet the vfio
compat requirement. Just in concept returning dirty bits at unmap is
more accurate. I'm slightly inclined to abandon it in the iommufd uAPI.
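As a side note, to make sure I'm parsing the three points above
correctly, my mental model of the new domain ops is roughly the
following (paraphrased from this description and the patch subjects
further down; the exact signatures, struct placement and the bitmap
type are illustrative, not copied from the patches):

	/*
	 * Rough paraphrase only, not the literal code in this series.
	 */
	struct iommu_dirty_bitmap;	/* iommufd-owned dirty state */

	struct iommu_domain_ops {
		/* ... existing ops ... */

		/* 1) toggle dirty tracking on the protection domain / PTEs */
		int (*set_dirty_tracking)(struct iommu_domain *domain,
					  bool enable);

		/* 2) walk the IOPTEs of [iova, iova + size), record dirty
		 * IOVAs into the caller-provided bitmap, clear the dirty
		 * bits and let the caller batch the IOTLB flush
		 */
		int (*read_and_clear_dirty)(struct iommu_domain *domain,
					    unsigned long iova, size_t size,
					    struct iommu_dirty_bitmap *dirty);

		/* 3) unmap variant that also reports dirtiness of the range */
		size_t (*unmap_read_dirty)(struct iommu_domain *domain,
					   unsigned long iova, size_t size,
					   struct iommu_dirty_bitmap *dirty);
	};

If that reading is accurate, then 1) and 2) look like the
uncontroversial core, and 3) is the part I'd consider dropping per the
comment above.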
>
> [See at the end too, on general remarks, specifically the one
> regarding probing dirty tracking via a dedicated iommufd cap ioctl]
>
> The series is organized as follows:
>
> * Patches 1-3: Takes care of the iommu domain operations to be added
> and extends the iommufd io-pagetable to set/clear dirty tracking, as
> well as reading the dirty bits from the vendor pagetables. The idea is
> to abstract iommu vendors from any idea of how bitmaps are stored or
> propagated back to the caller, as well as allowing control/batching
> over IOTLB flush. So there's a data structure and a helper that only
> tells the upper layer that an IOVA range got dirty. IOMMUFD carries
> the logic to pin pages, walking the bitmap user memory, and kmap-ing
> them as needed.

Why do we need another pinning here? Any page mapped in the iommu page
table is supposed to have been pinned already...

> The IOMMU vendor just has an idea of a 'dirty bitmap state', and
> recording an IOVA as dirty is done by the vendor IOMMU implementor.
>
> * Patches 4-5: Adds the new unmap domain op that returns whether the
> IOVA got dirtied. I separated this from the rest of the set, as I am
> still questioning the need for this API and whether this race needs to
> be fundamentally handled. I guess the thinking is that live-migration
> should be guest foolproof, but how often does the race happen in
> practice to deem this a necessary unmap variant? Perhaps it might be
> enough to fetch dirty bits prior to the unmap? Feedback appreciated.

I think so, as mentioned above.

> * Patches 6-8: Adds the UAPIs for IOMMUFD, vfio-compat and selftests.
> We should discuss whether to include the vfio-compat or not, given how
> vfio-type1-iommu perpetually dirties any IOVA and here I am replacing
> that with the IOMMU hw support. I haven't implemented the perpetual
> dirtying given its lack of usefulness over an IOMMU-backed
> implementation (or so I think). The selftests mainly test the
> principal workflow; more corner cases still need to be added.

Or, put another way, could we keep vfio-compat doing what type1 does
today, i.e. restrict iommu dirty tracking to the iommufd native uAPI?

> Note: Given that there's no capability reporting for new APIs, page
> sizes, etc., the userspace app using the IOMMUFD native API would get
> -EOPNOTSUPP when dirty tracking is not supported by the IOMMU
> hardware.
>
> For completeness and most importantly to make sure the new IOMMU core
> ops capture the hardware blocks, all the IOMMUs that will eventually
> get IOMMU A/D support were implemented. So the next half of the series
> presents *proof of concept* implementations for IOMMUs:
>
> * Patches 9-11: AMD IOMMU implementation, particularly on those having
> HDSup support. Tested with a Qemu amd-iommu with HDSup emulated, and
> also on an AMD Milan server IOMMU.
>
> * Patches 12-17: Adapts the past series from Keqian Zhu[2] but
> reworked to do the dynamic set/clear dirty tracking, and implicitly
> clearing dirty bits on the readout. Given the lack of hardware and the
> difficulty of getting this in an emulated SMMUv3 (given the dependency
> on the PE HTTU and BBML2, IIUC), this is only compile tested.
> Hopefully I am not getting the attribution wrong.
>
> * Patches 18-19: Intel IOMMU rev3.x implementation. Tested with a
> Qemu-based intel-iommu with SSADS/SLADS emulation support.
>
> To help testing/prototyping, qemu iommu emulation bits were written to
> increase coverage of this code and hopefully make this more broadly
> available for fellow contributors/devs.
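One more thought, going back to the -EOPNOTSUPP note a few paragraphs
up: without any capability reporting, userspace discovery presumably
ends up as a try-and-fallback flow along the lines of the sketch below.
The ioctl/struct names and layout here are made up for the sake of
discussion, not necessarily what patches 6-8 actually add; the point is
only that the sole probe mechanism becomes "attempt to enable and
interpret -EOPNOTSUPP". That works, but it ties in with the capability
discussion further down.

	#include <errno.h>
	#include <stdbool.h>
	#include <sys/ioctl.h>
	#include <linux/types.h>

	/* hypothetical layout, for illustration only */
	struct iommu_hwpt_set_dirty_sketch {
		__u32 size;
		__u32 flags;		/* hypothetical enable/disable bit */
		__u32 hwpt_id;
		__u32 __reserved;
	};
	/* placeholder request number, not a real uAPI definition */
	#define IOMMU_HWPT_SET_DIRTY_SKETCH	_IO(';', 0x80)

	static bool hwpt_dirty_supported(int iommufd, __u32 hwpt_id)
	{
		struct iommu_hwpt_set_dirty_sketch cmd = {
			.size = sizeof(cmd),
			.flags = 1,	/* hypothetical ENABLE */
			.hwpt_id = hwpt_id,
		};

		/* no cap ioctl: probe by trying to enable tracking */
		if (ioctl(iommufd, IOMMU_HWPT_SET_DIRTY_SKETCH, &cmd))
			return errno != EOPNOTSUPP;

		/* supported: disable again until migration actually starts */
		cmd.flags = 0;		/* hypothetical DISABLE */
		ioctl(iommufd, IOMMU_HWPT_SET_DIRTY_SKETCH, &cmd);
		return true;
	}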
> A separate series is submitted right after this covering the Qemu
> IOMMUFD extensions for dirty tracking, alongside its x86 iommus
> emulation A/D bits. Meanwhile it's also on github
> (https://github.com/jpemartins/qemu/commits/iommufd)
>
> Remarks / Observations:
>
> * There's no capabilities API in IOMMUFD, and in this RFC each vendor
> tracks what it has access to in each of the newly added ops.

There was discussion about adding a device capability uAPI somewhere.

> Initially I was thinking to have a HWPT_GET_DIRTY to probe how dirty
> tracking is supported (rather than bailing out with -EOPNOTSUPP) as
> well as a get_dirty_tracking iommu-core API. On the UAPI, perhaps it
> might be better to have a single API for capabilities in general
> (similar to KVM), at its simplest a subop where the necessary info is
> conveyed on a per-subop basis?

Probably this can be reported as a device cap, as support for the dirty
bit is an immutable property of the iommu serving that device.
Userspace can enable dirty tracking on a hwpt if all attached devices
claim the support, and the kernel will do the same verification.

Btw, do we still want to keep the vfio type1 behavior as the fallback,
i.e. mark all pinned pages as dirty when iommu dirty support is
missing? From a uAPI naming p.o.v. set/clear_dirty_tracking doesn't
preclude a special implementation like vfio type1.

> * The UAPI/kAPI could be generalized over the next iteration to also
> cover the Access bit (or Intel's Extended Access bit that tracks
> non-CPU usage). It wasn't done, as I was not aware of a use-case. I am
> wondering if the access-bits could be used to do some form of zero
> page detection (to just send the pages that got touched), although
> dirty-bits could be used just the same way. Happy to adjust for RFCv2.

I'm not a fan of adding support for uncertain usages. Compared to this
I'd give higher priority to large page break-down, as without it it's
hard to find a real-world deployment of this work. 😊

> The algorithms, IOPTE walk and marshalling into bitmaps as well as the
> necessary IOTLB flush batching are all the same. The focus is on the
> dirty bit given that the dirtiness IOVA feedback is used to select the
> pages that need to be transferred to the destination while migration
> is happening.
> Sidebar: Sadly, there are a lot fewer clever tricks that can be done
> (compared to the CPU/KVM) without having the PCI device cooperate
> (like userfaultfd, wrprotect, etc as those would turn into nefarious
> IOMMU perm faults and devices with DMA target aborts).
> If folks think the UAPI/iommu-kAPI should be agnostic to any PTE A/D
> bits, we can instead have the ioctls be named after HWPT_SET_TRACKING()
> and add another argument which asks which bits to enable tracking for
> (IOMMUFD_ACCESS/IOMMUFD_DIRTY/IOMMUFD_ACCESS_NONCPU).
> Likewise for the read_and_clear() as all PTE bits follow the same
> logic as dirty. Happy to readjust if folks think it is worthwhile.
>
> * IOMMU Nesting /shouldn't/ matter in this work, as it is expected
> that we only care about the first stage of IOMMU pagetables for
> hypervisors i.e. tracking dirty GPAs (and not caring about dirty
> GIOVAs).

The hypervisor uses the second stage while the guest manages the first
stage in nesting.

> * Dirty bit tracking alone is not enough. Large IO pages tend to be
> the norm when DMA mapping large ranges of IOVA space, when really the
> VMM wants the smallest granularity possible to track (i.e. host base
> pages).
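On this last point (and the demote/promote idea quoted right below): to
be concrete about the large page break-down I mentioned earlier, what I
have in mind is an in-place split primitive along these lines (purely
illustrative, nothing like this exists in this series):

	/* Purely illustrative, not part of this series.  Split a
	 * previously mapped huge IOPTE into base-page IOPTEs covering
	 * the same IOVA/PA range, without an unmap/remap cycle that
	 * would break DMA in flight, so that dirty bits can then be
	 * tracked at base-page granularity.
	 */
	int iommu_split_mapping(struct iommu_domain *domain,
				unsigned long iova, size_t size,
				size_t new_pgsize);

	/* ...and possibly the reverse, to restore huge mappings (and
	 * the IOTLB footprint) once migration completes or is aborted.
	 */
	int iommu_collapse_mapping(struct iommu_domain *domain,
				   unsigned long iova, size_t size);

Whether that is exposed as an IOAS/hwpt uAPI or done implicitly when
dirty tracking is enabled is a separate question, but without something
like it the tracking granularity is whatever page size the mapping
happened to use.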
> A separate bit of work will need to take care of demoting IOPTE page
> sizes at guest-runtime to increase/decrease the dirty tracking
> granularity, likely in the form of an IOAS demote/promote page-size
> within a previously mapped IOVA range.
>
> Feedback is very much appreciated!

Thanks for the work!

> [0] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@xxxxxxxxxx/
> [1] https://lore.kernel.org/kvm/20220414104710.28534-1-yi.l.liu@xxxxxxxxx/
> [2] https://lore.kernel.org/linux-arm-kernel/20210413085457.25400-1-zhukeqian1@xxxxxxxxxx/
>
> Joao
>
> TODOs:
> * More selftests for large/small iopte sizes;
> * Better vIOMMU+VFIO testing (AMD doesn't support it);
> * Performance efficiency of GET_DIRTY_IOVA in various workloads;
> * Testing with a live migrateable VF;
>
> Jean-Philippe Brucker (1):
>   iommu/arm-smmu-v3: Add feature detection for HTTU
>
> Joao Martins (16):
>   iommu: Add iommu_domain ops for dirty tracking
>   iommufd: Dirty tracking for io_pagetable
>   iommufd: Dirty tracking data support
>   iommu: Add an unmap API that returns dirtied IOPTEs
>   iommufd: Add a dirty bitmap to iopt_unmap_iova()
>   iommufd: Dirty tracking IOCTLs for the hw_pagetable
>   iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
>   iommufd: Add a test for dirty tracking ioctls
>   iommu/amd: Access/Dirty bit support in IOPTEs
>   iommu/amd: Add unmap_read_dirty() support
>   iommu/amd: Print access/dirty bits if supported
>   iommu/arm-smmu-v3: Add read_and_clear_dirty() support
>   iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
>   iommu/arm-smmu-v3: Add unmap_read_dirty() support
>   iommu/intel: Access/Dirty bit support for SL domains
>   iommu/intel: Add unmap_read_dirty() support
>
> Kunkun Jiang (2):
>   iommu/arm-smmu-v3: Add feature detection for BBML
>   iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
>
>  drivers/iommu/amd/amd_iommu.h               |   1 +
>  drivers/iommu/amd/amd_iommu_types.h         |  11 +
>  drivers/iommu/amd/init.c                    |  12 +-
>  drivers/iommu/amd/io_pgtable.c              | 100 +++++++-
>  drivers/iommu/amd/iommu.c                   |  99 ++++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 135 +++++++++++
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  14 ++
>  drivers/iommu/intel/iommu.c                 | 152 +++++++++++-
>  drivers/iommu/intel/pasid.c                 |  76 ++++++
>  drivers/iommu/intel/pasid.h                 |   7 +
>  drivers/iommu/io-pgtable-arm.c              | 232 ++++++++++++++++--
>  drivers/iommu/iommu.c                       |  71 +++++-
>  drivers/iommu/iommufd/hw_pagetable.c        |  79 ++++++
>  drivers/iommu/iommufd/io_pagetable.c        | 253 +++++++++++++++++++-
>  drivers/iommu/iommufd/io_pagetable.h        |   3 +-
>  drivers/iommu/iommufd/ioas.c                |  35 ++-
>  drivers/iommu/iommufd/iommufd_private.h     |  59 ++++-
>  drivers/iommu/iommufd/iommufd_test.h        |   9 +
>  drivers/iommu/iommufd/main.c                |   9 +
>  drivers/iommu/iommufd/pages.c               |  79 +++++-
>  drivers/iommu/iommufd/selftest.c            | 137 ++++++++++-
>  drivers/iommu/iommufd/vfio_compat.c         | 221 ++++++++++++++++-
>  include/linux/intel-iommu.h                 |  30 +++
>  include/linux/io-pgtable.h                  |  20 ++
>  include/linux/iommu.h                       |  64 +++++
>  include/uapi/linux/iommufd.h                |  78 ++++++
>  tools/testing/selftests/iommu/Makefile      |   1 +
>  tools/testing/selftests/iommu/iommufd.c     | 135 +++++++++++
>  28 files changed, 2047 insertions(+), 75 deletions(-)
>
> --
> 2.17.2