On 4/29/22 06:45, Tian, Kevin wrote:
>> From: Joao Martins <joao.m.martins@xxxxxxxxxx>
>> Sent: Friday, April 29, 2022 5:09 AM
>>
>> Presented herewith is a series that extends IOMMUFD to have IOMMU
>> hardware support for the dirty bit in the IOPTEs.
>>
>> Today, AMD Milan (which has been out for a year now) supports it, while
>> ARM SMMUv3.2+ alongside VT-D rev3.x are expected to eventually come
>> along. The intended use-case is to support Live Migration with SR-IOV,
>> with
>
> this should not be restricted to SR-IOV.
>

True. Should have written PCI Devices, as that is orthogonal to
SF/S-IOV/SR-IOV.

>> IOMMUs that support it. Yishai Hadas will soon be submitting an RFC
>> that covers the PCI device dirty tracker via vfio.
>>
>> At a quick glance, IOMMUFD lets the userspace VMM create the IOAS with
>> a set of IOVA ranges mapped to some physical memory composing an IO
>> pagetable. This is then attached to a particular device, consequently
>> creating the protection domain to share a common IO page table
>> representing the endpoint DMA-addressable guest address space.
>> (Hopefully I am not twisting the terminology here) The resultant object
>
> Just remove VMM/guest/... since iommufd is not specific to virtualization.
>

/me nods

>> is a hw_pagetable object which represents the iommu_domain object that
>> will be directly manipulated. For more background on IOMMUFD have a
>> look at these two series[0][1] on the kernel and qemu consumption
>> respectively. The IOMMUFD UAPI, kAPI and the iommu core kAPI are then
>> extended to provide:
>>
>> 1) Enabling or disabling dirty tracking on the iommu_domain. Modeled as
>> the most common case of changing hardware protection domain control
>
> didn't get what 'most common case' here tries to explain
>

Most common case because, out of the 3 IOMMUs analyzed, two of them
(Intel, AMD) require changing per-device context bits rather than page
table entries (ARM).

>> bits, and the ARM-specific case of having to enable the per-PTE DBM
>> control bit. The 'real' tracking of whether dirty tracking is enabled
>> or not is stored in the vendor IOMMU, hence no new fields are added to
>> iommufd pagetable structures.
>>
>> 2) Read the I/O PTEs and marshal their dirtiness into a bitmap. The
>> bitmap thus describes the IOVAs that got written by the device. While
>> performing the marshalling, vendors also need to clear the dirty bits
>> from the IOPTEs and
>
> s/vendors/iommu drivers/
>

OK, I will avoid the `vendor` term going forward.

>> allow the kAPI caller to batch the much-needed IOTLB flush.
>> There's no copy of bitmaps to userspace-backed memory, all is zero-copy
>> based. So far this is a test-and-clear kind of interface given that the
>> IOPT walk is going to be expensive. It occurred to me to separate the
>> readout of dirty bits from the clearing of dirty bits in the IOPTEs.
>> I haven't opted for that, given that it would mean two lengthy IOPTE
>> walks and felt counter-performant.
>
> me too. that doesn't feel like a performant way.
>
>>
>> 3) Unmapping an IOVA range while returning its dirty bit prior to
>> unmap. This case is specific to the non-nested vIOMMU case where an
>> erroneous guest (or device) is DMAing to an address being unmapped at
>> the same time.
>
> an erroneous attempt like above cannot anticipate which DMAs can
> succeed in that window thus the end behavior is undefined. For an
> undefined behavior nothing will be broken by losing some bits dirtied
> in the window between reading back dirty bits of the range and
> actually calling unmap. From guest p.o.v.
> all those are black-box hardware logic to serve a virtual iotlb
> invalidation request which just cannot be completed in one cycle.
>
> Hence in reality probably this is not required except to meet vfio
> compat requirement. Just in concept returning dirty bits at unmap
> is more accurate.
>
> I'm slightly inclined to abandon it in iommufd uAPI.
>

OK, it seems I am not far off from your thoughts. I'll see what others
think too, and if so I'll remove the unmap_dirty. Because if vfio-compat
doesn't get the iommu hw dirty support, then there would be no users of
unmap_dirty.

>>
>> [See at the end too, on general remarks, specifically the one regarding
>> probing dirty tracking via a dedicated iommufd cap ioctl]
>>
>> The series is organized as follows:
>>
>> * Patches 1-3: Takes care of the iommu domain operations to be added
>> and extends the iommufd io-pagetable to set/clear dirty tracking, as
>> well as to read the dirty bits from the vendor pagetables. The idea is
>> to abstract iommu vendors from any idea of how bitmaps are stored or
>> propagated back to the caller, as well as allowing control/batching
>> over IOTLB flush. So there's a data structure and a helper that only
>> tells the upper layer that an IOVA range got dirty. IOMMUFD carries the
>> logic to pin pages, walking
>
> why do we need another pinning here? any page mapped in iommu page
> table is supposed to have been pinned already...
>

The pinning is for the user bitmap data, not the IOVAs. This is mainly
to avoid doing any copying of the bitmap dirty data back to userspace.
And this happens for every 2M of bitmap data (i.e. representing 64G of
IOVA space, having one page track 128M of IOVA, assuming the worst-case
scenario of base pages; the exact arithmetic is spelled out in a small
sketch at the end of this mail). I think I can't just use/deref user
memory bluntly, and the IOMMU core ought to work with kernel buffers
instead.

>> the bitmap user memory, and kmap-ing them as needed. The IOMMU vendor
>> just has an idea of a 'dirty bitmap state', with recording an IOVA as
>> dirty left to the vendor IOMMU implementor.
>>
>> * Patches 4-5: Adds the new unmap domain op that returns whether the
>> IOVA got dirtied. I separated this from the rest of the set, as I am
>> still questioning the need for this API and whether this race needs to
>> be fundamentally handled. I guess the thinking is that live-migration
>> should be guest foolproof, but it's unclear how often the race happens
>> in practice to deem this a necessary unmap variant. Perhaps it might be
>> enough to fetch the dirty bits prior to the unmap? Feedback appreciated.
>
> I think so as aforementioned.
>

/me nods

>> * Patches 6-8: Adds the UAPIs for IOMMUFD, vfio-compat and selftests.
>> We should discuss whether to include the vfio-compat or not. Given how
>> vfio-type1-iommu perpetually dirties any IOVA, here I am replacing that
>> with the IOMMU hw support. I haven't implemented the perpetual dirtying
>> given its lack of usefulness over an IOMMU-backed implementation (or so
>> I think). The selftests mainly test the principal workflow and still
>> need more corner cases added.
>
> Or in another way could we keep vfio-compat as type1 does today, i.e.
> restricting iommu dirty tracking only to iommufd native uAPI?
>

I suppose? Another option is not exposing the type1 migration
capability. See further below.

>> Note: Given that there's no capability for new APIs, or page sizes,
>> etc., the userspace app using the IOMMUFD native API would get
>> -EOPNOTSUPP when dirty tracking is not supported by the IOMMU hardware.
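
To make the probing story above concrete: without a capability ioctl,
userspace would have to probe by attempting the operation and treating
-EOPNOTSUPP as "no IOMMU dirty tracking". A minimal sketch of that is
below; the ioctl request name and argument are made-up placeholders for
illustration, not the UAPI added by this series:

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/ioctl.h>

/*
 * Made-up placeholder: the real request number and argument layout would
 * come from the iommufd UAPI header, which this sketch does not assume.
 */
#define HWPT_SET_DIRTY_EXAMPLE	_IO(';', 0x80)

bool hwpt_dirty_tracking_supported(int iommufd, unsigned int hwpt_id)
{
	/* Try to enable dirty tracking on the hw_pagetable... */
	if (ioctl(iommufd, HWPT_SET_DIRTY_EXAMPLE, &hwpt_id) == 0)
		return true;

	/* ...and treat -EOPNOTSUPP as lack of IOMMU dirty bit support. */
	if (errno == EOPNOTSUPP)
		return false;

	perror("HWPT_SET_DIRTY_EXAMPLE");
	return false;
}

A dedicated cap ioctl (see the remarks further down) would let userspace
learn this upfront rather than having to attempt, and then undo, the
operation.
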
>>
>> For completeness, and most importantly to make sure the new IOMMU core
>> ops capture the hardware blocks, all the IOMMUs that will eventually
>> get IOMMU A/D support were implemented. So the next half of the series
>> presents *proof of concept* implementations for IOMMUs:
>>
>> * Patches 9-11: AMD IOMMU implementation, particularly on those having
>> HDSup support. Tested with a Qemu amd-iommu with HDSup emulated, and
>> also on an AMD Milan server IOMMU.
>>
>> * Patches 12-17: Adapts the past series from Keqian Zhu[2], but
>> reworked to do the dynamic set/clear dirty tracking, and implicitly
>> clearing dirty bits on the readout. Given the lack of hardware and the
>> difficulty of getting this into an emulated SMMUv3 (given the
>> dependency on the PE HTTU and BBML2, IIUC), this is only
>> compile-tested. Hopefully I am not getting the attribution wrong.
>>
>> * Patches 18-19: Intel IOMMU rev3.x implementation. Tested with a
>> Qemu-based intel-iommu with SSADS/SLADS emulation support.
>>
>> To help testing/prototyping, qemu iommu emulation bits were written to
>> increase coverage of this code and hopefully make this more broadly
>> available for fellow contributors/devs. A separate series is submitted
>> right after this one covering the Qemu IOMMUFD extensions for dirty
>> tracking, alongside its x86 iommu emulation A/D bits. Meanwhile it's
>> also on github (https://github.com/jpemartins/qemu/commits/iommufd).
>>
>> Remarks / Observations:
>>
>> * There's no capabilities API in IOMMUFD, and in this RFC each vendor
>> tracks
>
> there was discussion about adding device capability uAPI somewhere.
>

Ack. Let me know if there are snippets of that conversation, as I seem
to have missed it.

>> what has access in each of the newly added ops. Initially I was
>> thinking to have a HWPT_GET_DIRTY to probe how dirty tracking is
>> supported (rather than bailing out with EOPNOTSUPP) as well as a
>> get_dirty_tracking iommu-core API. On the UAPI, perhaps it might be
>> better to have a single API for capabilities in general (similar to
>> KVM), at its simplest a subop where the necessary info is conveyed on a
>> per-subop basis?
>
> probably this can be reported as a device cap as support for the dirty
> bit is an immutable property of the iommu serving that device.

I wasn't quite sure how this maps onto the rest of the potential
features to probe in the iommufd grand scheme of things. I'll get it
properly done for the next iteration. In the kernel, I was wondering if
this could be tracked at the iommu_domain level, given that virtually
all supporting iommu drivers will need to track the dirty-tracking
status on a per-domain basis. But that structure is devoid of any state
:/ so I suppose each iommu driver tracks it in its private structures
(which part of me was trying to avoid).

> Userspace can
> enable dirty tracking on a hwpt if all attached devices claim the support
> and kernel will do the same verification.
>

Sorry to be dense, but this is not up to 'devices' given they take no
part in the tracking? I guess by 'devices' you mean the software idea of
it, i.e. the iommu context created for attaching a given physical
device, not the physical device itself.

> btw do we still want to keep vfio type1 behavior as the fallback i.e. mark
> all pinned pages as dirty when iommu dirty support is missing? From uAPI
> naming p.o.v. set/clear_dirty_tracking doesn't preclude a special
> implementation like vfio type1.
>

Maybe let's not give userspace the illusion that dirty tracking is
supported?
I wonder how much of this can be done in userspace without the iommu
pretending to be doing said tracking, if all we are doing is setting all
IOVAs as dirty. The issue /I think/ with the perpetual dirtiness is that
it's not that useful in practice, and it gives a false impression that
any tracking is happening. It really only looks useful for maybe testing
a vfio-pci vendor driver, and one has to set a gigantic @downtime-limit,
so large as not to make the VMM think that the migration can't converge
given the very high rate of dirty pages. For testing in general, my idea
was to have iommu emulation fill that gap.

>> * The UAPI/kAPI could be generalized over the next iteration to also
>> cover the Access bit (or Intel's Extended Access bit that tracks
>> non-CPU usage). It wasn't done, as I was not aware of a use-case. I am
>> wondering if the access bits could be used to do some form of zero-page
>> detection (to just send the pages that got touched), although dirty
>> bits could be used just the same way. Happy to adjust for RFCv2. The
>> algorithms, IOPTE
>
> I'm not a fan of adding support for uncertain usages.

The suggestion above was really because the logic doesn't change much.
But I guess there's no point in fattening the UAPI if there's no
use-case.

> Comparing to this
> I'd give higher priority to large page break-down as w/o it it's hard to
> find real-world deployment on this work. 😊
>

Yeap. Once I hash out the comments I get here in terms of direction,
that's what I will be focusing on next (unless someone else wants to
take that adventure).

>> walk and marshalling into bitmaps, as well as the necessary IOTLB flush
>> batching, are all the same. The focus is on the dirty bit given that
>> the IOVA dirtiness feedback is used to select the pages that need to be
>> transferred to the destination while migration is happening.
>> Sidebar: Sadly, there are a lot fewer clever tricks that can be done
>> (compared to the CPU/KVM) without having the PCI device cooperate (like
>> userfaultfd, wrprotect, etc., as those would turn into nefarious IOMMU
>> perm faults and devices with DMA target aborts).
>> If folks think the UAPI/iommu-kAPI should be agnostic to any PTE A/D
>> bits, we can instead have the ioctls be named after HWPT_SET_TRACKING()
>> and add another argument which says which bits to enable tracking for
>> (IOMMUFD_ACCESS/IOMMUFD_DIRTY/IOMMUFD_ACCESS_NONCPU).
>> Likewise for the read_and_clear() as all PTE bits follow the same logic
>> as dirty. Happy to readjust if folks think it is worthwhile.
>>
>> * IOMMU Nesting /shouldn't/ matter in this work, as it is expected that
>> we only care about the first stage of IOMMU pagetables for hypervisors,
>> i.e. tracking dirty GPAs (and not caring about dirty GIOVAs).
>
> Hypervisor uses second-stage while guest manages first-stage in nesting.
>

/me nods

>> * Dirty bit tracking alone is not enough. Large IO pages tend to be the
>> norm when DMA mapping large ranges of IOVA space, when really the VMM
>> wants the smallest granularity possible to track (i.e. host base
>> pages). A separate bit of work will need to take care of demoting IOPTE
>> page sizes at guest runtime to increase/decrease the dirty tracking
>> granularity, likely in the form of an IOAS demote/promote page-size
>> operation within a previously mapped IOVA range.
>>
>> Feedback is very much appreciated!
>
> Thanks for the work!
>

Thanks for the feedback thus far and in the rest of the patches too!
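
P.S.: for reference, the back-of-the-envelope arithmetic behind the
bitmap pinning numbers mentioned further above (one 4K page of bitmap
covering 128M of IOVA, 2M of bitmap covering 64G), assuming 4K base
pages, one bit per base page and a 64-bit unsigned long; just a sketch,
not code from the series:

#include <stdio.h>

int main(void)
{
	unsigned long base_page = 4096;				/* bytes */
	unsigned long bits_per_bitmap_page = base_page * 8;	/* 32768 bits */
	/* Each bit tracks one base page worth of IOVA. */
	unsigned long iova_per_bitmap_page = bits_per_bitmap_page * base_page;
	unsigned long bitmap_chunk = 2UL * 1024 * 1024;		/* 2M of bitmap */
	unsigned long iova_per_chunk =
		(bitmap_chunk / base_page) * iova_per_bitmap_page;

	printf("1 page of bitmap -> %lu MiB of IOVA\n",
	       iova_per_bitmap_page >> 20);			/* 128 MiB */
	printf("2 MiB of bitmap  -> %lu GiB of IOVA\n",
	       iova_per_chunk >> 30);				/* 64 GiB */
	return 0;
}
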
>>
>> [0] https://lore.kernel.org/kvm/0-v1-e79cd8d168e8+6-iommufd_jgg@xxxxxxxxxx/
>> [1] https://lore.kernel.org/kvm/20220414104710.28534-1-yi.l.liu@xxxxxxxxx/
>> [2] https://lore.kernel.org/linux-arm-kernel/20210413085457.25400-1-zhukeqian1@xxxxxxxxxx/
>>
>> Joao
>>
>> TODOs:
>> * More selftests for large/small iopte sizes;
>> * Better vIOMMU+VFIO testing (AMD doesn't support it);
>> * Performance efficiency of GET_DIRTY_IOVA in various workloads;
>> * Testing with a live migrateable VF;
>>
>> Jean-Philippe Brucker (1):
>>   iommu/arm-smmu-v3: Add feature detection for HTTU
>>
>> Joao Martins (16):
>>   iommu: Add iommu_domain ops for dirty tracking
>>   iommufd: Dirty tracking for io_pagetable
>>   iommufd: Dirty tracking data support
>>   iommu: Add an unmap API that returns dirtied IOPTEs
>>   iommufd: Add a dirty bitmap to iopt_unmap_iova()
>>   iommufd: Dirty tracking IOCTLs for the hw_pagetable
>>   iommufd/vfio-compat: Dirty tracking IOCTLs compatibility
>>   iommufd: Add a test for dirty tracking ioctls
>>   iommu/amd: Access/Dirty bit support in IOPTEs
>>   iommu/amd: Add unmap_read_dirty() support
>>   iommu/amd: Print access/dirty bits if supported
>>   iommu/arm-smmu-v3: Add read_and_clear_dirty() support
>>   iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
>>   iommu/arm-smmu-v3: Add unmap_read_dirty() support
>>   iommu/intel: Access/Dirty bit support for SL domains
>>   iommu/intel: Add unmap_read_dirty() support
>>
>> Kunkun Jiang (2):
>>   iommu/arm-smmu-v3: Add feature detection for BBML
>>   iommu/arm-smmu-v3: Enable HTTU for stage1 with io-pgtable mapping
>>
>>  drivers/iommu/amd/amd_iommu.h               |   1 +
>>  drivers/iommu/amd/amd_iommu_types.h         |  11 +
>>  drivers/iommu/amd/init.c                    |  12 +-
>>  drivers/iommu/amd/io_pgtable.c              | 100 +++++++-
>>  drivers/iommu/amd/iommu.c                   |  99 ++++++++
>>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 135 +++++++++++
>>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  14 ++
>>  drivers/iommu/intel/iommu.c                 | 152 +++++++++++-
>>  drivers/iommu/intel/pasid.c                 |  76 ++++++
>>  drivers/iommu/intel/pasid.h                 |   7 +
>>  drivers/iommu/io-pgtable-arm.c              | 232 ++++++++++++++++--
>>  drivers/iommu/iommu.c                       |  71 +++++-
>>  drivers/iommu/iommufd/hw_pagetable.c        |  79 ++++++
>>  drivers/iommu/iommufd/io_pagetable.c        | 253 +++++++++++++++++++-
>>  drivers/iommu/iommufd/io_pagetable.h        |   3 +-
>>  drivers/iommu/iommufd/ioas.c                |  35 ++-
>>  drivers/iommu/iommufd/iommufd_private.h     |  59 ++++-
>>  drivers/iommu/iommufd/iommufd_test.h        |   9 +
>>  drivers/iommu/iommufd/main.c                |   9 +
>>  drivers/iommu/iommufd/pages.c               |  79 +++++-
>>  drivers/iommu/iommufd/selftest.c            | 137 ++++++++++-
>>  drivers/iommu/iommufd/vfio_compat.c         | 221 ++++++++++++++++-
>>  include/linux/intel-iommu.h                 |  30 +++
>>  include/linux/io-pgtable.h                  |  20 ++
>>  include/linux/iommu.h                       |  64 +++++
>>  include/uapi/linux/iommufd.h                |  78 ++++++
>>  tools/testing/selftests/iommu/Makefile      |   1 +
>>  tools/testing/selftests/iommu/iommufd.c     | 135 +++++++++++
>>  28 files changed, 2047 insertions(+), 75 deletions(-)
>>
>> --
>> 2.17.2
>