Hi, Joao

On Tue, 24 Oct 2023 at 21:51, Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
>
> v6 is a replacement of what's in iommufd next:
> https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git/log/?h=for-next
>
> base-commit: b5f9e63278d6f32789478acf1ed41d21d92b36cf
>
> (from the iommufd tree)
>
> =========>8=========
>
> Presented herewith is a series that extends IOMMUFD with IOMMU hardware
> support for the dirty bit in the IOPTEs.
>
> Today, AMD Milan (or more recent) supports it, and ARM SMMUv3.2 as well
> as VT-d rev3.x also do. One intended use case (but not the only one!) is
> to support live migration with SR-IOV, especially useful for
> live-migratable PCI devices that cannot supply their own dirty-tracking
> hardware blocks, among others.
>
> At a quick glance, IOMMUFD lets userspace create an IOAS with a set of
> IOVA ranges mapped to some physical memory, composing an IO pagetable.
> This is then either created via HWPT_ALLOC or attached to a particular
> device/hwpt, consequently creating the IOMMU domain and sharing a common
> IO page table that represents the endpoint's DMA-addressable guest
> address space. Since v2 of the series, IOMMUFD dirty tracking requires
> the HWPT_ALLOC model only, as opposed to the simpler autodomains model.
>
> The result is an hw_pagetable which represents the iommu_domain that
> will be directly manipulated. The IOMMUFD UAPI and the iommu/iommufd
> kAPI are then extended to provide:
>
> 1) Enforcement that only devices with dirty tracking support are
> attached to an IOMMU domain, to cover the case where support isn't
> homogeneous across the platform. Initially this is aimed more at the
> possibly heterogeneous nature of ARM, while x86 gets future-proofed,
> should any such occasion occur.
>
> The device dirty-tracking enforcement on attach_dev is made whether or
> not the dirty_ops are set. Given that attach always checks for dirty
> ops and IOMMU_CAP_DIRTY, while writing this I was tempted to move it to
> an upper layer, but semantically the iommu driver should do the
> checking.
>
> 2) Toggling of dirty tracking on the iommu_domain. We model it after
> the most common case of changing the hardware translation control
> structures dynamically (x86), while making it easy to have an
> always-enabled mode. In RFCv1, the suggestion for the ARM-specific case
> was to keep it always enabled instead of toggling the per-PTE DBM
> control bit (what I previously called "range tracking"). Here,
> setting/clearing tracking just means clearing the dirty bits at start.
> The 'real' state of whether dirty tracking is enabled is stored in the
> IOMMU driver, hence no new fields are added to the iommufd pagetable
> structures, except for adding a dirty_ops field to iommu_domain.
> IOMMUFD also uses that to know whether dirty tracking is supported and
> toggleable, without having iommu drivers replicate said checks.
>
> 3) Capability probing for dirty tracking, leveraging the per-device
> iommu_capable() and adding IOMMU_CAP_DIRTY. It extends the GET_HW_INFO
> ioctl, which takes a device ID, to additionally return some generic
> capabilities. Possible values are enumerated by `enum
> iommufd_hw_capabilities`.
>
> 4) Reading the I/O PTEs and marshalling their dirtiness into a bitmap.
> The bitmap indexes, on a page_size basis, the IOVAs that were written
> by the device. While performing the marshalling, drivers also need to
> clear the dirty bits from the IOPTEs and allow the kAPI caller to batch
> the much needed IOTLB flush.
>
> There's no copy of bitmaps to userspace-backed memory; everything is
> zerocopy based, so as not to add more cost to the iommu driver IOPT
> walker. This shares functionality with VFIO device dirty tracking via
> the IOVA bitmap APIs. So far this is a test-and-clear kind of
> interface, given that the IOPT walk is going to be expensive. In
> addition, it also adds the ability to read dirty-bit info without
> clearing the PTEs. This is meant to cover the unmap-and-read-dirty use
> case and avoid the second IOTLB flush.
>
> The only dependency is:
> * Have the domain_alloc_user() API with flags [2] already queued
> (iommufd/for-next).
>
> The series is organized as follows:
>
> * Patches 1-4: Take care of the iommu domain operations to be added.
> The idea is to abstract iommu drivers from any notion of how bitmaps
> are stored or propagated back to the caller, as well as allowing
> control/batching over IOTLB flushes. So there's a data structure and a
> helper that only tell the upper layer that an IOVA range got dirty.
> This logic is shared with VFIO, and it takes care of walking the
> bitmap user memory, kmap-ing it and setting bits as needed. The IOMMU
> driver just has a notion of a 'dirty bitmap state' and of recording an
> IOVA as dirty.
>
> * Patches 5-9, 13-18: Add the UAPIs for IOMMUFD, plus selftests. The
> selftests cover some corner cases in bitmap boundary handling and
> exercise various bitmap sizes. I haven't included huge IOVA ranges, to
> avoid the selftests failing to execute due to OOM issues when mmaping
> big buffers.
>
> * Patches 10-11: AMD IOMMU implementation, particularly for IOMMUs with
> HDSup support. Tested with a QEMU amd-iommu with HDSup emulated [0],
> and tested with live migration with VFs (relying on IOMMU dirty
> tracking).
>
> * Patch 12: Intel IOMMU rev3.x+ implementation. Tested with a
> QEMU-based intel-iommu vIOMMU with SSADS emulation support [0].
>
> On AMD/Intel I have tested this with emulation, and then with live
> migration on AMD hardware.
>
> The QEMU IOMMU emulation bits are there to increase coverage of this
> code and hopefully make it more broadly available to fellow
> contributors/devs (old version at [1]); it uses Yi's 2 commits to have
> hw_info() supported (still needs a bit of cleanup), on top of a recent
> Zhenzhong series of IOMMUFD QEMU bringup work: see [0]. It includes
> IOMMUFD dirty tracking for live migration, and live migration has been
> tested with it. I won't be following up with a v2 of the QEMU patches
> until the IOMMUFD tracking lands.
>
> Feedback or any comments are very much appreciated.
>
> Thanks!
> Joao
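To make sure I understand the intended userspace flow, below is a rough
sketch of how I expected to drive points 2)-4) above. I'm going by the
names the cover letter mentions (GET_HW_INFO, HWPT_ALLOC) plus my reading
of the patches; the dirty-tracking ioctl/flag/struct names in the sketch
(IOMMU_HW_CAP_DIRTY_TRACKING, IOMMU_HWPT_ALLOC_DIRTY_TRACKING,
IOMMU_HWPT_SET_DIRTY_TRACKING, IOMMU_HWPT_GET_DIRTY_BITMAP) are my
assumptions and may not match the series exactly:

/*
 * Sketch of the dirty tracking flow as I understood it; names taken
 * from my reading of the series, not tested code.
 */
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/iommufd.h>

static int dirty_tracking_flow(int iommufd, uint32_t dev_id, uint32_t ioas_id,
			       uint64_t iova, uint64_t length,
			       uint64_t page_size, void *bitmap)
{
	/* 3) Probe the capability via GET_HW_INFO's extra capabilities */
	struct iommu_hw_info info = {
		.size = sizeof(info),
		.dev_id = dev_id,
	};
	if (ioctl(iommufd, IOMMU_GET_HW_INFO, &info))
		return -errno;
	if (!(info.out_capabilities & IOMMU_HW_CAP_DIRTY_TRACKING))
		return -EOPNOTSUPP;

	/* 1) Allocate the hwpt with dirty tracking enforced at attach */
	struct iommu_hwpt_alloc alloc = {
		.size = sizeof(alloc),
		.flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING,
		.dev_id = dev_id,
		.pt_id = ioas_id,
	};
	if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc))
		return -errno;

	/* 2) Toggle dirty tracking on (dirty bits get cleared at start) */
	struct iommu_hwpt_set_dirty_tracking set = {
		.size = sizeof(set),
		.flags = IOMMU_HWPT_DIRTY_TRACKING_ENABLE,
		.hwpt_id = alloc.out_hwpt_id,
	};
	if (ioctl(iommufd, IOMMU_HWPT_SET_DIRTY_TRACKING, &set))
		return -errno;

	/* 4) Read-and-clear dirty bits into a page_size-granular bitmap */
	struct iommu_hwpt_get_dirty_bitmap get = {
		.size = sizeof(get),
		.hwpt_id = alloc.out_hwpt_id,
		.iova = iova,
		.length = length,
		.page_size = page_size,
		.data = (uintptr_t)bitmap,
	};
	if (ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_BITMAP, &get))
		return -errno;

	return 0;
}

Is that roughly the expected sequence?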
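One thing I could not tell from the cover letter: does the device side
still need to expose the VFIO migration uAPI (i.e. a migration-capable
variant driver), independently of the IOMMU dirty tracking added here?
The quick check I'm using locally for that, based on
VFIO_DEVICE_FEATURE_MIGRATION from linux/vfio.h, is roughly the following
(a sketch of my own probe, not something from this series):

/*
 * Probe whether the VFIO device fd reports the migration feature; my
 * assumption is that this is separate from IOMMU dirty tracking.
 */
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int probe_vfio_migration(int device_fd)
{
	struct vfio_device_feature_migration mig = {};
	/* uint64_t buffer keeps the feature header 8-byte aligned */
	uint64_t buf[(sizeof(struct vfio_device_feature) + sizeof(mig) + 7) / 8];
	struct vfio_device_feature *feature = (struct vfio_device_feature *)buf;

	memset(buf, 0, sizeof(buf));
	feature->argsz = sizeof(struct vfio_device_feature) + sizeof(mig);
	feature->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_MIGRATION;

	if (ioctl(device_fd, VFIO_DEVICE_FEATURE, feature))
		return -errno;	/* driver does not expose migration */

	memcpy(&mig, feature->data, sizeof(mig));
	return (mig.flags & VFIO_MIGRATION_STOP_COPY) ? 0 : -EOPNOTSUPP;
}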
Is this patchset enough for iommufd live migration? I just tried live
migration on a local machine, and it reports "VFIO migration is not
supported in kernel".

Thanks