On Tue, 24 Oct 2023 at 21:51, Joao Martins <joao.m.martins@xxxxxxxxxx> wrote:
>
> v6 is a replacement of what's in iommufd next:
> https://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd.git/log/?h=for-next
>
> base-commit: b5f9e63278d6f32789478acf1ed41d21d92b36cf
>
> (from the iommufd tree)
>
> =========>8=========
>
> Presented herewith is a series that extends IOMMUFD to have IOMMU hardware support for the dirty bit in the IOPTEs.
>
> Today, AMD Milan (or more recent) supports it, while ARM SMMUv3.2 and VT-d rev3.x also support it. One intended (but not exclusive) use case is Live Migration with SR-IOV, especially useful for live-migratable PCI devices that cannot supply their own dirty tracking hardware blocks, among others.
>
> At a quick glance, IOMMUFD lets userspace create an IOAS with a set of IOVA ranges mapped to some physical memory, composing an IO pagetable. This is then created via HWPT_ALLOC or attached to a particular device/hwpt, consequently creating the IOMMU domain and sharing a common IO page table representing the endpoint's DMA-addressable guest address space. Since v2 of the series, IOMMUFD dirty tracking is required to go through the HWPT_ALLOC model only, as opposed to the simpler autodomains model.
>
> The result is an hw_pagetable which represents the iommu_domain which will be directly manipulated. The IOMMUFD UAPI and the iommu/iommufd kAPI are then extended to provide:
>
> 1) Enforcement that only devices with dirty tracking support are attached to an IOMMU domain, to cover the case where this isn't homogeneous across the platform. Initially this is aimed more at the possibly heterogeneous nature of ARM, while x86 gets future-proofed, should any such occasion occur.
>
> The device dirty tracking enforcement on attach_dev is made whether the dirty_ops are set or not. Given that attach always checks for dirty ops and IOMMU_CAP_DIRTY, while writing this I almost wanted to move the check to an upper layer, but semantically the iommu driver should do the checking.
>
> 2) Toggling of dirty tracking on the iommu_domain. We model the most common case of changing hardware translation control structures dynamically (x86), while making it easier to have an always-enabled mode. In RFCv1, the ARM-specific case was suggested to be always enabled instead of having to enable the per-PTE DBM control bit (what I previously called "range tracking"). Here, setting/clearing tracking just means clearing the dirty bits at start. The 'real' tracking of whether dirty tracking is enabled is stored in the IOMMU driver, hence no new fields are added to iommufd pagetable structures, except for adding a dirty_ops field to iommu_domain. IOMMUFD also uses that to know whether dirty tracking is supported and toggleable, without having iommu drivers replicate said checks.
>
> 3) Capability probing for dirty tracking, leveraging the per-device iommu_capable() and adding an IOMMU_CAP_DIRTY. It extends the GET_HW_INFO ioctl, which takes a device ID, to *additionally* return some generic capabilities. Possible values are enumerated by `enum iommufd_hw_capabilities`.
>
> 4) Reading the I/O PTEs and marshalling their dirtiness into a bitmap. The bitmap indexes, on a page_size basis, the IOVAs that got written by the device. While performing the marshalling, drivers also need to clear the dirty bits from the IOPTEs and allow the kAPI caller to batch the much-needed IOTLB flush.
>
> There's no copy of bitmaps to userspace-backed memory; everything is zerocopy-based so as not to add more cost to the iommu driver IOPT walker. This shares functionality with VFIO device dirty tracking via the IOVA bitmap APIs. So far this is a test-and-clear kind of interface, given that the IOPT walk is going to be expensive. In addition, this also adds the ability to read dirty bit info without clearing it in the PTEs. This is meant to cover the unmap-and-read-dirty use case and avoid the second IOTLB flush.
>
> The only dependency is:
> * Having the domain_alloc_user() API with flags [2] already queued (iommufd/for-next).
>
> The series is organized as follows:
>
> * Patches 1-4: Take care of the iommu domain operations to be added. The idea is to abstract iommu drivers away from any notion of how bitmaps are stored or propagated back to the caller, as well as allowing control/batching over the IOTLB flush. So there's a data structure and a helper that only tell the upper layer that an IOVA range got dirty. This logic is shared with VFIO, and it handles walking the bitmap user memory, kmap-ing it and setting bits as needed. The IOMMU driver just has a notion of a 'dirty bitmap state' and of recording an IOVA as dirty.
>
> * Patches 5-9, 13-18: Add the UAPIs for IOMMUFD, and selftests. The selftests cover some corner cases in bitmap boundary handling and exercise various bitmap sizes. I haven't included huge IOVA ranges, to avoid risking the selftests failing to execute due to OOM issues from mmaping big buffers.
>
> * Patches 10-11: AMD IOMMU implementation, particularly for IOMMUs with HDSup support. Tested with a QEMU amd-iommu with HDSup emulated [0], and also tested with live migration with VFs (but with IOMMU dirty tracking).
>
> * Patch 12: Intel IOMMU rev3.x+ implementation. Tested with a QEMU-based intel-iommu vIOMMU with SSADS emulation support [0].
>
> On AMD/Intel I have tested this with emulation and then with live migration on AMD hardware.
>
> The QEMU iommu emulation bits are there to increase coverage of this code and hopefully make it more broadly available for fellow contributors/devs (old version [1]); they use Yi's 2 commits to have hw_info() supported (still needing a bit of cleanup) on top of a recent Zhenzhong series of IOMMUFD QEMU bringup work: see here [0]. That includes IOMMUFD dirty tracking for live migration, and live migration was tested with it. I won't be following up with a v2 of the QEMU patches until IOMMUFD tracking lands.
>
> Feedback or any comments are very much appreciated.
>
> Thanks!
> Joao

Hi Joao and Yi,

I just tried this on aarch64, doing live migration with "iommu=nested-smmuv3", and it does not work: vbasedev->dirty_pages_supported=0

qemu-system-aarch64: -device vfio-pci-nohotplug,host=0000:75:00.1,iommufd=iommufd0,enable-migration=on,x-pre-copy-dirty-page-tracking=off: warning: 0000:75:00.1: VFIO device doesn't support device and IOMMU dirty tracking
qemu-system-aarch64: -device vfio-pci-nohotplug,host=0000:75:00.1,iommufd=iommufd0,enable-migration=on,x-pre-copy-dirty-page-tracking=off: vfio 0000:75:00.1: 0000:75:00.1: Migration is currently not supported with vIOMMU enabled

The blocker comes from hw/vfio/migration.c:

    if (vfio_viommu_preset(vbasedev)) {
        error_setg(&err, "%s: Migration is currently not supported "
                   "with vIOMMU enabled", vbasedev->name);
        goto add_blocker;
    }
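
To check my own understanding of points 1)-4) in the quoted cover letter, below is a minimal sketch of how I expect userspace to drive the new dirty tracking UAPI. The ioctl and struct names are my reading of the series' include/uapi/linux/iommufd.h and may be off; the dev_id/ioas_id setup and proper error reporting are omitted.

    /*
     * Rough flow, numbered to match the cover letter's 1)-4).
     * All names are my reading of the series' UAPI and may be off.
     */
    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/iommufd.h>

    static int hwpt_dirty_flow(int iommufd, uint32_t dev_id, uint32_t ioas_id,
                               uint64_t iova, uint64_t length,
                               uint64_t page_size,
                               uint64_t *bitmap /* caller-allocated, zeroed */)
    {
        /* 3) Probe whether this device's IOMMU can do dirty tracking */
        struct iommu_hw_info info = {
            .size = sizeof(info),
            .dev_id = dev_id,
        };
        if (ioctl(iommufd, IOMMU_GET_HW_INFO, &info))
            return -1;
        if (!(info.out_capabilities & IOMMU_HW_CAP_DIRTY_TRACKING))
            return -1;

        /* 1) Allocate the HWPT with dirty tracking enforced at alloc time */
        struct iommu_hwpt_alloc alloc = {
            .size = sizeof(alloc),
            .flags = IOMMU_HWPT_ALLOC_DIRTY_TRACKING,
            .dev_id = dev_id,
            .pt_id = ioas_id,
        };
        if (ioctl(iommufd, IOMMU_HWPT_ALLOC, &alloc))
            return -1;

        /* 2) Toggle dirty tracking on the iommu_domain behind the HWPT */
        struct iommu_hwpt_set_dirty_tracking set = {
            .size = sizeof(set),
            .flags = IOMMU_HWPT_DIRTY_TRACKING_ENABLE,
            .hwpt_id = alloc.out_hwpt_id,
        };
        if (ioctl(iommufd, IOMMU_HWPT_SET_DIRTY_TRACKING, &set))
            return -1;

        /*
         * 4) Read-and-clear dirty bits for an IOVA range into the bitmap;
         *    IOMMU_HWPT_GET_DIRTY_BITMAP_NO_CLEAR in .flags would read
         *    without clearing (the unmap-and-read-dirty case).
         */
        struct iommu_hwpt_get_dirty_bitmap get = {
            .size = sizeof(get),
            .hwpt_id = alloc.out_hwpt_id,
            .iova = iova,
            .length = length,
            .page_size = page_size,
            .data = (uintptr_t)bitmap,
        };
        return ioctl(iommufd, IOMMU_HWPT_GET_DIRTY_BITMAP, &get);
    }

If that reading is right, the capability probe is per device, so on this SMMUv3 setup I would presumably also see the dirty tracking capability bit come back clear until SMMUv3 support for it lands (the series only adds the AMD and Intel implementations).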

Does this mean that live migration with vIOMMU is still not ready? It is not an error as such; it is how migration is being blocked until all the other related feature support for vIOMMU is added. So is more work still needed to enable migration with vIOMMU?

By the way, live migration works if I remove "iommu=nested-smmuv3". Any suggestions?

Thanks
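
P.S. For my own notes, this is how I read the bitmap layout described in point 4) of the cover letter: one bit per page_size unit of the queried IOVA range, starting at the range's base IOVA. The helper below is purely illustrative (not code from the series) and assumes a 64-bit little-endian host for the word/bit ordering:

    #include <stdbool.h>
    #include <stdint.h>

    /*
     * Bit N of the bitmap covers
     * [start_iova + N * page_size, start_iova + (N + 1) * page_size).
     */
    static bool iova_test_dirty(const uint64_t *bitmap, uint64_t start_iova,
                                uint64_t page_size, uint64_t iova)
    {
        uint64_t bit = (iova - start_iova) / page_size;

        return bitmap[bit / 64] & (1ULL << (bit % 64));
    }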