On 2/23/22 01:03, Jason Gunthorpe wrote:
> On Tue, Feb 22, 2022 at 11:55:55AM +0000, Joao Martins wrote:
>>>> If by conclusion you mean the whole thing to be merged, how can the work be
>>>> broken up to pieces if we busy-waiting on the new subsystem? Or maybe you meant
>>>> in terms of direction...
>>>
>>> I think go ahead and build it on top of iommufd, start working out the
>>> API details, etc. I think once the direction is concluded the new APIs
>>> will go forward.
>>>
>> /me nods, will do. Looking at your repository it is looking good.
>
> I would like to come with some plan for dirty tracking on iommufd and
> combine that with a plan for dirty tracking inside the new migration
> drivers.
>

I had a few things going on my end over the past weeks, but it is getting a bit better now and I will be coming back to this topic. I hope to give you more concrete update/feedback over the coming week or two wrt dirty-tracking+iommufd+amd.

So far, I am not particularly concerned that this will affect the overall iommufd design. The main thing is really the lookups to get at the vendor IOPTEs, behind what might be an iommu_sync_dirty_bitmap(domain, iova, size) API.

For toggling the tracking, I'll be simplifying the interface from the other type1 series I had, making it a simple iommu_set_feature(domain, flag, value) behind an iommufd ioctl that can enable/disable it over a domain. Perhaps the same trick could be extended to other features, to give a bit more control over what userspace is allowed to use.

I think this just needs to set/clear a feature bit in the domain, so that VFIO or userspace has full control of dirty tracking during the different stages of migration. In all of the IOMMU implementations/manuals I have read, it means setting a protection domain descriptor flag: on AMD it is a 2-bit field in the DTE; on Intel likewise, but in the PASID table and only for scalable-mode PTEs; on SMMUv3.2 there is an equivalent (albeit past work also had it always-on).
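To make that concrete, here is a rough sketch of the two entry points I have in mind. Only the two function names come from the above; the feature enum and the exact signatures are placeholders, nothing settled:

/*
 * Sketch only, not real kernel API: the two function names follow the
 * proposal above, everything else is invented for illustration.
 */
#include <linux/types.h>

struct iommu_domain;

enum iommu_domain_feature {
        IOMMU_DOMAIN_FEAT_DIRTY,        /* hypothetical feature bit */
};

/*
 * Set/clear a feature bit in the domain, i.e. flip the protection
 * domain descriptor flag (the 2-bit DTE field on AMD, the
 * scalable-mode PASID table entry on Intel, the SMMUv3.2 equivalent).
 * Called on a per-domain basis from an iommufd ioctl.
 */
int iommu_set_feature(struct iommu_domain *domain,
                      enum iommu_domain_feature flag, bool value);

/*
 * Walk the vendor IOPTEs covering [iova, iova + size) and collect
 * their dirty bits; where the resulting bitmap lives (passed in by
 * the caller vs. kept per-domain) is one of the open questions.
 */
int iommu_sync_dirty_bitmap(struct iommu_domain *domain,
                            unsigned long iova, size_t size);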
Provided iommufd /separately/ offers finer granularity over what we can do with the page tables, the VMM can then demote/promote the IOPTEs to a smaller/larger page size at will, as separate operations before and after migration respectively. That logic would probably be better placed in separate iommufd ioctls(), given that it is going to be expensive.

>> I, too, have been wondering what that is going to look like -- and how do we
>> convey the setup of dirty tracking versus the steering of it.
>
> What I suggested was to just split them.
>
> Some ioctl toward IOMMUFD will turn on the system iommu tracker - this
> would be on a per-domain basis, not on the ioas.
>
> Some ioctl toward the vfio device will turn on the device's tracker.
>

For activating and fetching data from either tracker, I see a lot in common in terms of UAPI, the difference being whether a device or a list of devices is passed as an argument to the existing dirty-tracking VFIO ioctls(). (At least that's how I am reading your suggestion.)

Though perhaps the main difference is going to be that the device tracker needs to be set up through the hardware interface, and how we carry the regions of memory to be tracked, i.e. the GPA/IOVA ranges that the device should track. The tracked GPA space is sadly not a linear GPA space. But then again, perhaps the internal VFIO API between core-VFIO and vendor-VFIO is just reading the @dma ranges we mapped. For the IOMMU this is sort of cheap and 'stateless', but setting up the device tracker might mean giving all the IOVA ranges to the PF (once?). Perhaps it is left to the vendor driver to pick when to register the IOVA space to be tracked, or perhaps that happens when you turn on the device's tracker. But in all cases the driver needs some form of awareness of those ranges, and has to convey them to the PF for tracking purposes.

> Userspace has to iterate and query each enabled tracker. This is not
> so bad because the time to make the system call is going to be tiny
> compared to the time to marshal XXGB of dirty bits.
>

Yeap. The high cost is scanning the vendor-IOMMU IOPTEs and marshaling them into a bitmap, followed by the smaller cost of copying back to userspace (which KVM does too when it is using a bitmap, same as VFIO today). Maybe it could be optimized to either avoid the copy (GUP, as you mentioned earlier in the thread), or to copy based on the number of leading zeroes of the input bitmap (from the PF), within some threshold.

> This makes qemu more complicated because it has to decide what
> trackers to turn on, but that is also the point because we do want
> userspace to be able to decide.
>

If the interface is extended to pass a device or an array of devices (if I understood you correctly), it would free VFIO from having to concatenate potentially different devices' bitmaps into one; userspace would iterate the trackers and merge the bitmaps itself (a rough sketch of that loop is at the end of this mail). Albeit it would require optimizing the marshalling a bit more to avoid too much copying.

> The other idea that has some possible interest is to allow the
> trackers to dump their dirty bits into the existing kvm tracker, then
> userspace just does a single kvm centric dirty pass.

That would probably rule out the more modern option of ring-based dirty tracking, as the KVM dirty bitmap is mutually exclusive with the KVM dirty ring. Or at least, it would require KVM to always use a bitmap during migration/dirty-rate-estimation in the presence of VFIO/vDPA devices. It's a nice idea, though. It would require making core-IOMMU aware of the bitmap as external storage for tracking (core-IOMMU, not iommufd, since the latter is a module). Although an IOMMU sharing page tables with the CPU would probably align better with reusing KVM's existing dirty bitmap interface. But I haven't given much thought to that one. I just remember reading something in the IOMMU manual about this (ISTR that iommu_v2 was made for that purpose).
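And for completeness, the userspace loop referred to above. Purely illustrative: every name here (the ioctl number, the struct, the fds) is a placeholder I made up, not proposed UAPI -- it only shows qemu querying each enabled tracker and OR-ing the results into one bitmap:

/* Illustrative only -- placeholder UAPI, 4K pages assumed. */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

struct dirty_query {
        uint64_t iova;          /* start of the range to query */
        uint64_t size;          /* length of the range in bytes */
        uint64_t *bitmap;       /* one bit per 4K page, filled by the kernel */
};

#define PLACEHOLDER_GET_DIRTY   _IOW(';', 0, struct dirty_query)

/*
 * Query each enabled tracker (system IOMMU via the iommufd, plus each
 * device tracker via its vfio device fd) and OR the per-tracker
 * bitmaps into one, so the caller ends up with the union of all
 * dirtied pages in [iova, iova + size).
 */
static int collect_dirty(const int *tracker_fds, int ntrackers,
                         uint64_t iova, uint64_t size, uint64_t *merged)
{
        uint64_t pages = size / 4096;
        uint64_t words = (pages + 63) / 64;
        uint64_t tmp[words];    /* VLA for brevity; real code would allocate */
        struct dirty_query q = { .iova = iova, .size = size, .bitmap = tmp };
        uint64_t w;
        int i;

        memset(merged, 0, words * sizeof(*merged));
        for (i = 0; i < ntrackers; i++) {
                if (ioctl(tracker_fds[i], PLACEHOLDER_GET_DIRTY, &q) < 0)
                        return -1;
                for (w = 0; w < words; w++)
                        merged[w] |= tmp[w];
        }
        return 0;
}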