On Fri, Feb 04, 2022 at 07:53:12PM +0000, Joao Martins wrote:
> On 2/3/22 15:18, Jason Gunthorpe wrote:
> > On Wed, Feb 02, 2022 at 07:05:02PM +0000, Joao Martins wrote:
> >> On 2/2/22 17:03, Jason Gunthorpe wrote:
> >>> how to integrate that with the iommufd work, which I hope will allow
> >>> that series, and the other IOMMU drivers that can support this, to be
> >>> merged..
> >>
> >> The iommu-fd thread wasn't particularly obvious on how dirty tracking
> >> is done there, but TBH I am not up to speed on iommu-fd yet so I missed
> >> something obvious for sure. When you say 'integrate that with the
> >> iommufd' can you expand on that?
> >
> > The general idea is that iommufd is the place to put all the iommu
> > driver uAPI for consumption by userspace. The IOMMU feature of dirty
> > tracking would belong there.
> >
> > So, some kind of API needs to be designed to meet the needs of the
> > IOMMU drivers.
> >
> /me nods
>
> I am gonna assume below is the most up-to-date iommufd (as you pointed
> out in another thread IIRC):
>
> https://github.com/jgunthorpe/linux iommufd
>
> Let me know if it's not :)

The iommufd part is pretty good, but there is a hacky patch to hook it
into vfio that isn't there, if you want to actually try it.

> > But, as you say, it looks unnatural and inefficient when the domain
> > itself is storing the dirty bits inside the IOPTE.
>
> How much of this is already represented as the io-pgtable in the IOMMU
> internal kAPI (if we exclude the UAPI portion of iommufd for now)? FWIW,
> that is today used by the AMD IOMMU and ARM IOMMUs. Albeit, not Intel :(

Which one are you looking at? AFAICT there is no dirty page support in
iommu_ops?

> then potentially the VMM/process can more efficiently scan the dirtied
> set? But if some layer needs to somehow mediate between the vendor
> IOPTE representation and a UAPI IOPTE representation, to be able to
> make that delegation to userspace ... then maybe both might be
> inefficient? I didn't see how iommu-fd would abstract the IOPTE
> lookup as far as I glanced through the code, perhaps that's another
> ioctl().

It is based around the same model as the VFIO container - map/unmap of
user address space into the IOPTEs, and user space doesn't see anything
resembling a 'pte' - at least for kernel-owned IO page tables.

User space page tables will not be abstracted, and userspace must know
the direct HW format of the IOMMU that is being used.

> But what strikes me /specifically/ about the dirty bit feature is that
> it looks simpler with the current VFIO; the heavy lifting seems to be
> mostly on the IOMMU vendor. The proposed API above has VFIO looking at
> the container (small changes), and the IOMMU vendor would do most of it:

It is basically the same; almost certainly the user API in iommufd will
be some 'get dirty bits' and 'unmap and give me the dirty bits', just
like vfio has.

The tricky details are around how you manage this when the system may
have multiple things involved that are capable, or not, of actually
doing dirty tracking.

> At the same time, what particularly scares me perf-wise (for the
> device being migrated) ... is the fact that we need to dynamically
> split and collapse page tables to increase the granularity at which
> we track. In the above interface it splits/collapses when you turn
> dirty tracking on/off (respectively). That's *probably* where we
> need more flexibility, not sure.

For sure that is a particularly big adventure in the iommu driver..

> Do you have thoughts on what such a device-dirty interface could look
> like? (Perhaps too early to poke while the FSM/UAPI is being worked out)

I've been thinking of the same general read-and-clear of a dirty
bitmap. It matches the KVM interface nicely.
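To make that concrete, a rough sketch of the kind of thing that could
work -- the names and layout here are invented, nothing like this exists
in iommufd today; it only borrows the shape of KVM's dirty log and the
vfio type1 dirty bitmap:

/* Hypothetical uapi sketch: one bit per page over an IOVA range, and
 * reading also clears the IOPTE dirty bits unless the no-clear flag
 * is set.
 */
#include <linux/types.h>
#include <linux/ioctl.h>

struct iommu_dirty_bitmap_get {
	__u32 size;		/* sizeof(struct iommu_dirty_bitmap_get) */
	__u32 flags;
#define IOMMU_DIRTY_NO_CLEAR	(1 << 0)	/* report without clearing */
	__u64 iova;		/* start of the range to report on */
	__u64 length;		/* length of the range in bytes */
	__u64 page_size;	/* granularity, one bit per page_size */
	__u64 data;		/* pointer to the userspace bitmap */
};

/* 'x' is a placeholder ioctl type, not a real number assignment */
#define IOMMU_DIRTY_BITMAP_GET	_IOW('x', 0, struct iommu_dirty_bitmap_get)

The 'unmap and give me the dirty bits' variant would presumably be the
same struct glued onto the unmap path.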
> I was wondering if the container has a dirty scan/sync callback
> funnelled through a vendor IOMMU ops implementation (as Shameerali's
> patches proposed),

Yes, this is almost certainly how the in-kernel parts will look.

> and the vfio vendor driver provides one per device. But this is less
> clear..
> Or propagate the dirty tracking API to the vendor vfio driver[*].
> [*] considering the device may choose where to place its tracking
> storage, and which scheme (bitmap, ring, etc) it might use.

This has been my thinking, yes.

> The reporting of the dirtying, though, looks hazy to achieve if you
> try to make it uniform even to userspace. Perhaps with iommu-fd
> you're thinking to mmap() the dirty region back to userspace, or an
> iommu-fd ioctl() updates the PTEs, while letting the kernel clear
> the dirty status via the mmap() object. And that would be the common
> API regardless of dirty-hw scheme. Anyway, just thinking out loud.

My general thinking has been that iommufd would control only the system
IOMMU hardware. The FD interface directly exposes the iommu_domain as a
manipulable object, so I'd imagine making userspace have a simple 1:1
connection to the iommu_ops of a single iommu_domain. Doing this avoids
all the weirdo questions about what you do if there is non-uniformity
in the iommu_domains.

Keeping with that theme, the vfio_device would provide a similar
interface, on its own device FD.

I don't know if mmap should be involved here; the dirty bitmaps are not
so big, I suspect a simple get_user_pages_fast() would be entirely OK.

> > VFIO proposed to squash everything
> > into the container code, but I've been mulling about having iommufd
> > only do the system iommu and push the PCI device internal tracking
> > over to VFIO.
> >
> Seems to me that the juicy part falls mostly in IOMMU vendor code; I am
> not sure yet how much we can 'offload' to a generic layer, at least
> compared with this other proposal.

Yes, I expect there is very little generic code here if we go this way.
The generic layer is just marshalling the ioctl(s) to the iommu drivers.
Certainly not providing storage or anything.

> Give me some time (a few days only, as I gotta sort some things) and
> I'll respond here as a follow up with a link to a branch with the
> WIP/PoC patches.

Great!

> 3) The dirty bit is sticky; hardware never clears it. Reading the
> access/dirty bit is cheap, clearing it is 'expensive' because one needs
> to flush the IOTLB, as the IOMMU hardware may cache the bits in the
> IOTLB as a result of an address-translation/IO-page-walk. This even
> though the IOMMU uses interlocked operations to actually update the
> Access/Dirty bit in concurrency with the CPU. The AMD manuals are a tad
> misleading as they talk about marking non-present, but that would be
> catastrophic for migration as it would mean a DMA target abort for the
> PCI device, unless I missed something obvious.
> In any case, this means that the dirty bit *clearing* needs to be
> batched as much as possible, to amortize the cost of flushing the IOTLB.
> This is the same for Intel *IIUC*.

You have to mark it as non-present to do the final read out if something
unmaps while the tracker is on - eg emulating a vIOMMU or something.
Then you mark it non-present, flush the IOTLB and read back the dirty
bit.

Otherwise, AFAIK, you flush the IOTLB to get the latest dirty bits and
then read and clear them.
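Very roughly, and with an invented IOPTE layout and flush helper standing
in for the real vendor bits, that second path could look something like:

#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

#define IOPTE_DIRTY	(1ull << 6)	/* invented bit position */

/* Stand-in for the vendor IOTLB invalidation of [iova, iova + npages) */
void flush_iotlb_range(uint64_t iova, size_t npages)
{
	(void)iova;		/* no real hardware in this sketch */
	(void)npages;
}

/*
 * Flush once for the whole batch so the hardware writes back whatever
 * dirty state it caches, then atomically test-and-clear each leaf
 * IOPTE's dirty bit (the IOMMU may be setting it concurrently) and
 * record it in the report bitmap.  Returns the number of dirty pages.
 */
size_t read_and_clear_dirty(_Atomic uint64_t *iopte, size_t npages,
			    uint64_t *bitmap, uint64_t iova)
{
	size_t i, ndirty = 0;

	flush_iotlb_range(iova, npages);

	for (i = 0; i < npages; i++) {
		uint64_t old = atomic_fetch_and(&iopte[i], ~IOPTE_DIRTY);

		if (old & IOPTE_DIRTY) {
			bitmap[i / 64] |= 1ull << (i % 64);
			ndirty++;
		}
	}
	return ndirty;
}

The single IOTLB flush gets amortized over the whole batch of PTEs,
which is the point of batching the clears.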
> 4) Adjust the granularity of pagetables in place:
> [This item wasn't done, but it is generic to any IOMMU because it
> is mostly the ability to split existing IO pages in place.]

This seems like it would be some interesting amount of driver work, but
yes, it could be a generic new iommu_domain op (rough sketch below).

> 4.b) Optionally start dirtying earlier (at provisioning) and let
> userspace dynamically split pages. This is to hopefully minimize the
> IOTLB misses we induce ourselves in item 4.a) if we were to do it
> eagerly. So dirty tracking would be enabled at creation of the
> protection domain after the vfio container is set up, and we would use
> the pages dirtied as an indication of what needs to be split. The
> problem is IO page sizes bigger than 1G, which might unnecessarily lead
> to marking too much as dirty early on; but at least it's better than
> transferring the whole set.

I'm not sure running with dirty tracking permanently on would be good
for guest performance either. I'd suspect you'd be better off with a
warm-up period where you track dirty pages and split down pages.

It is interesting; this is a possible reason why device dirty tracking
might actually perform better, because it can operate at a different
granularity from the system IOMMU without disrupting guest DMA
performance.

Jason
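The rough sketch referenced above -- invented names, nothing like this
exists in iommu_ops today; it is only meant to make the 'split in place
as a domain op' idea concrete:

#include <stdbool.h>
#include <stddef.h>

struct iommu_domain;	/* opaque for the purpose of this sketch */

/*
 * Hypothetical op table: flip dirty tracking on/off for a whole domain,
 * and split one huge leaf mapping in place so tracking can continue at
 * new_pgsize granularity without unmapping the range.
 */
struct iommu_dirty_ops_sketch {
	int (*set_dirty_tracking)(struct iommu_domain *domain, bool enable);
	int (*split_mapping)(struct iommu_domain *domain,
			     unsigned long iova, size_t size,
			     size_t new_pgsize);
};

All the interesting driver work hides inside split_mapping(): rewriting
a huge leaf IOPTE into a subtree of smaller leaves while DMA to the
range stays live, so the device never sees a fault or target abort.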