On Fri, Feb 11, 2022 at 05:28:22PM +0000, Joao Martins wrote:

> But well, at the end of the day for an IOMMU driver the domain ops are
> the important stuff, maybe the IO pgtable framework isn't as critical
> (Intel, for example, doesn't use that at all).

Right, it doesn't matter what library was used to implement the domain..

> > User space page tables will not be abstracted and the userspace must
> > know the direct HW format of the IOMMU that is being used.
> > 
> That's countering the earlier sentence? Because HW format (for me at least)
> means PTE and protection domain config format too. And if iommufd
> abstracts the HW format, modelling after the IOMMU domain and its ops,
> then it's abstracting userspace from those details, e.g. it works over
> IOVAs, but not over its vendor representation of how that IOVA is set up.
> 
> I am probably being dense.

It is both ways: one kind of domain provides a kernel-supplied map/unmap
that implements the IO PTE manipulation in kernel memory.

But if you want the IOPTE to be in user memory then the user must
read/write it and it cannot use that - so a user domain will not have
map/unmap.

> > It is basically the same, almost certainly the user API in iommufd
> > will be some 'get dirty bits' and 'unmap and give me the dirty bits'
> > just like vfio has.
> > 
> The 'unmap and give dirty bits' looks to be something TBD even in a VFIO
> migration flow.

It is essential to implement any kind of viommu behavior where map/unmap
is occurring while the dirty tracking is running. It should never make a
difference except in some ugly edge cases where the DMA and unmap are
racing.

> supposed to be happening (excluding P2P)? So perhaps the unmap is
> unneeded after quiescing the VF.

Yes, you don't need to unmap for migration, a simple fully synchronous
read and clear is sufficient. But that final read, while DMA is quiet,
must obtain the latest dirty bit data.

> We have a bitmap-based interface in KVM, but there's also a recent ring
> interface for dirty tracking, which probably has more determinism than
> a big bitmap. And if we look at hardware, AMD needs to scan NPT pagetables
> and break its entries on demand IIRC, whereas Intel resembles something
> closer to a 512-entry 'ring' with VMX PML, which tells what has been
> dirtied.

KVM has an advantage that it can throttle the rate of dirty generation,
so it can rely on logging. The IOMMU can't throttle DMA, so we are stuck
with a bitmap.

> > I don't know if mmap should be involved here, the dirty bitmaps are not
> > so big, I suspect a simple get_user_pages_fast() would be entirely OK.
> > 
> Considering that is 32MB of a bitmap per TB maybe it is cheap.

Right. GUP-fast'ing a couple of huge pages is nothing compared to
scanning 1TB of IO page table.

> >> Give me some time (few days only, as I gotta sort some things) and I'll
> >> respond here as a follow-up with a link to a branch with the WIP/PoC
> >> patches.
> > 
> > Great!
> > 
> Here it is. "A few days" turned into a week, sorry :/
> 
> https://github.com/jpemartins/qemu amd-iommu-hdsup-wip
> https://github.com/jpemartins/linux amd-vfio-iommu-hdsup-wip
> 
> Note, it is an early PoC. I still need to get the split/collapse thing
> going, fix the FIXMEs there, and have a second good look at the iommu
> page tables.

Oh, I'll try to look at this, thanks.

> > You have to mark it as non-present to do the final read out if
> > something unmaps while the tracker is on - eg emulating a viommu or
> > something. Then you mark non-present, flush the iotlb and read back
> > the dirty bit.
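Just to make that final read-out concrete, today's vfio type1 uAPI for it
is roughly the below - an untested sketch from memory, with made-up helper
and variable names, and of course not what the eventual iommufd interface
will look like:

/*
 * Fetch the dirty bitmap for [iova, iova + size) through the existing
 * vfio type1 uAPI.  Assumes a type1 container fd, 4K pgsize, and that
 * tracking was already turned on with VFIO_IOMMU_DIRTY_PAGES_FLAG_START.
 */
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static int get_dirty_bitmap(int container, uint64_t iova, uint64_t size)
{
	uint64_t npages = size / 4096;
	/* one bit per page, rounded up to u64 units */
	uint64_t bitmap_bytes = ((npages + 63) / 64) * 8;
	struct vfio_iommu_type1_dirty_bitmap *dbitmap;
	struct vfio_iommu_type1_dirty_bitmap_get *range;
	int ret;

	dbitmap = calloc(1, sizeof(*dbitmap) + sizeof(*range));
	dbitmap->argsz = sizeof(*dbitmap) + sizeof(*range);
	dbitmap->flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_GET_BITMAP;

	range = (struct vfio_iommu_type1_dirty_bitmap_get *)dbitmap->data;
	range->iova = iova;
	range->size = size;
	range->bitmap.pgsize = 4096;
	range->bitmap.size = bitmap_bytes;
	range->bitmap.data = calloc(1, bitmap_bytes);

	/* kernel pins/copies the user bitmap and fills in the dirty bits */
	ret = ioctl(container, VFIO_IOMMU_DIRTY_PAGES, dbitmap);

	/* ... consume range->bitmap.data ... */
	free(range->bitmap.data);
	free(dbitmap);
	return ret;
}

Userspace just hands in a one-bit-per-page bitmap (the ~32MB per TB
mentioned above) and the kernel fills it; the 'unmap and give me the dirty
bits' variant is the same idea via VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP.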
> You would be surprised that AMD IOMMUs have an accelerated vIOMMU
> too :) without needing VMM intervention (that's also not supported
> in Linux).

I'm sure, but dirty tracking has to happen on the kernel-owned page
table, not the user-owned one, I think..

> > Otherwise AFAIK, you flush the IOTLB to get the latest dirty bits and
> > then read and clear them.
> > 
> It's the other way around AIUI. The dirty bits are sticky, so you flush
> the IOTLB after clearing, as a means to notify the IOMMU to set the dirty
> bits again on the next memory transaction (or ATS translation).

I guess it depends on how the HW works, if it writes the dirty bit back
to RAM instantly on the first dirty, or if it only writes the dirty bit
when flushing the IOTLB.

In any case you have to synchronize with the HW in some way to ensure
that all dirty bits are 'pulled back' into system memory to resolve
races (ie you need the IOMMU HW to release and the CPU to acquire on the
dirty data) before trying to read them, at least for the final quiesced
system read.

> I am not entirely sure we need to unmap + mark non-present for non-viommu.
> That would actually mean something is not properly quiescing the VF DMA.
> Maybe we should .. to gate whether we should actually continue with LM
> if something kept doing DMA when it shouldn't have.

It is certainly an edge case. A device would be misbehaving to continue
DMA.

> > This seems like it would be some interesting amount of driver work,
> > but yes it could be a generic new iommu_domain op.
> 
> I am slightly at odds that .split and .collapse at .switch() are enough.
> But, with iommu, if we are working on top of an IOMMU domain object and
> .split and .collapse are iommu_ops, perhaps that looks to be enough
> flexibility to give userspace the ability to decide what it wants to
> split, if it starts eagerly/warming-up tracking dirty pages.
> 
> The split and collapsing is something I wanted to work on next, to get
> to a stage closer to that of an RFC on the AMD side.

split/collapse seems kind of orthogonal to me, it doesn't really connect
to dirty tracking other than being mostly useful during dirty tracking.
And I wonder how hard split is when trying to atomically preserve any
dirty bit..

> Hmmm, judging how the IOMMU works, I am not sure this is particularly
> affecting DMA performance (not sure yet about RDMA, it's something I'm
> curious to see how it gets to perform with 4K IOPTEs, and with dirty
> tracking always enabled). Considering how the bits are sticky, and
> unless the CPU clears it, it's short of a nop? Unless of course the
> checking for A^D during an atomic memory transaction is expensive.
> Needs some performance testing nonetheless.

If you leave the bits all dirty then why do it? The point is to see the
dirties, which means the IOMMU is generating a workload of dirty
cachelines while operating that it didn't do before.

> I forgot to mention, but the early enablement of IOMMU dirty tracking
> was also meant to fully know since guest creation what needs to be
> sent to the destination. Otherwise, wouldn't we need to send the whole
> pinned set to the destination, if we only start tracking dirty pages
> during migration?

? At the start of migration you have to send everything. Dirty tracking
is intended to allow the VM to be stopped and then have only a small set
of data to xfer.
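To make the 'generic new iommu_domain op' idea above a bit more concrete,
the shape I have in mind is something like the below - purely a
hypothetical sketch, nothing like this exists upstream and all the names
are invented:

#include <linux/iommu.h>

/*
 * Hypothetical only - none of these ops exist, they are just here to
 * illustrate the split of responsibilities being discussed.
 */
struct iommu_dirty_ops_sketch {
	/* Turn dirty bit harvesting on/off for the kernel-owned domain */
	int (*set_dirty_tracking)(struct iommu_domain *domain, bool enable);

	/*
	 * Copy the dirty bits for [iova, iova + size) into a caller
	 * supplied bitmap and clear them in the IOPTEs.  The driver does
	 * whatever IOTLB flushing is needed so the next DMA write
	 * re-dirties the page (the release/acquire point above).
	 */
	int (*read_and_clear_dirty)(struct iommu_domain *domain,
				    unsigned long iova, size_t size,
				    unsigned long *bitmap);

	/*
	 * Orthogonal to dirty tracking: shatter any huge IOPTEs covering
	 * [iova, iova + size) into base pages while preserving dirty
	 * bits, or fold them back into huge IOPTEs.  Taking a range lets
	 * userspace pick what to split for eager/warm-up tracking.
	 */
	int (*split)(struct iommu_domain *domain, unsigned long iova,
		     size_t size);
	int (*collapse)(struct iommu_domain *domain, unsigned long iova,
			size_t size);
};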
> Also, this is probably a differentiator for iommufd, if we were to provide
> split and collapse semantics to IOMMU domain objects that userspace can use.
> That would get more freedom, to switch dirty-tracking, and then do the
> warm-up thingie and piggyback on what it wants to split before migration.
> Perhaps the switch() should get some flag to pick where to split, I guess.

Yes, right. Split/collapse should be completely separate from dirty
tracking.

Jason