On Mon, Dec 04, 2023, Jason Gunthorpe wrote: > On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote: > > > There are more approaches beyond having IOMMUFD and KVM be > > completely separate entities. E.g. extract the bulk of KVM's "TDP > > MMU" implementation to common code so that IOMMUFD doesn't need to > > reinvent the wheel. > > We've pretty much done this already, it is called "hmm" and it is what > the IO world uses. Merging/splitting huge page is just something that > needs some coding in the page table code, that people want for other > reasons anyhow. Not really. HMM is a wildly different implementation than KVM's TDP MMU. At a glance, HMM is basically a variation on the primary MMU, e.g. deals with VMAs, runs under mmap_lock (or per-VMA locks?), and faults memory into the primary MMU while walking the "secondary" HMM page tables. KVM's TDP MMU (and all of KVM's flavors of MMUs) is much more of a pure secondary MMU. The core of a KVM MMU maps GFNs to PFNs, the intermediate steps that involve the primary MMU are largely orthogonal. E.g. getting a PFN from guest_memfd instead of the primary MMU essentially boils down to invoking kvm_gmem_get_pfn() instead of __gfn_to_pfn_memslot(), the MMU proper doesn't care how the PFN was resolved. I.e. 99% of KVM's MMU logic has no interaction with the primary MMU. > > - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot deletion > > mess, the truly nasty MTRR emulation (which I still hope to delete), the NX > > hugepage mitigation, etc. > > Does it? I think that just remains isolated in kvm. The output from > KVM is only a radix table top pointer, it is up to KVM how to manage > it still. Oh, I didn't mean from a code perspective, I meant from a behaviorial perspective. E.g. there's no reason to disallow huge mappings in the IOMMU because the CPU is vulnerable to the iTLB multi-hit mitigation. > > I'm not convinced that memory consumption is all that interesting. If a VM is > > mapping the majority of memory into a device, then odds are good that the guest > > is backed with at least 2MiB page, if not 1GiB pages, at which point the memory > > overhead for pages tables is quite small, especially relative to the total amount > > of memory overheads for such systems. > > AFAIK the main argument is performance. It is similar to why we want > to do IOMMU SVA with MM page table sharing. > > If IOMMU mirrors/shadows/copies a page table using something like HMM > techniques then the invalidations will mark ranges of IOVA as > non-present and faults will occur to trigger hmm_range_fault to do the > shadowing. > > This means that pretty much all IO will always encounter a non-present > fault, certainly at the start and maybe worse while ongoing. > > On the other hand, if we share the exact page table then natural CPU > touches will usually make the page present before an IO happens in > almost all cases and we don't have to take the horribly expensive IO > page fault at all. I'm not advocating mirroring/copying/shadowing page tables between KVM and the IOMMU. I'm suggesting managing IOMMU page tables mostly independently, but reusing KVM code to do so. I wouldn't even be opposed to KVM outright managing the IOMMU's page tables. E.g. add an "iommu" flag to "union kvm_mmu_page_role" and then the implementation looks rather similar to this series. What terrifies is me sharing page tables between the CPU and the IOMMU verbatim. Yes, sharing page tables will Just Work for faulting in memory, but the downside is that _when_, not if, KVM modifies PTEs for whatever reason, those modifications will also impact the IO path. My understanding is that IO page faults are at least an order of magnitude more expensive than CPU page faults. That means that what's optimal for CPU page tables may not be optimal, or even _viable_, for IOMMU page tables. E.g. based on our conversation at LPC, write-protecting guest memory to do dirty logging is not a viable option for the IOMMU because the latency of the resulting IOPF is too high. Forcing KVM to use D-bit dirty logging for CPUs just because the VM has passthrough (mediated?) devices would be likely a non-starter. One of my biggest concerns with sharing page tables between KVM and IOMMUs is that we will end up having to revert/reject changes that benefit KVM's usage due to regressing the IOMMU usage. If instead KVM treats IOMMU page tables as their own thing, then we can have divergent behavior as needed, e.g. different dirty logging algorithms, different software-available bits, etc. It would also allow us to define new ABI instead of trying to reconcile the many incompatibilies and warts in KVM's existing ABI. E.g. off the top of my head: - The virtual APIC page shouldn't be visible to devices, as it's not "real" guest memory. - Access tracking, i.e. page aging, by making PTEs !PRESENT because the CPU doesn't support A/D bits or because the admin turned them off via KVM's enable_ept_ad_bits module param. - Write-protecting GFNs for shadow paging when L1 is running nested VMs. KVM's ABI can be that device writes to L1's page tables are exempt. - KVM can exempt IOMMU page tables from KVM's awful "drop all page tables if any memslot is deleted" ABI. > We were not able to make bi-dir notifiers with with the CPU mm, I'm > not sure that is "relatively easy" :( I'm not suggesting full blown mirroring, all I'm suggesting is a fire-and-forget notifier for KVM to tell IOMMUFD "I've faulted in GFN A, you might want to do the same". It wouldn't even necessarily need to be a notifier per se, e.g. if we taught KVM to manage IOMMU page tables, then KVM could simply install mappings for multiple sets of page tables as appropriate.