On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> On Sat, Dec 02, 2023, Yan Zhao wrote:
> Please list out the pros and cons for each.  In the cons column for piggybacking
> KVM's page tables:
>
>  - *Significantly* increases the complexity in KVM

The complexity to KVM (up to now) is:
a. faulting in non-vCPU context
b. keeping the exported root always "active"
c. disallowing non-coherent DMAs
d. the movement of SPTE_MMU_PRESENT

For a, I think it's accepted, and we can see that eager page split already
allocates non-leaf pages in non-vCPU context.

For b, it requires the exported TDP root to stay "active" across KVM's "fast
zap" (which invalidates all active TDP roots). Instead, all leaf entries of
the exported TDP are zapped. Though that looks not "fast" enough, it avoids
an unnecessary root page zap, and fast zap is actually not frequent:
- once for memslot removal (an IO page fault is unlikely to happen during
  VM boot-up)
- once for MMIO generation wraparound (which is rare)
- once for NX huge page mode changes (which is rare too)

For c, maybe we can work out a way to remove the MTRR stuff.

For d, I added a config to turn this movement on/off. But right, when the
config is on, the KVM side will have to sacrifice a bit reserved for
software usage and take care of it.

>  - Puts constraints on what KVM can/can't do in the future (see the movement
>    of SPTE_MMU_PRESENT).
>  - Subjects IOMMUFD to all of KVM's historical baggage, e.g. the memslot deletion
>    mess, the truly nasty MTRR emulation (which I still hope to delete), the NX
>    hugepage mitigation, etc.

The NX hugepage mitigation only exists on certain CPUs; I don't see it on
recent Intel platforms, e.g. SPR and GNR. We can disallow the sharing
approach if the NX huge page mitigation is enabled. Even then, if pinning or
partial pinning is not involved, NX huge pages will only cause unnecessary
zaps that reduce performance; functionally it still works well.

Besides, for the extra IO invalidation involved in a TDP zap, I think SVM
has the same issue, i.e. each zap in the primary MMU is also accompanied by
an IO invalidation.

> Please also explain the intended/expected/targeted use cases.  E.g. if the main
> use case is for device passthrough to slice-of-hardware VMs that aren't memory
> oversubscribed,

The main use case is device passthrough with all devices supporting full
IOPF.

Opportunistically, we hope it can be used in trusted IO, where the TDP is
shared to the IO side. Then only one page table audit is required, and the
out-of-sync window for mappings between the CPU and IO sides can also be
eliminated.

> > - Unified page table management
> >   The complexity of allocating guest pages per GPAs, registering to MMU
> >   notifier on host primary MMU, sub-page unmapping, atomic page merge/split
>
> Please find different terminology than "sub-page".  With Sub-Page Protection, Intel
> has more or less established "sub-page" to mean "less than 4KiB granularity".  But
> that can't possibly what you mean here because KVM doesn't support (un)mapping
> memory at <4KiB granularity.  Based on context above, I assume you mean "unmapping
> arbitrary pages within a given range".

Ok, sorry for this confusion. By "sub-page unmapping", I mean atomically
splitting a huge page and then unmapping a smaller range within the previous
huge page.