On Tue, Jun 13, 2023 at 12:50:52PM -0700, Sean Christopherson wrote:
> Maybe? A paravirt paging scheme could do whatever it wanted. The APIs could be
> designed in such a way that L1 never needs to explicitly request a TLB flush,
> e.g. if the contract is that changes must always become immediately visible to L2.
>
> And TLB flushing is but one small aspect of page table shadowing. With PV paging,
> L1 wouldn't need to manage hardware-defined page tables, i.e. could use any arbitrary
> data type. E.g. KVM as L1 could use an XArray to track L2 mappings. And L0 in
> turn wouldn't need to have vendor specific code, i.e. pKVM on x86 (potentially
> *all* architectures) could have a single nested paging scheme for both Intel and
> AMD, as opposed to needing code to deal with the differences between EPT and NPT.
>
> A few months back, I mentally worked through the flows[*] (I forget why I was
> thinking about PV paging), and I'm pretty sure that adapting x86's TDP MMU to
> support PV paging would be easy-ish, e.g. kvm_tdp_mmu_map() would become an
> XArray insertion (to track the L2 mapping) + hypercall (to inform L1 of the new
> mapping).
>
> [*] I even though of a catchy name, KVM Paravirt Only Paging, a.k.a. KPOP ;-)

Hi Sean & all,

I did a POC[1] to support KPOP (KVM Paravirt Only Paging) for KVM-on-KVM
nested guests. I am not sure whether such a solution is welcome in the KVM
community; I would appreciate any advice/direction you can give me.

From what I saw, the solution is straightforward and has a lower memory
cost (no double page tables), but a rough benchmark based on stress-ng
shows less than 1% improvement for both the cpu & vm stress tests,
compared to the legacy shadowing-mode nested guest solution.

Brief idea of this POC
----------------------

The POC intercepts the x86 KVM MMU interfaces below and turns them into
three KPOP hypercalls - KVM_HC_KPOP_MMU_LOAD_UNLOAD, KVM_HC_KPOP_MMU_MAP &
KVM_HC_KPOP_MMU_UNMAP (a minimal sketch of the L1-side wiring follows the
list):

- int (*mmu_load)(struct kvm_vcpu *vcpu);

  This op (from L1) issues a KVM_HC_KPOP_MMU_LOAD_UNLOAD hypercall for MMU
  load to L0 KVM. L0 KVM creates the L2 guest MMU page table and ensures
  the vcpu loads it as the root pgd when the corresponding nested vcpu is
  running.

- void (*mmu_unload)(struct kvm_vcpu *vcpu);

  This op (from L1) issues a KVM_HC_KPOP_MMU_LOAD_UNLOAD hypercall for MMU
  unload to L0 KVM. L0 KVM puts & frees the corresponding L2 guest MMU
  page table.

- bool (*mmu_set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);

  This op (from L1) issues a KVM_HC_KPOP_MMU_MAP hypercall for MMU remap to
  L0 KVM. L0 KVM remaps the range's MMU mapping in all previously loaded
  L2 guest MMU page tables which belong to the L2 "kvm" and whose as_id
  (address space id) matches range->slot->as_id.

- bool (*mmu_unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);

  This op (from L1) issues a KVM_HC_KPOP_MMU_UNMAP hypercall for MMU unmap
  to L0 KVM. L0 KVM unmaps the range's MMU mapping in all previously
  loaded L2 guest MMU page tables which belong to the L2 "kvm" and whose
  as_id matches range->slot->as_id.

- void (*mmu_zap_gfn_range)(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);

  This op (from L1) issues a KVM_HC_KPOP_MMU_UNMAP hypercall for MMU unmap
  to L0 KVM. L0 KVM unmaps the {start, end} MMU mapping in all previously
  loaded L2 guest MMU page tables which belong to the L2 "kvm" (for all
  as_id).

- void (*mmu_zap_all)(struct kvm *kvm, bool fast);

  This op (from L1) issues a KVM_HC_KPOP_MMU_UNMAP hypercall for MMU unmap
  to L0 KVM. L0 KVM zaps all MMU mappings in all previously loaded L2
  guest MMU page tables which belong to the L2 "kvm" (for all as_id).

- The page fault handling function (direct_page_fault) in L1 KVM is also
  changed in this POC to support KPOP MMU mapping. It issues a
  KVM_HC_KPOP_MMU_MAP hypercall, and L0 KVM leverages kvm_tdp_mmu_map() to
  do the MMU page mapping in the previously loaded L2 guest MMU page
  table.
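To make the interception concrete, below is a minimal sketch of the
L1-side wiring for two of the ops, assuming the three hypercall numbers
above. The kpop_vcpu_holder()/kpop_as_id()/kpop_kvm_id() helpers and the
KPOP_MMU_LOAD flag are made up for illustration; the real code is in [1].

static int kpop_mmu_load(struct kvm_vcpu *vcpu)
{
	/* Load and unload share one hypercall; the last arg picks one. */
	return kvm_hypercall3(KVM_HC_KPOP_MMU_LOAD_UNLOAD,
			      kpop_vcpu_holder(vcpu),	/* VMCS PA on x86 */
			      kpop_as_id(vcpu),		/* SMM vs. non-SMM */
			      KPOP_MMU_LOAD);
}

static bool kpop_mmu_unmap_gfn_range(struct kvm *kvm,
				     struct kvm_gfn_range *range)
{
	/*
	 * L0 walks every loaded L2 root that belongs to this "kvm" and
	 * matches the as_id, and unmaps [start, end). The return value
	 * tells the generic MMU code whether a TLB flush is needed.
	 */
	return !!kvm_hypercall4(KVM_HC_KPOP_MMU_UNMAP, kpop_kvm_id(kvm),
				range->slot->as_id, range->start,
				range->end);
}

Folding load and unload into a single hypercall keeps the number of L0
entry points down to the three listed above, keyed by {vcpu_holder,
as_id}, which is described in the next section.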
How are guest MMU page tables identified?
-----------------------------------------

An L2 guest MMU page table is identified by its L1 vcpu_holder & as_id.
L1 KVM runs an L2 vcpu after loading the L2 vcpu info into the
corresponding vcpu_holder - for x86 it is the VMCS. When L1 KVM does
mmu_load for an L2 guest MMU page table, it also assigns an as_id to that
table - for x86 it is based on whether the vcpu is running in SMM mode.
In this POC, L0 KVM maintains the L2 guest MMUs for L1 KVM in a per-VM
hash table which is hashed by the vcpu_holders. Struct kpop_guest_mmu and
several APIs are introduced for managing L2 guest MMUs:

struct kpop_guest_mmu {
	struct hlist_node hnode;
	u64 vcpu_holder;
	u64 kvm_id;
	u64 as_id;
	hpa_t root_hpa;
	refcount_t count;
};

- int kpop_alloc_guest_mmu(struct kvm_vcpu *vcpu, u64 vcpu_holder,
			   u64 kvm_id, u64 as_id)
- void kpop_put_guest_mmu(struct kvm_vcpu *vcpu, u64 vcpu_holder,
			  u64 kvm_id, u64 as_id)
- struct kpop_guest_mmu *kpop_find_guest_mmu(struct kvm *kvm,
					     u64 vcpu_holder, u64 as_id)
- int kpop_reload_guest_mmu(struct kvm_vcpu *vcpu, bool check_vcpu)

TODOs & OPENs
-------------

There are still a lot of TODOs:

- L2 translation info (XArray) in L1 KVM

  L1 KVM may need to maintain translation info (ngpa-to-gpa) for L2
  guests; one possible use case is MMIO fault optimization. A simple way
  is to maintain a translation info XArray in L1 KVM.

- Support UMIP emulation

  UMIP emulation requires L0 KVM to do instruction emulation for the L2
  guest, which needs nested address translation; usually this would be
  done by guest_kpop_mmu's gva_to_gpa op (kpop_gva_to_gpa, unimplemented
  in my POC). We can either do such translation based on an L1-maintained
  translation table (in which case an XArray may not be a good choice for
  the L1 translation table), or maintain another new translation table
  (e.g., another XArray) in L0 for the L2 guest.

- age/test_age

  The age/test_age MMU interfaces should be supported, e.g., for swap in
  the L1 VM.

- page track

  Page track should be supported, e.g., for GVT graphics page table
  shadowing usage.

- dirty log

  Dirty log should be supported for VM migration.

[1]: https://github.com/intel-staging/pKVM-IA/tree/KPOP_RFC

--
Thanks

Jason CJ Chen