On Tue, Jun 13, 2023 at 12:50:52PM -0700, Sean Christopherson wrote:
> Maybe? A paravirt paging scheme could do whatever it wanted. The APIs could be
> designed in such a way that L1 never needs to explicitly request a TLB flush,
> e.g. if the contract is that changes must always become immediately visible to L2.
>
> And TLB flushing is but one small aspect of page table shadowing. With PV paging,
> L1 wouldn't need to manage hardware-defined page tables, i.e. could use any arbitrary
> data type. E.g. KVM as L1 could use an XArray to track L2 mappings. And L0 in
> turn wouldn't need to have vendor specific code, i.e. pKVM on x86 (potentially
> *all* architectures) could have a single nested paging scheme for both Intel and
> AMD, as opposed to needing code to deal with the differences between EPT and NPT.
>
> A few months back, I mentally worked through the flows[*] (I forget why I was
> thinking about PV paging), and I'm pretty sure that adapting x86's TDP MMU to
> support PV paging would be easy-ish, e.g. kvm_tdp_mmu_map() would become an
> XArray insertion (to track the L2 mapping) + hypercall (to inform L1 of the new
> mapping).
>
> [*] I even though of a catchy name, KVM Paravirt Only Paging, a.k.a. KPOP ;-)

Hi Sean & all,

I did a POC[1] to support KPOP (KVM Paravirt Only Paging) for KVM-on-KVM
nested guests. I am not sure whether such a solution is welcome in the KVM
community; I would appreciate any advice/direction you can give me.

From what I saw, the solution is straightforward and has a lower memory
cost (no double page tables), but a rough benchmark based on stress-ng
shows less than 1% improvement for both the cpu & vm stress tests,
compared to the legacy shadowing-mode nested guest solution.

Brief idea of this POC
----------------------

The POC intercepts the x86 KVM MMU interfaces below and turns them into
three KPOP hypercalls - KVM_HC_KPOP_MMU_LOAD_UNLOAD, KVM_HC_KPOP_MMU_MAP &
KVM_HC_KPOP_MMU_UNMAP (a minimal sketch of the L1-side wiring follows the
list):

- int (*mmu_load)(struct kvm_vcpu *vcpu);

  This op (from L1) issues a KVM_HC_KPOP_MMU_LOAD_UNLOAD hypercall for MMU
  load to L0 KVM. L0 KVM creates the L2 guest MMU page table and ensures
  the vcpu loads it as the root pgd when the corresponding nested vcpu is
  running.

- void (*mmu_unload)(struct kvm_vcpu *vcpu);

  This op (from L1) issues a KVM_HC_KPOP_MMU_LOAD_UNLOAD hypercall for MMU
  unload to L0 KVM. L0 KVM puts & frees the corresponding L2 guest MMU
  page table.

- bool (*mmu_set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range);

  This op (from L1) issues a KVM_HC_KPOP_MMU_MAP hypercall for MMU remap to
  L0 KVM. L0 KVM remaps the range's MMU mapping in all previously loaded
  L2 guest MMU page tables which belong to the L2 "kvm" and whose as_id
  (address space id) matches range->slot->as_id.

- bool (*mmu_unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range);

  This op (from L1) issues a KVM_HC_KPOP_MMU_UNMAP hypercall for MMU unmap
  to L0 KVM. L0 KVM unmaps the range's MMU mapping in all previously
  loaded L2 guest MMU page tables which belong to the L2 "kvm" and whose
  as_id matches range->slot->as_id.

- void (*mmu_zap_gfn_range)(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);

  This op (from L1) issues a KVM_HC_KPOP_MMU_UNMAP hypercall for MMU unmap
  to L0 KVM. L0 KVM unmaps the {start, end} MMU mapping in all previously
  loaded L2 guest MMU page tables which belong to the L2 "kvm" (for all
  as_id).

- void (*mmu_zap_all)(struct kvm *kvm, bool fast);

  This op (from L1) issues a KVM_HC_KPOP_MMU_UNMAP hypercall for MMU unmap
  to L0 KVM. L0 KVM zaps all MMU mappings in all previously loaded L2
  guest MMU page tables which belong to the L2 "kvm" (for all as_id).

- The page fault handling function (direct_page_fault) in L1 KVM is also
  changed in this POC to support KPOP MMU mapping. It issues a
  KVM_HC_KPOP_MMU_MAP hypercall, and L0 KVM leverages kvm_tdp_mmu_map() to
  do the MMU page mapping in the previously loaded L2 guest MMU page
  table.
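To make the interception concrete, below is a minimal sketch of the
L1-side wiring for two of the ops, assuming the three hypercall numbers
above. The kpop_vcpu_holder()/kpop_as_id()/kpop_kvm_id() helpers and the
KPOP_MMU_LOAD flag are made up for illustration; the real code is in [1].

static int kpop_mmu_load(struct kvm_vcpu *vcpu)
{
	/* Load and unload share one hypercall; the last arg picks one. */
	return kvm_hypercall3(KVM_HC_KPOP_MMU_LOAD_UNLOAD,
			      kpop_vcpu_holder(vcpu),	/* VMCS PA on x86 */
			      kpop_as_id(vcpu),		/* SMM vs. non-SMM */
			      KPOP_MMU_LOAD);
}

static bool kpop_mmu_unmap_gfn_range(struct kvm *kvm,
				     struct kvm_gfn_range *range)
{
	/*
	 * L0 walks every loaded L2 root that belongs to this "kvm" and
	 * matches the as_id, and unmaps [start, end). The return value
	 * tells the generic MMU code whether a TLB flush is needed.
	 */
	return !!kvm_hypercall4(KVM_HC_KPOP_MMU_UNMAP, kpop_kvm_id(kvm),
				range->slot->as_id, range->start,
				range->end);
}

Folding load and unload into a single hypercall keeps the number of L0
entry points down to the three listed above, keyed by {vcpu_holder,
as_id}, which is described in the next section.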
How are guest MMU page tables identified?
-----------------------------------------

An L2 guest MMU page table is identified by its L1 vcpu_holder & as_id.
L1 KVM runs an L2 vcpu after loading the L2 vcpu info into the
corresponding vcpu_holder - for x86 it is the VMCS. When L1 KVM does
mmu_load for an L2 guest MMU page table, it also assigns an as_id to that
table - for x86 it is based on whether the vcpu is running in SMM mode.
In this POC, L0 KVM maintains the L2 guest MMUs for L1 KVM in a per-VM
hash table which is hashed by the vcpu_holders. Struct kpop_guest_mmu and
several APIs are introduced for managing L2 guest MMUs:

struct kpop_guest_mmu {
	struct hlist_node hnode;
	u64 vcpu_holder;
	u64 kvm_id;
	u64 as_id;
	hpa_t root_hpa;
	refcount_t count;
};

- int kpop_alloc_guest_mmu(struct kvm_vcpu *vcpu, u64 vcpu_holder,
			   u64 kvm_id, u64 as_id)
- void kpop_put_guest_mmu(struct kvm_vcpu *vcpu, u64 vcpu_holder,
			  u64 kvm_id, u64 as_id)
- struct kpop_guest_mmu *kpop_find_guest_mmu(struct kvm *kvm,
					     u64 vcpu_holder, u64 as_id)
- int kpop_reload_guest_mmu(struct kvm_vcpu *vcpu, bool check_vcpu)

TODOs & OPENs
-------------

There are still a lot of TODOs:

- L2 translation info (XArray) in L1 KVM

  L1 KVM may need to maintain translation info (ngpa-to-gpa) for L2
  guests; one possible use case is MMIO fault optimization. A simple way
  is to maintain a translation info XArray in L1 KVM.

- Support UMIP emulation

  UMIP emulation requires L0 KVM to do instruction emulation for the L2
  guest, which needs nested address translation; usually this would be
  done by guest_kpop_mmu's gva_to_gpa op (kpop_gva_to_gpa, unimplemented
  in my POC). We can either do such translation based on an L1-maintained
  translation table (in which case an XArray may not be a good choice for
  the L1 translation table), or maintain another new translation table
  (e.g., another XArray) in L0 for the L2 guest.

- age/test_age

  The age/test_age MMU interfaces should be supported, e.g., for swap in
  the L1 VM.

- page track

  Page track should be supported, e.g., for GVT graphics page table
  shadowing usage.

- dirty log

  Dirty log should be supported for VM migration.

[1]: https://github.com/intel-staging/pKVM-IA/tree/KPOP_RFC

--
Thanks

Jason CJ Chen