On 6/13/23 21:50, Sean Christopherson wrote:
> On Fri, Jun 09, 2023, Dmytro Maluka wrote:
>> Yeah indeed, good point.
>>
>> Is my understanding correct: TLB flush is still gonna be requested by
>> the host VM via a hypercall, but the benefit is that the hypervisor
>> merely needs to do INVEPT?
>
> Maybe? A paravirt paging scheme could do whatever it wanted. The APIs could be
> designed in such a way that L1 never needs to explicitly request a TLB flush,
> e.g. if the contract is that changes must always become immediately visible to L2.
>
> And TLB flushing is but one small aspect of page table shadowing. With PV paging,
> L1 wouldn't need to manage hardware-defined page tables, i.e. could use any arbitrary
> data type. E.g. KVM as L1 could use an XArray to track L2 mappings. And L0 in
> turn wouldn't need to have vendor specific code, i.e. pKVM on x86 (potentially
> *all* architectures) could have a single nested paging scheme for both Intel and
> AMD, as opposed to needing code to deal with the differences between EPT and NPT.
>
> A few months back, I mentally worked through the flows[*] (I forget why I was
> thinking about PV paging), and I'm pretty sure that adapting x86's TDP MMU to
> support PV paging would be easy-ish, e.g. kvm_tdp_mmu_map() would become an
> XArray insertion (to track the L2 mapping) + hypercall (to inform L0 of the new
> mapping).
>
> [*] I even thought of a catchy name, KVM Paravirt Only Paging, a.k.a. KPOP ;-)

Yeap indeed, thanks. (I should have realized myself that it's rather
pointless to use hardware-defined page tables and TLB semantics in L1
if we go full PV.)

In pKVM on ARM [1] it already looks similar to what you described and
is pretty simple: L1 pins the guest page, issues the
__pkvm_host_map_guest hypercall to map it, and remembers it in an
RB-tree so it can unpin it later.

One concern though: can this be done lock-efficiently? For example, in
the pKVM-ARM code in [1] this (hypercall + RB-tree insertion) is done
under a write-locked kvm->mmu_lock, so I assume it is prone to
contention when stage-2 page faults occur simultaneously on multiple
CPUs of the same VM.

In pKVM on Intel we have the same per-VM lock contention issue, though
in L0 (see pkvm_handle_shadow_ept_violation() in [2]), and we are
already seeing a ~50% perf drop caused by it in some benchmarks.

(To be precise, though, eliminating this per-VM write lock would not be
enough to eliminate the contention: on both ARM and x86 there is also
global locking in pKVM in L0 further down the path [3], for different
reasons.)

[1] https://android.googlesource.com/kernel/common/+/d73b3af21fb90f6556383865af6ee16e4735a4a6/arch/arm64/kvm/mmu.c#1341
[2] https://lore.kernel.org/all/20230312180345.1778588-9-jason.cj.chen@xxxxxxxxx/
[3] https://android.googlesource.com/kernel/common/+/d73b3af21fb90f6556383865af6ee16e4735a4a6/arch/arm64/kvm/hyp/nvhe/mem_protect.c#2176
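
To make the "XArray insertion + hypercall" idea a bit more concrete,
here is a minimal sketch of what such a mapping path in L1 could look
like. This is my own illustration, not code from any existing tree:
pv_map_hypercall() and its ABI are made up, only the XArray calls are
the stock kernel API.

#include <linux/xarray.h>
#include <linux/types.h>

struct pv_guest_mapping {
	u64 pfn;	/* host pfn backing the guest page */
	u32 prot;	/* protection bits understood by L0 */
};

/* Hypothetical L1 -> L0 hypercall; a real PV paging ABI would define this. */
int pv_map_hypercall(u64 gfn, u64 pfn, u32 prot);

/* One XArray per VM, indexed by gfn, replaces hardware-format tables in L1. */
static int pv_map_guest_page(struct xarray *mappings, u64 gfn,
			     struct pv_guest_mapping *m)
{
	void *old;
	int ret;

	/* Track the L2 mapping in L1's software-only structure. */
	old = xa_store(mappings, gfn, m, GFP_KERNEL_ACCOUNT);
	if (xa_is_err(old))
		return xa_err(old);

	/* Tell L0 to install gfn -> pfn; L0 owns the real stage-2 tables. */
	ret = pv_map_hypercall(gfn, m->pfn, m->prot);
	if (ret)
		xa_erase(mappings, gfn);

	return ret;
}

i.e. L1 only keeps a software view of the L2 mappings and never touches
EPT/NPT formats, which is what makes the L0 side vendor-agnostic.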
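
And for comparison, the rough shape of the pKVM-on-ARM fault path I
referred to in [1], paraphrased from memory rather than quoted: apart
from __pkvm_host_map_guest, kvm_call_hyp_nvhe() and kvm->mmu_lock the
names below are my placeholders, and the page pinning that happens
before taking the lock is omitted. The point is only that the hypercall
and the RB-tree insertion are serialized by the per-VM write lock.

#include <linux/kvm_host.h>
#include <linux/rbtree.h>

struct pinned_guest_page {		/* placeholder type */
	struct rb_node node;
	struct page *page;
	u64 gfn;
};

/* Placeholder for the RB-tree insertion done in the real code. */
static void ppage_rb_insert(struct kvm *kvm, struct pinned_guest_page *ppage);

static int map_guest_page_locked(struct kvm *kvm, u64 pfn, u64 gfn,
				 struct pinned_guest_page *ppage)
{
	int ret;

	write_lock(&kvm->mmu_lock);

	/* Ask the hypervisor to install the stage-2 mapping for the guest. */
	ret = kvm_call_hyp_nvhe(__pkvm_host_map_guest, pfn, gfn);
	if (!ret)
		/* Remember the pinned page so it can be unpinned at teardown. */
		ppage_rb_insert(kvm, ppage);

	write_unlock(&kvm->mmu_lock);

	return ret;
}

So two vCPUs faulting on different gfns of the same VM still contend on
kvm->mmu_lock, which is the per-VM bottleneck I mean above.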