On 6/13/23 21:50, Sean Christopherson wrote:
> On Fri, Jun 09, 2023, Dmytro Maluka wrote:
>> Yeah indeed, good point.
>>
>> Is my understanding correct: TLB flush is still gonna be requested by
>> the host VM via a hypercall, but the benefit is that the hypervisor
>> merely needs to do INVEPT?
>
> Maybe? A paravirt paging scheme could do whatever it wanted. The APIs could be
> designed in such a way that L1 never needs to explicitly request a TLB flush,
> e.g. if the contract is that changes must always become immediately visible to L2.
>
> And TLB flushing is but one small aspect of page table shadowing. With PV paging,
> L1 wouldn't need to manage hardware-defined page tables, i.e. could use any arbitrary
> data type. E.g. KVM as L1 could use an XArray to track L2 mappings. And L0 in
> turn wouldn't need to have vendor specific code, i.e. pKVM on x86 (potentially
> *all* architectures) could have a single nested paging scheme for both Intel and
> AMD, as opposed to needing code to deal with the differences between EPT and NPT.
>
> A few months back, I mentally worked through the flows[*] (I forget why I was
> thinking about PV paging), and I'm pretty sure that adapting x86's TDP MMU to
> support PV paging would be easy-ish, e.g. kvm_tdp_mmu_map() would become an
> XArray insertion (to track the L2 mapping) + hypercall (to inform L0 of the new
> mapping).
>
> [*] I even thought of a catchy name, KVM Paravirt Only Paging, a.k.a. KPOP ;-)

Yeap indeed, thanks. (I should have realized myself that it's rather
pointless to use hardware-defined page tables and TLB semantics in L1
if we go full PV.)

In pKVM on ARM [1] it already looks similar to what you described and
is pretty simple: L1 pins the guest page, issues the
__pkvm_host_map_guest hypercall to map it, and remembers it in an
RB-tree so it can unpin it later.

One concern though: can this be done lock-efficiently? For example, in
the pKVM-ARM code in [1] this (hypercall + RB-tree insertion) is done
under a write-locked kvm->mmu_lock, so I assume it is prone to
contention when stage-2 page faults occur simultaneously on multiple
CPUs of the same VM.

In pKVM on Intel we have the same per-VM lock contention issue, though
in L0 (see pkvm_handle_shadow_ept_violation() in [2]), and we are
already seeing a ~50% perf drop caused by it in some benchmarks.

(To be precise, though, eliminating this per-VM write lock would not be
enough to eliminate the contention: on both ARM and x86 there is also
global locking in pKVM in L0 further down the path [3], for different
reasons.)

[1] https://android.googlesource.com/kernel/common/+/d73b3af21fb90f6556383865af6ee16e4735a4a6/arch/arm64/kvm/mmu.c#1341
[2] https://lore.kernel.org/all/20230312180345.1778588-9-jason.cj.chen@xxxxxxxxx/
[3] https://android.googlesource.com/kernel/common/+/d73b3af21fb90f6556383865af6ee16e4735a4a6/arch/arm64/kvm/hyp/nvhe/mem_protect.c#2176
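
To make the "XArray insertion + hypercall" idea a bit more concrete,
here is a minimal sketch of what such a mapping path in L1 could look
like. This is my own illustration, not code from any existing tree:
pv_map_hypercall() and its ABI are made up, only the XArray calls are
the stock kernel API.

#include <linux/xarray.h>
#include <linux/types.h>

struct pv_guest_mapping {
	u64 pfn;	/* host pfn backing the guest page */
	u32 prot;	/* protection bits understood by L0 */
};

/* Hypothetical L1 -> L0 hypercall; a real PV paging ABI would define this. */
int pv_map_hypercall(u64 gfn, u64 pfn, u32 prot);

/* One XArray per VM, indexed by gfn, replaces hardware-format tables in L1. */
static int pv_map_guest_page(struct xarray *mappings, u64 gfn,
			     struct pv_guest_mapping *m)
{
	void *old;
	int ret;

	/* Track the L2 mapping in L1's software-only structure. */
	old = xa_store(mappings, gfn, m, GFP_KERNEL_ACCOUNT);
	if (xa_is_err(old))
		return xa_err(old);

	/* Tell L0 to install gfn -> pfn; L0 owns the real stage-2 tables. */
	ret = pv_map_hypercall(gfn, m->pfn, m->prot);
	if (ret)
		xa_erase(mappings, gfn);

	return ret;
}

i.e. L1 only keeps a software view of the L2 mappings and never touches
EPT/NPT formats, which is what makes the L0 side vendor-agnostic.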
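
And for comparison, the rough shape of the pKVM-on-ARM fault path I
referred to in [1], paraphrased from memory rather than quoted: apart
from __pkvm_host_map_guest, kvm_call_hyp_nvhe() and kvm->mmu_lock the
names below are my placeholders, and the page pinning that happens
before taking the lock is omitted. The point is only that the hypercall
and the RB-tree insertion are serialized by the per-VM write lock.

#include <linux/kvm_host.h>
#include <linux/rbtree.h>

struct pinned_guest_page {		/* placeholder type */
	struct rb_node node;
	struct page *page;
	u64 gfn;
};

/* Placeholder for the RB-tree insertion done in the real code. */
static void ppage_rb_insert(struct kvm *kvm, struct pinned_guest_page *ppage);

static int map_guest_page_locked(struct kvm *kvm, u64 pfn, u64 gfn,
				 struct pinned_guest_page *ppage)
{
	int ret;

	write_lock(&kvm->mmu_lock);

	/* Ask the hypervisor to install the stage-2 mapping for the guest. */
	ret = kvm_call_hyp_nvhe(__pkvm_host_map_guest, pfn, gfn);
	if (!ret)
		/* Remember the pinned page so it can be unpinned at teardown. */
		ppage_rb_insert(kvm, ppage);

	write_unlock(&kvm->mmu_lock);

	return ret;
}

So two vCPUs faulting on different gfns of the same VM still contend on
kvm->mmu_lock, which is the per-VM bottleneck I mean above.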