Re: [RFC PATCH part-5 00/22] VMX emulation

On Fri, Jun 09, 2023, Dmytro Maluka wrote:
> On 6/9/23 04:07, Chen, Jason CJ wrote:
> > I think with the PV design, we can benefit from skipping shadowing.  For example, a TLB
> > flush could be done in the hypervisor directly, whereas shadow EPT has to emulate it by
> > destroying shadow EPT page table entries and then re-shadowing on the next EPT violation.

This is a bit misleading.  KVM has an effective TLB for nested TDP only for 4KiB
pages; larger shadow pages are never allowed to go out-of-sync, i.e. KVM doesn't
wait until L1 does a TLB flush to update SPTEs.  KVM does "unload" roots, e.g. to
emulate INVEPT, but that usually just ends up being an extra slow TLB flush in L0,
because nested TDP SPTEs rarely go unsync in practice.  The patterns for hypervisors
managing VM memory don't typically trigger the types of PTE modifications that
result in unsync SPTEs.

I actually have a (very tiny) patch sitting around somewhere to disable unsync support
when TDP is enabled.  There is a very, very theoretical bug where KVM might fail to
honor the point at which a guest TDP PTE change is architecturally supposed to become
visible, and the simplest fix (by far) is to disable unsync support.  Disabling
TDP+unsync is a viable fix because unsync support is almost never used for nested TDP.
Legacy shadow paging, on the other hand, benefits *significantly* from unsync support,
e.g. when the guest is managing CoW mappings.  I haven't gotten around to posting the
patch to disable unsync on TDP purely because the flaw is almost comically theoretical.
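
As a rough sketch (not the actual patch; the hook and exact condition below are
assumptions, so treat the names as illustrative), the whole thing boils down to
refusing to let shadow pages go unsync whenever TDP is in use:

/*
 * Illustrative only: with TDP enabled, never allow shadow pages to go
 * unsync, so guest TDP PTE changes always take effect synchronously.
 */
int mmu_try_to_unsync_pages(struct kvm *kvm, const struct kvm_memory_slot *slot,
			    gfn_t gfn, bool can_unsync, bool prefetch)
{
	if (!can_unsync || tdp_enabled)
		return -EPERM;

	/* ... existing unsync logic ... */
	return 0;
}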

Anyways, the point is that the TLB flushing side of nested TDP isn't all that
interesting.

> Yeah indeed, good point.
> 
> Is my understanding correct: TLB flush is still gonna be requested by
> the host VM via a hypercall, but the benefit is that the hypervisor
> merely needs to do INVEPT?

Maybe?  A paravirt paging scheme could do whatever it wanted.  The APIs could be
designed in such a way that L1 never needs to explicitly request a TLB flush,
e.g. if the contract is that changes must always become immediately visible to L2.
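
To make that concrete, here's a purely hypothetical ABI sketch (every name below is
invented for illustration, not a proposal): the contract would simply be that every
operation is fully visible to L2 by the time the hypercall returns, so there is no
separate "flush" operation at all.

#include <linux/types.h>

/* Hypothetical PV paging hypercall ABI; all names are made up. */
enum pv_tdp_op {
	PV_TDP_MAP,	/* install an L2 GPA -> PA mapping */
	PV_TDP_UNMAP,	/* remove a mapping; L0 flushes TLBs before returning */
	PV_TDP_PROTECT,	/* change protections; likewise flushed before returning */
};

struct pv_tdp_req {
	__u64 l2_gpa;	/* L2 guest-physical address */
	__u64 pa;	/* backing address provided by L1 */
	__u64 size;
	__u32 prot;
	__u32 op;	/* enum pv_tdp_op */
};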

And TLB flushing is but one small aspect of page table shadowing.  With PV paging,
L1 wouldn't need to manage hardware-defined page tables, i.e. could use any arbitrary
data type.  E.g. KVM as L1 could use an XArray to track L2 mappings.  And L0 in
turn wouldn't need to have vendor specific code, i.e. pKVM on x86 (potentially
*all* architectures) could have a single nested paging scheme for both Intel and
AMD, as opposed to needing code to deal with the differences between EPT and NPT.

A few months back, I mentally worked through the flows[*] (I forget why I was
thinking about PV paging), and I'm pretty sure that adapting x86's TDP MMU to
support PV paging would be easy-ish, e.g. kvm_tdp_mmu_map() would become an
XArray insertion (to track the L2 mapping) + hypercall (to inform L0 of the new
mapping).
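
Everything below is a made-up sketch assuming the hypothetical ABI above
(kvm->arch.pv_tdp_mappings and kvm_pv_tdp_hypercall() are invented), but the map path
could plausibly be as simple as:

static int pv_tdp_map(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn, u32 prot)
{
	struct pv_tdp_req req = {
		.l2_gpa	= gfn_to_gpa(gfn),
		.pa	= (u64)pfn << PAGE_SHIFT,
		.size	= PAGE_SIZE,
		.prot	= prot,
		.op	= PV_TDP_MAP,
	};
	int r;

	/* Track the L2 mapping in plain software state, no hardware format. */
	r = xa_err(xa_store(&kvm->arch.pv_tdp_mappings, gfn, xa_mk_value(pfn),
			    GFP_KERNEL_ACCOUNT));
	if (r)
		return r;

	/* Tell L0 to install the mapping; it's visible to L2 on return. */
	return kvm_pv_tdp_hypercall(&req);
}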

[*] I even thought of a catchy name, KVM Paravirt Only Paging, a.k.a. KPOP ;-)


