RE: [PATCH 0/4]: KVM: nVMX: VPID optimizations

"Christopherson, Sean J" <sean.j.christopherson@xxxxxxxxx> · Fri, 1 Jun 2018 18:25:23 +0000

On Thu, 2018-05-31, Liran Alon wrote:
> ----- pbonzini@xxxxxxxxxx wrote:
> 
> On 31/05/2018 14:52, Liran Alon wrote:
> > 
> > ----- pbonzini@xxxxxxxxxx wrote:
> > 
> >> On 23/05/2018 16:44, Sean Christopherson wrote:
> >>> Unloading the MMU on nested entry/exit doesn't seem deliberate,
> >>> e.g. why bother with VPID handling in prepare_vmcs02() if KVM
> >>> intends to unconditionally flush?  I think figuring out how to
> >>> avoid unloading the MMU in those cases will resolve the issue
> >>> of the TLB being flushed on every switch between L1 and L2,
> >>> though I get the feeling that that will mean doing a holistic
> >>> analysis of the (nested) MMU handling.
> >>
> >> My plan there was just to add a third kvm_mmu struct, which (as a 
> >> start) requires getting rid of all references to vcpu->arch.mmu in
> > 
> > >> arch/x86/kvm/.
> > >>
> > >> Paolo
> > > 
> > > I'm planning to a submit a series to handle this issue.
> > > However, I'm not sure I got your plan. Can you elaborate?
> > > How do you vision adding a third kvm_mmu struct will help resolve
> > this issue?
> > 
> > I would like to avoid the kvm_mmu_reset_context on nested
> > vmentry/vmexit.  Hopefully everything flows from that...
> > 
> > Paolo
> 
> I understand that but there are some aspects to consider here.
> 
> Switching between EPT01 and EPT02 must currently involve setting root_hpa to INVALID such that
> vcpu_enter_guest() -> kvm_mmu_reload() -> kvm_mmu_load() -> vmx_set_cr3() will be called to actually
> switch the EPTP.
> 
> In addition, because EPT02 is shadowing EPT12, there is a need to sync EPT02 to EPT12
> before each load of EPT02 into EPTP. Thus, a VMEntry operation is similar to loading a new
> CR3 value on standard shadow MMU. In contrast, a VMExit operation doesn't need to sync anything
> as EPT01 is in direct-mode. This is why VMEntry indeed calls kvm_mmu_unload() which 
> calls mmu_free_roots(), exactly the same as kvm_set_cr3() calls kvm_mmu_new_cr3() which
> calls mmu_free_roots().
> 
> When running with VPID & EPT enabled, TLB entries are tagged with both VPID and EPTP.
> Therefore, we should not flush TLB when switching between EPTPs. However, currently vmx_set_cr3()
> always flush TLB.
> When digging to why this is done, we will note the following:
> 1) Before VPID existed, every VMEntry/VMExit cause a full TLB flush. Therefore, vmx_set_cr3() didn't
> contain a vmx_flush_tlb().
> 2) On commit which enabled VPID usage in KVM (2384d2b32640 ("KVM: VMX: Enable Virtual Processor
> Identification (VPID)")), a vmx_flush_tlb() was added to vmx_set_cr3(). In addition, a call to
> vpid_sync_vcpu_all() was added to vmx_vcpu_load() when switching a loaded VMCS on physical CPU.
> 
> To me, (2) seems to be not optimized because:
> (a) Any shadow MMU sync/zap operation should request a KVM_REQ_TLB_FLUSH. In addition, any
> shadow CR3 change should also result in a shadow MMU sync which will request a KVM_REQ_TLB_FLUSH.
> Thus, no need to rely on flush that happens as a result of vmx_set_cr3() called because root_hpa is INVALID.

In a nested scenario, I don't think we can rely on a SPTE zap to correctly
flush the correct TLB context.  In fact, I think we should actually skip the
flush entirely in this scenario.  E.g. an EPT violation in L1 due to a write
to L2's (shadowed) EPT tables will zap the SPTE and do kvm_flush_remote_tlbs().
vCPUs that are currently executing L1 will process the flush request on L1's
EPT, not L2's EPT (or VPID).  Flushing L1 is completely unnecessary, and it's
L1's responsibility to do a TLB shootdown as needed.  Assuming L1 isn't buggy,
L1 will trigger INVEPT on all L1 vCPUs, which is when L0 should actually flush
L2's EPT context.

> (b) When switching a loaded VMCS, we need to INVVPID single-context only in case new VMCS VPID was already
> used by another VMCS which was run on this CPU. However, because vmx_vpid_bitmap is global, we can
> just make sure to INVVPID single-context a VPID on all CPUs when a vCPU is destroyed and a call to
> free_vpid() is made.

Makes sense.

> (c) Similarly, because EPTP are globally unique, INVEPT should be used today only in case MMU requested
> a KVM_REQ_TLB_FLUSH as a result of EPT entries changes or because a EPT was freed (zap) which
> should also result in KVM_REQ_TLB_FLUSH. Thus, we should also not need to INVEPT on vmx_set_cr3().

I agree with everything stated for (2c), but removing INVEPT from vmx_set_cr3()
is going to require a full audit of the TLB flushing code.  There are blatant
bugs related to TLB flushing that are likely being masked by the fact that
changing the EPTP always results in flushing the new EPT context.  Or maybe
a better way to look at it is that __vmx_flush_tlb() was added to vmx_set_cr3()
in order to workaround some of those bugs (e.g. a nested shadow SPTE zap),
which in turn lead to more bugs due to __vmx_flush_tlb() being optimized in
weird ways *because* vmx_set_cr3() always flushes the new context.

E.g. "f0b98c02c18a (kvm: vmx: Don't use INVVPID when EPT is enabled, 2017-03-15)"
modified __vmx_flush_tlb() so that KVM never actually does INVVPID when EPT
is enabled.  As a result, the VPID handling in prepare_vmcs02() is completely
broken as it's effectively flushing L1's EPT context instead of L2's VPID
context (the VPID handling runs before we load L2's nested context).  Even
handle_invvpid() does an INVEPT, which is logically wrong even if it doesn't
cause functional problems.  The only exception is for the recently added
paravirtualized TLB flush case where KVM is emulating a full flush of the
guest's TLB and so explicitly does an INVVPID.

And as mentioned above, the SPTE zap flow doesn't actually guarantee L2's EPT
is flushed.  This is ok (in theory) since L0 intercepts L1's INVVPID and INVEPT,
but unfortunately both handle_invvpid() and handle_invept() are broken, e.g.
today they both flush L1's EPT.  handle_invvpid() is fundamentally correct,
i.e. it intends to flush vpid02, it just wasn't updated correctly as part of
"f0b98c02c18a (kvm: vmx: Don't use INVVPID when EPT is enabled, 2017-03-15)".
handle_invept() on the other hand appears to be have been broken since it was
introduced, e.g. it has always done a simple KVM_REQ_TLB_FLUSH, as opposed to
flushing L2's associated shadow EPT (if L1 is target a specific context) or
L2's VPID (if VPID is enabled, probably have to resort to a full EPT flush
without VPID).

> Therefore, I think the real question is why vmx_flush_tlb() is done in vmx_set_cr3() to begin with.
> If it would not do so, TLB wouldn't be flushed when switching between EPT01 and EPT02 and vice-versa.
> Even if kvm_mmu_unload() is called (similar to as done when switching CR3 values on standard shadow MMU).
> 
> Thoughts?

I like your idea a lot more than simply trying to avoid an MMU reset when
switching between L1 and L2.  Avoiding the reset seems like it would be
yet another workaround to a fundamentally broken system.  Nested paging,
and the nested EPT+VPID case in particular, has a lot of potential for
eliminating unnecessary TLB flushes.

> -Liran
>