On Wed, Apr 26, 2023, Paolo Bonzini wrote:
> On Wed, Apr 26, 2023 at 10:02 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > This is probably the sub-PR for which I'm more interested in giving
> > > the code a closer look, but this is more about understanding the
> > > changes than it is about expecting something bad in it.
> >
> > 100% agree.  If you were to scrutinize only one thing for 6.4, the vNMI changes
> > are definitely my choice for extra eyeballs.
>
> Interesting read. The split commits left me wondering _why_ patches
> 1-7 were needed for vNMI, but that's a known limitation of losing the
> cover letter, and the Link or Message-Id trailers try to amend for
> that.
>
> I have a few comments indeed, most of which are absolutely nits and
> can be ignored or fixed as follow-ups. It's my turn to send a "belated
> review" patch series, which I'll do for -rc2, but please check if
> there are any disagreements.
>
> First of all, this comment caught my attention:
>
> +	/*
> +	 * Rules for synchronizing int_ctl bits from vmcb02 to vmcb01:
> +	 *
> +	 * V_IRQ, V_IRQ_VECTOR, V_INTR_PRIO_MASK, V_IGN_TPR:  If L1 doesn't
> +	 * intercept interrupts, then KVM will use vmcb02's V_IRQ (and related
> +	 * flags) to detect interrupt windows for L1 IRQs (even if L1 uses
> +	 * virtual interrupt masking).  Raise KVM_REQ_EVENT to ensure that
> +	 * KVM re-requests an interrupt window if necessary, which implicitly
> +	 * copies this bits from vmcb02 to vmcb01.
> +	 *
> +	 * V_TPR: If L1 doesn't use virtual interrupt masking, then L1's vTPR
> +	 * is stored in vmcb02, but its value doesn't need to be copied from/to
> +	 * vmcb01 because it is copied from/to the virtual APIC's TPR register
> +	 * on each VM entry/exit.
> +	 *
> +	 * V_GIF: If nested vGIF is not used, KVM uses vmcb02's V_GIF for L1's
> +	 * V_GIF.  However, GIF is architecturally clear on each VM exit, thus
> +	 * there is no need to copy V_GIF from vmcb02 to vmcb01.
> +	 */
>
> "Rules for synchronizing int_ctl bits from vmcb02 to vmcb01" suggests
> that this is work done here, and it also misled me into looking at
> nested_sync_control_from_vmcb02 (which is instead about vmcb02 ->
> vmcb12).

+1.  I had a similar reaction when I first saw the code, but learned to live
with it after a few versions :-)

> So what about replacing it with
>
> * int_ctl bits from vmcb02 have to be synchronized to both vmcb12
> and vmcb01.
> * The former is in nested_sync_control_from_vmcb02, invoked on every vmexit,
> * while the latter is scattered all over the place:
>
> and perhaps also call out nested_svm_virtualize_tpr(), sync_lapic_to_cr8() and
> sync_cr8_to_lapic in the V_TPR part?
>
> Another super small thing which is not worth a respin (would have been):
>
>   Subject: [PATCH 05/13] KVM: x86: Raise an event request when processing NMIs
>   - if an NMI is pending
>   + iff an NMI is pending
>
> The "if" suggests that we were missing an event request, while "iff"
> suggests that we were doing them unnecessarily.

Gah, surprised I didn't catch that, I do love me some "iff".

> As an aside, I like the "coding style violation" of commit
> 400fee8c9b2d.  Because the "limit = 2" case is anti-architectural, it
> makes more sense to have it as an "else" rather than as the default.
> An alternative could have been:
>
>   unsigned limit = 1;
>   if (!static_call(kvm_x86_get_nmi_mask)(vcpu) && !vcpu->arch.nmi_injected)
>       limit = 2;
>
> but the ugly condition makes this solution worse.
>
> Next on, commit ab2ee212a57b ("KVM: x86: Save/restore all NMIs when
> multiple NMIs are pending").
> Here, "allows userspace to restore 255
> pending NMIs" in the commit message is kinda scary, and I thought
> about following up with a fixlet to KVM_SET_VCPU_EVENTS:
>
> +		events->nmi.pending = min(vcpu->arch.nmi_pending, 2);
> 		vcpu->arch.nmi_pending = 0;
> 		atomic_set(&vcpu->arch.nmi_queued, events->nmi.pending);
> 		kvm_make_request(KVM_REQ_NMI, vcpu);
>
> but really this isn't needed because process_nmi() does have
>
> 	vcpu->arch.nmi_pending = min(vcpu->arch.nmi_pending, limit);
>
> So in the end this is also fine, just a remark on the commit message.

Even restoring 255 NMIs would be fine from KVM's perspective.  The guest might
not be happy, but that's likely true if there are _any_ spurious NMIs.  IIRC, I
didn't call out the process_nmi() behavior because having 255 pending virtual
NMIs doesn't put KVM at risk any more than does having 2 pending virtual NMIs.

> May be worth an additional comment instead here in
> KVM_SET_VCPU_EVENTS, before the atomic_set().
>
> On to the actual vNMI patch:
>
> +	/*
> +	 * KVM should never request an NMI window when vNMI is enabled, as KVM
> +	 * allows at most one to-be-injected NMI and one pending NMI, i.e. if
> +	 * two NMIs arrive simultaneously, KVM will inject one and set
> +	 * V_NMI_PENDING for the other.  WARN, but continue with the standard
> +	 * single-step approach to try and salvage the pending NMI.
> +	 */
> +	WARN_ON_ONCE(is_vnmi_enabled(svm));
>
> Understandable, but also scary. :)  I am not sure losing a pending NMI
> is a big deal.  IIRC the "limit = 2" case only matters because Windows
> uses an NMI shootdown when rebooting the system and in some cases it
> would hang; but in this case we're in a buggy situation.  And it means
> having to think about how the IRET+single-step method interacts with
> vNMI, and what is the meaning of !svm->awaiting_iret_completion
> (tested right below) in this buggy case.  I'd just "return" here.
Heh, Santosh originally had it return and I had the opposite reaction: why bail
and *guarantee* problems for the guest, instead of continuing on and *maybe*
causing problems for the guest.

 : Last thought, unless there's something that will obviously break, it's probably
 : better to WARN and continue than to bail.  I.e. do the single-step and hope for
 : the best.  Bailing at this point doesn't seem like it would help.

I don't have a super strong preference.  As you said, KVM is already in a buggy
scenario, though my vote is still to carry on.

https://lkml.kernel.org/r/Y9mwz%2FG6%2BG8NSX3%2B%40google.com

> And another small nit to conclude - kvm_get_nr_pending_nmis() could be static.
>
> The only thing that leaves me a bit puzzled is the naming and
> rationale of get_vnmi_vmcb_l1().  I'll let you or Santosh clarify that.

Ah, I think what happened is that I complained about is_vnmi_enabled() being
misleading (https://lore.kernel.org/all/Y9m0q31NBmsnhVGD@xxxxxxxxxx), but
instead of renaming the top-level helper, Santosh added an inner helper and
here we are.

Re-reading what I wrote, and looking at the code with fresh eyes, I don't think
I agree with past me regarding the name of is_vnmi_enabled().  My biggest
objection/confusion with the original code was the comment saying vNMI was
"inhibited".  Appending "for_l1()" makes the usage in the callers quite
confusing.
So my vote is to do:

  static inline bool is_vnmi_enabled(struct vcpu_svm *svm)
  {
	if (!vnmi)
		return false;

	if (is_guest_mode(&svm->vcpu))
		return false;

	return !!(svm->vmcb01.ptr->control.int_ctl & V_NMI_ENABLE_MASK);
  }

---
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index f44751dd8d5d..5b604565d4b3 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -556,25 +556,15 @@ static inline bool is_x2apic_msrpm_offset(u32 offset)
 	       (msr < (APIC_BASE_MSR + 0x100));
 }
 
-static inline struct vmcb *get_vnmi_vmcb_l1(struct vcpu_svm *svm)
+static inline bool is_vnmi_enabled(struct vcpu_svm *svm)
 {
 	if (!vnmi)
-		return NULL;
+		return false;
 
 	if (is_guest_mode(&svm->vcpu))
-		return NULL;
-	else
-		return svm->vmcb01.ptr;
-}
-
-static inline bool is_vnmi_enabled(struct vcpu_svm *svm)
-{
-	struct vmcb *vmcb = get_vnmi_vmcb_l1(svm);
-
-	if (vmcb)
-		return !!(vmcb->control.int_ctl & V_NMI_ENABLE_MASK);
-	else
 		return false;
+
+	return !!(svm->vmcb01.ptr->control.int_ctl & V_NMI_ENABLE_MASK);
 }
 
 /* svm.c */