On 2/1/2023 5:52 AM, Sean Christopherson wrote:
> On Tue, Nov 29, 2022, Maxim Levitsky wrote:
>> This patch implements support for injecting pending
>> NMIs via the .kvm_x86_set_hw_nmi_pending using new AMD's vNMI
>
> Wrap closer to 75 chars, and please try to be consistent in how you format
> things like changelogs and comments. It's jarring as a reader when the wrap
> column is constantly changing.
>
>> feature.
>
> Please combine the introduction, usage, and implementation of the new kvm_x86_ops,
> i.e. introduce and use the ops in this patch. It's extremely difficult to review
> the common x86 code that uses the ops without seeing how they're implemented in
> SVM. I believe the overall size/scope of the patch can be kept reasonable by
> introducing some of the common changes in advance of the new ops, e.g. tweaking
> the KVM_SET_VCPU_EVENTS flow.
>
>> Note that the vNMI can't cause a VM exit, which is needed
>> when a nested guest intercepts NMIs.
>
> I can't tell if this is saying "SVM doesn't allow intercepting virtual NMIs", or
> "KVM never enables virtual NMI interception".
>

I think it meant to say that vNMI doesn't need the NMI-window-exiting feature
in order to pend a new virtual NMI. Will reword.

>> Therefore to avoid breaking nesting, the vNMI is inhibited while
>> a nested guest is running and instead, the legacy NMI window
>> detection and delivery method is used.
>
> State what KVM does, don't describe the effects. E.g. "Disable vNMI while running
> L2". When a changelog describes the effects, it's unclear whether the effects are
> new behavior introduced by the patch, hardware behavior, etc...
>
>> While it is possible to passthrough the vNMI if a nested guest
>> doesn't intercept NMIs, such usage is very uncommon, and it's
>> not worth to optimize for.
>
> Can you elaborate on why not? It's not obvious to me that the code will end up
> more complex, and oftentimes omitting code increases the cognitive load on readers,
> i.e. makes things more complex in a way. vNMI is mutually exclusive with NMI
> passthrough, i.e. KVM doesn't need to handle NMI passthrough and vNMI simultaneously.
>
>> Signed-off-by: Santosh Shukla <santosh.shukla@xxxxxxx>
>> Signed-off-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>
>
> SoB chain is wrong. Maxim is credited as the sole Author, i.e. Santosh shouldn't
> have a SoB. Assuming the intent is to attribute both of ya'll, this needs to be
>
>   Co-developed-by: Santosh Shukla <santosh.shukla@xxxxxxx>
>   Signed-off-by: Santosh Shukla <santosh.shukla@xxxxxxx>
>   Signed-off-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>
>
> if Maxim is the primary author, or this if Santosh is the primary author
>
>   From: Santosh Shukla <santosh.shukla@xxxxxxx>
>
>   <blah blah blah>
>
>   Signed-off-by: Santosh Shukla <santosh.shukla@xxxxxxx>
>   Co-developed-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>
>   Signed-off-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>
>

Will sort those in v3.

>> ---
>>  arch/x86/kvm/svm/nested.c |  42 +++++++++++++++
>>  arch/x86/kvm/svm/svm.c    | 111 ++++++++++++++++++++++++++++++--------
>>  arch/x86/kvm/svm/svm.h    |  10 ++++
>>  3 files changed, 140 insertions(+), 23 deletions(-)
>>
>> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
>> index e891318595113e..5bea672bf8b12d 100644
>> --- a/arch/x86/kvm/svm/nested.c
>> +++ b/arch/x86/kvm/svm/nested.c
>> @@ -623,6 +623,42 @@ static bool is_evtinj_nmi(u32 evtinj)
>>  	return type == SVM_EVTINJ_TYPE_NMI;
>>  }
>>
>> +static void nested_svm_save_vnmi(struct vcpu_svm *svm)
>
> Please avoid save/restore names. KVM (selftests in particular) uses save/restore
> to refer to saving and restoring state across a migration. "sync" is probably
> the best option, or just open code the flows.
>

Ok. I chose to open code it; that way I don't need to consider using
svm->nmi_masked, which should only be used for the non-vNMI case.

>> +{
>> +	struct vmcb *vmcb01 = svm->vmcb01.ptr;
>> +
>> +	/*
>> +	 * Copy the vNMI state back to software NMI tracking state
>> +	 * for the duration of the nested run
>> +	 */
>> +
>> +	svm->nmi_masked = vmcb01->control.int_ctl & V_NMI_MASK;
>> +	svm->vcpu.arch.nmi_pending += vmcb01->control.int_ctl & V_NMI_PENDING;
>
> This is wrong. V_NMI_PENDING is bit 11, i.e. the bitwise-AND does not yield a
> boolean value and will increment nmi_pending by 2048 instead of by 1.
>

Right.

> 	if (vmcb01->control.int_ctl & V_NMI_PENDING)
> 		svm->vcpu.arch.nmi_pending++;
>
> And this needs a KVM_REQ_EVENT to ensure KVM processes the newly pending NMI.
>

Ok.
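Folding in both fixes, the open-coded sync on nested VMRUN that I have in mind
looks roughly like the below (just a sketch against the vmcb01 fields this
patch already uses, not final code):

	struct vmcb *vmcb01 = svm->vmcb01.ptr;

	/* Move a pending vNMI into KVM's software tracking for the nested run. */
	if (vmcb01->control.int_ctl & V_NMI_PENDING) {
		svm->vcpu.arch.nmi_pending++;
		kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
	}

	svm->nmi_masked = !!(vmcb01->control.int_ctl & V_NMI_MASK);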
>> +}
>> +
>> +static void nested_svm_restore_vnmi(struct vcpu_svm *svm)
>> +{
>> +	struct kvm_vcpu *vcpu = &svm->vcpu;
>> +	struct vmcb *vmcb01 = svm->vmcb01.ptr;
>> +
>> +	/*
>> +	 * Restore the vNMI state from the software NMI tracking state
>> +	 * after a nested run
>> +	 */
>> +
>> +	if (svm->nmi_masked)
>> +		vmcb01->control.int_ctl |= V_NMI_MASK;
>> +	else
>> +		vmcb01->control.int_ctl &= ~V_NMI_MASK;
>
> As proposed, this needs to clear nmi_masked to avoid false positives. The better
> approach is to not have any open coded accesses to svm->nmi_masked outside of
> flows that specifically need to deal with vNMI logic.
>
> E.g. svm_enable_nmi_window() reads the raw nmi_masked.
>

Ok.

>> +
>> +	if (vcpu->arch.nmi_pending) {
>> +		vcpu->arch.nmi_pending--;
>> +		vmcb01->control.int_ctl |= V_NMI_PENDING;
>> +	} else
>
> Curly braces on all paths if any path needs 'em.
>

Ok.

>> +		vmcb01->control.int_ctl &= ~V_NMI_PENDING;
>> +}
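The open-coded mirror on nested VM-exit would then be along these lines (again
only a sketch; it clears the software mask per your comment above and uses
braces on both paths):

	struct kvm_vcpu *vcpu = &svm->vcpu;
	struct vmcb *vmcb01 = svm->vmcb01.ptr;

	if (svm->nmi_masked) {
		vmcb01->control.int_ctl |= V_NMI_MASK;
		/* Drop the software mask so it can't cause false positives later. */
		svm->nmi_masked = false;
	} else {
		vmcb01->control.int_ctl &= ~V_NMI_MASK;
	}

	if (vcpu->arch.nmi_pending) {
		vcpu->arch.nmi_pending--;
		vmcb01->control.int_ctl |= V_NMI_PENDING;
	} else {
		vmcb01->control.int_ctl &= ~V_NMI_PENDING;
	}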
>
> ...
>
>> +static bool svm_set_hw_nmi_pending(struct kvm_vcpu *vcpu)
>> +{
>> +	struct vcpu_svm *svm = to_svm(vcpu);
>> +
>> +	if (!is_vnmi_enabled(svm))
>> +		return false;
>> +
>> +	if (svm->vmcb->control.int_ctl & V_NMI_PENDING)
>> +		return false;
>> +
>> +	svm->vmcb->control.int_ctl |= V_NMI_PENDING;
>> +	vmcb_mark_dirty(svm->vmcb, VMCB_INTR);
>> +
>> +	/*
>> +	 * NMI isn't yet technically injected but
>> +	 * this rough estimation should be good enough
>
> Again, wrap at 80 chars, not at random points.
>

Ok.

>> +	 */
>> +	++vcpu->stat.nmi_injections;
>> +
>> +	return true;
>> +}
>> +
>
> ...
>
>>  bool svm_interrupt_blocked(struct kvm_vcpu *vcpu)
>>  {
>>  	struct vcpu_svm *svm = to_svm(vcpu);
>> @@ -3725,10 +3772,16 @@ static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
>>  	/*
>>  	 * Something prevents NMI from been injected. Single step over possible
>>  	 * problem (IRET or exception injection or interrupt shadow)
>> +	 *
>> +	 * With vNMI we should never need an NMI window
>> +	 * (we can always inject vNMI either by setting VNMI_PENDING or by EVENTINJ)
>
> Please honor the soft limit and avoid pronouns. There's also no need to put the
> blurb in parentheses on its own line.
>
> As for the code, I believe there are bugs. Pulling in the context...
>
> 	if (svm->nmi_masked && !svm->awaiting_iret_completion)
> 		return; /* IRET will cause a vm exit */
>
> Checking nmi_masked is wrong, this should use the helper. Even if this code can
> only be reached on error, it should still try its best to not make things worse.

Right.

> 	if (!gif_set(svm)) {
> 		if (vgif)
> 			svm_set_intercept(svm, INTERCEPT_STGI);
> 		return; /* STGI will cause a vm exit */
> 	}
>
> 	/*
> 	 * Something prevents NMI from been injected. Single step over possible
> 	 * problem (IRET or exception injection or interrupt shadow)
> 	 *
> 	 * With vNMI we should never need an NMI window
> 	 * (we can always inject vNMI either by setting VNMI_PENDING or by EVENTINJ)
> 	 */
> 	if (WARN_ON_ONCE(is_vnmi_enabled(svm)))
> 		return;
>
> This is flawed, where "this" means handling of NMI windows when vNMI is enabled.
>
> IIUC, if there are back-to-back NMIs, the intent is to set V_NMI for one and
> inject the other. I believe there are multiple bugs in svm_inject_nmi(). The one
> that's definitely a bug is setting svm->nmi_masked. The other suspected bug,
> which is related to the above WARN, is setting the IRET intercept. The resulting
> IRET interception will set awaiting_iret_completion, and then the above WARN is
> reachable (even if the masking check is fixed).
>
> I don't think KVM needs to ever intercept IRET. One NMI gets injected, and the
> other is sitting in INT_CTL.V_NMI_PENDING, i.e. there's no need for KVM to regain
> control. If another NMI comes along before V_NMI_PENDING is handled, it will
> either get injected or dropped.
>
> So I _think_ KVM can skip the intercept code when injecting an NMI, and this WARN
> can be hoisted to the top of svm_enable_nmi_window(), because as stated above, KVM
> should never need to request an NMI window.
>
> Last thought, unless there's something that will obviously break, it's probably
> better to WARN and continue than to bail. I.e. do the single-step and hope for
> the best. Bailing at this point doesn't seem like it would help.
>
>>  	 */
>> +	if (WARN_ON_ONCE(is_vnmi_enabled(svm)))
>> +		return;
>> +
>>  	svm->nmi_singlestep_guest_rflags = svm_get_rflags(vcpu);
>> -	svm->nmi_singlestep = true;
>>  	svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
>> +	svm->nmi_singlestep = true;
>>  }
>

Ok. So you mean: in vNMI mode KVM should never need to request an NMI window,
and if svm_enable_nmi_window() is reached anyway, WARN and continue to single
step. The modified code would look something like below:

static void svm_enable_nmi_window(struct kvm_vcpu *vcpu)
{
	struct vcpu_svm *svm = to_svm(vcpu);

	/*
	 * With vNMI, KVM should never need an NMI window. If this is reached
	 * anyway, WARN and continue to single step.
	 */
	WARN_ON_ONCE(is_vnmi_enabled(svm));

	if (svm_get_nmi_mask(vcpu) && !svm->awaiting_iret_completion)
		return; /* IRET will cause a vm exit */

	if (!gif_set(svm)) {
		if (vgif)
			svm_set_intercept(svm, INTERCEPT_STGI);
		return; /* STGI will cause a vm exit */
	}

	/*
	 * Something prevents NMI from been injected. Single step over possible
	 * problem (IRET or exception injection or interrupt shadow)
	 */
	svm->nmi_singlestep_guest_rflags = svm_get_rflags(vcpu);
	svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
	svm->nmi_singlestep = true;
}
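And following the same reasoning, svm_inject_nmi() would stop setting
nmi_masked and stop intercepting IRET when vNMI is in use; the tail of it
could be guarded roughly like the below (only a sketch, the exact surrounding
code may differ):

	svm->vmcb->control.event_inj = SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_NMI;

	/*
	 * With vNMI, hardware tracks NMI masking and a second NMI sits in
	 * V_NMI_PENDING, so KVM doesn't need the IRET intercept to find the
	 * end of the NMI window.
	 */
	if (!is_vnmi_enabled(svm)) {
		svm->nmi_masked = true;
		if (!sev_es_guest(vcpu->kvm))
			svm_set_intercept(svm, INTERCEPT_IRET);
	}

	++vcpu->stat.nmi_injections;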
Does that make sense?

Thanks,
Santosh

> ...
>
>> @@ -553,6 +554,15 @@ static inline bool is_x2apic_msrpm_offset(u32 offset)
>>  	       (msr < (APIC_BASE_MSR + 0x100));
>>  }
>>
>> +static inline bool is_vnmi_enabled(struct vcpu_svm *svm)
>> +{
>> +	/* L1's vNMI is inhibited while nested guest is running */
>> +	if (is_guest_mode(&svm->vcpu))
>
> I would rather check the current VMCB. I don't see any value in hardcoding the
> "KVM doesn't support vNMI in L2" in multiple places. And I find the above comment
> about "L1's vNMI is inhibited" confusing. vNMI isn't inhibited/blocked, KVM just
> doesn't utilize vNMI while L2 is active (IIUC, as proposed).
>
>> +		return false;
>> +
>> +	return !!(svm->vmcb01.ptr->control.int_ctl & V_NMI_ENABLE);
>> +}
>> +
>>  /* svm.c */
>>  #define MSR_INVALID			0xffffffffU
>>
>> --
>> 2.26.3
>>
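Got it. Checking the VMCB KVM is currently running on would then be enough,
since vmcb02 never gets V_NMI_ENABLE while vNMI stays disabled for L2. Roughly
like this (sketch only):

static inline bool is_vnmi_enabled(struct vcpu_svm *svm)
{
	/*
	 * vmcb02 never enables vNMI (as proposed), so checking the current
	 * VMCB naturally returns false while L2 is active, without a
	 * hardcoded is_guest_mode() check.
	 */
	return !!(svm->vmcb->control.int_ctl & V_NMI_ENABLE);
}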