Re: [PATCH v6 2/3] KVM: SVM: Add Idle HLT intercept support

Hi Sean,

Thank you for reviewing my patches.

On 2/26/2025 3:37 AM, Sean Christopherson wrote:
> On Tue, Jan 28, 2025, Manali Shukla wrote:
>> From: Manali Shukla <Manali.Shukla@xxxxxxx>
>>
>> The hypervisor can intercept the HLT instruction by setting the
>> HLT-Intercept Bit in VMCB, causing a VMEXIT. This can be wasteful if
>> there are pending V_INTR and V_NMI events, as the hypervisor must then
>> initiate a VMRUN to handle them.
>>
>> If the HLT-Intercept Bit is cleared and the vCPU executes HLT while
>> there are pending V_INTR and V_NMI events, the hypervisor won’t detect
>> them, potentially causing indefinite suspension of the vCPU. This poses
>> a problem for enlightened guests that wish to securely handle the
>> events.
>>
>> For Secure AVIC scenarios, if a guest does a HLT while an interrupt is
>> pending (in IRR), the hypervisor does not have a way to figure out
>> whether the guest needs to be re-entered, as it cannot read the guest
>> backing page.  The Idle HLT intercept feature allows the hypervisor to
>> intercept HLT execution only if there are no pending V_INTR and V_NMI
>> events.
>>
>> There are two use cases for the Idle HLT intercept feature:
>> - Secure VMs that wish to handle pending events securely without exiting
>>   to the hypervisor on HLT (Secure AVIC).
> 
> I honestly don't see any reason to talk about Secure AVIC.  It takes a *lot* of
> background reading to understand how Secure AVIC actually works with Idle HLT.
> Critically, it requires a *reduction* in acceleration relative to what the APM
> calls "legacy" AVIC (lol), in that cross-vCPU IPIs are *never* accelerated.
> Ignoring device posted interrupts, lack of IPI virtualization means that KVM
> always has visibility into *new* interrupts.  The only blind spot is self-IPI and
> interrupts that were already made pending by KVM.
> 
>> - Optimization for all the VMs to avoid a wasteful VMEXIT during HLT
>>   when there are pending events.
>>
>> On discovering the Idle HLT Intercept feature, the KVM hypervisor sets
>> the Idle HLT Intercept bit (bit 6 at offset 0x14) in the VMCB. When the
>> Idle HLT Intercept bit is set, the HLT Intercept bit (bit 0 at offset
>> 0xf) should be cleared.
>>
>> Before entering the HLT state, the HLT instruction performs checks in
>> the following order:
>> - The HLT intercept check: if set, it unconditionally triggers
>>   SVM_EXIT_HLT (0x78).
>> - The Idle HLT intercept check: if set and there are no pending V_INTR
>>   or V_NMI events, it triggers SVM_EXIT_IDLE_HLT (0xA6).
>>
>> Details about the Idle HLT intercept feature can be found in AMD APM [1].
> 
> This is not a useful changelog.  While I do care about AMD's motivation for adding
> Idle HLT, it's far, far more important that the changelog cover (a) what is (and
> isn't) being changed in KVM, (b) what edge cases are (or aren't) being handled,
> and (c) why the KVM implementation is correct.
> 
>>
>> [1]: AMD64 Architecture Programmer's Manual Pub. 24593, April
>>      2024, Vol 2, 15.9 Instruction Intercepts (Table 15-7: IDLE_HLT).
>>      https://bugzilla.kernel.org/attachment.cgi?id=306250
> 
> *sigh*
> 
> The inscrutable reference to the APM is longer than the table entry.  Just copy
> the table entry.
> 
>   When both HLT and Idle HLT intercepts are active at the same time, the HLT
>   intercept takes priority. This intercept occurs only if a virtual interrupt
>   is not pending (V_INTR or V_NMI).
> 
>> Signed-off-by: Manali Shukla <Manali.Shukla@xxxxxxx>
>> Reviewed-by: Nikunj A Dadhania <nikunj@xxxxxxx>
>> ---
>>  arch/x86/include/asm/svm.h      |  1 +
>>  arch/x86/include/uapi/asm/svm.h |  2 ++
>>  arch/x86/kvm/svm/svm.c          | 13 ++++++++++---
>>  3 files changed, 13 insertions(+), 3 deletions(-)
> 
> ...
> 
>> @@ -5225,6 +5230,8 @@ static __init void svm_set_cpu_caps(void)
>>  		if (vnmi)
>>  			kvm_cpu_cap_set(X86_FEATURE_VNMI);
>>  
>> +		kvm_cpu_cap_check_and_set(X86_FEATURE_IDLE_HLT);
> 
> I am 99% certain this is wrong.  Or at the very least, severely lacking an
> explanation of why it's correct.  If L1 enables Idle HLT but not HLT interception,
> then it is KVM's responsibility to NOT exit to L1 if there is a pending V_IRQ or
> V_NMI.
> 
> Yeah, it's buggy.  But, it's buggy in part because *existing* KVM support is buggy.
> If L1 disables HLT exiting, but it's enabled in KVM, then KVM will run L2 with
> HLT exiting and so it becomes KVM's responsibility to check for valid L2 wake events
> prior to scheduling out the vCPU if L2 executes HLT.  E.g. nVMX handles this by
> reading vmcs02.GUEST_INTERRUPT_STATUS.RVI as part of vmx_has_nested_events().  I
> don't see the equivalent in nSVM.
> 
> Amusingly, that means Idle HLT is actually a bug fix to some extent.  E.g. if there
> is a pending V_IRQ/V_NMI in vmcb02, then running with Idle HLT will naturally do
> the right thing, i.e. not hang the vCPU.
> 
> Anyways, for now, I think the easiest and best option is to simply skip full nested
> support for the moment.
> 

Got it, I see the issue you're describing. I'll need to look into it a bit more to
fully understand it, so yes, we can hold off on full nested support for the Idle HLT
intercept for now.

Since we are planning to disable Idle HLT support for nested guests, should we do
something like this?

@@ -167,10 +167,15 @@ void recalc_intercepts(struct vcpu_svm *svm)
        if (!nested_svm_l2_tlb_flush_enabled(&svm->vcpu))
                vmcb_clr_intercept(c, INTERCEPT_VMMCALL);

+       if (!guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_IDLE_HLT))
+               vmcb_clr_intercept(c, INTERCEPT_IDLE_HLT);
+

When recalc_intercepts() copies the intercept values from vmcb01 to vmcb02, it also
copies the Idle HLT intercept bit, which is set to 1 in vmcb01. Normally this isn't a
problem, because the HLT intercept takes priority when it is on. But if the HLT
intercept gets turned off for some reason, the Idle HLT intercept will stay on, which
is not what we want.

Other than that, everything looks good to me. 

> I rewrote the changelog as I was going (I didn't expect to go down so many rabbit
> holes), and ended up with the below.  Please review and let me know if I missed
> any wrinkles and/or got anything wrong.
> 
> ---
> From 05281b3782b7f880b3afd68e0074bf3abf6d55a7 Mon Sep 17 00:00:00 2001
> From: Manali Shukla <Manali.Shukla@xxxxxxx>
> Date: Tue, 28 Jan 2025 12:48:11 +0000
> Subject: [PATCH] KVM: SVM: Add Idle HLT intercept support
> 
> Add support for "Idle HLT" interception on AMD CPUs, and enable Idle HLT
> interception instead of "normal" HLT interception for all VMs for which
> HLT-exiting is enabled.  Idle HLT provides a mild performance boost for
> all VM types, by avoiding a VM-Exit in the scenario where KVM would
> immediately "wake" and resume the vCPU.
> 
> Idle HLT makes HLT-exiting conditional on the vCPU not having a valid,
> unmasked interrupt.  Specifically, a VM-Exit occurs on execution of HLT
> if and only if there are no pending V_IRQ or V_NMI events.  Note, Idle HLT
> is a replacement for full HLT interception, i.e. enabling HLT interception
> would result in all HLT instructions causing unconditional VM-Exits.  Per
> the APM:
> 
>  When both HLT and Idle HLT intercepts are active at the same time, the
>  HLT intercept takes priority. This intercept occurs only if a virtual
>  interrupt is not pending (V_INTR or V_NMI).
> 
> For KVM's use of V_IRQ (also called V_INTR in the APM) to detect interrupt
> windows, the net effect of enabling Idle HLT is that, if a virtual
> interrupt is pending and unmasked at the time of HLT, the vCPU will take
> a V_IRQ intercept instead of a HLT intercept.
> 
> When AVIC is enabled, Idle HLT works as intended: the vCPU continues
> unimpeded and services the pending virtual interrupt.
> 
> Note, the APM's description of V_IRQ interaction with AVIC is quite
> confusing, and requires piecing together implied behavior.  Per the APM,
> when AVIC is enabled, V_IRQ *from the VMCB* is ignored:
> 
>   When AVIC mode is enabled for a virtual processor, the V_IRQ, V_INTR_PRIO,
>   V_INTR_VECTOR, and V_IGN_TPR fields in the VMCB are ignored.
> 
> Which seems to contradict the behavior of Idle HLT:
> 
>   This intercept occurs only if a virtual interrupt is not pending (V_INTR
>   or V_NMI).
> 
> What's not explicitly stated is that hardware's internal copy of V_IRQ
> (and related fields) *are* still active, i.e. are presumably used to cache
> information from the virtual APIC.
> 
> Handle Idle HLT exits as if they were normal HLT exits, e.g. don't try to
> optimize the handling under the assumption that there isn't a pending IRQ.
> Irrespective of AVIC, Idle HLT is inherently racy with respect to the vIRR,
> as KVM can set vIRR bits asynchronously.
> 
> No changes are required to support KVM's use of Idle HLT while running
> L2.  In fact, supporting Idle HLT is actually a bug fix to some extent.
> If L1 wants to intercept HLT, recalc_intercepts() will enable HLT
> interception in vmcb02 and forward the intercept to L1 as normal.
> 
> But if L1 does not want to intercept HLT, then KVM will run L2 with Idle
> HLT enabled and HLT interception disabled.  If a V_IRQ or V_NMI for L2
> becomes pending and L2 executes HLT, then use of Idle HLT will do the
> right thing, i.e. not #VMEXIT and instead deliver the virtual event.  KVM
> currently doesn't handle this scenario correctly, e.g. doesn't check V_IRQ
> or V_NMI in vmcs02 as part of kvm_vcpu_has_events().
> 
> Do not expose Idle HLT to L1 at this time, as supporting nested Idle HLT is
> more complex than just enumerating the feature, e.g. requires KVM to handle
> the aforementioned scenarios of V_IRQ and V_NMI at the time of exit.
> 
> Signed-off-by: Manali Shukla <Manali.Shukla@xxxxxxx>
> Reviewed-by: Nikunj A Dadhania <nikunj@xxxxxxx>
> Link: https://bugzilla.kernel.org/attachment.cgi?id=306250
> Link: https://lore.kernel.org/r/20250128124812.7324-3-manali.shukla@xxxxxxx
> [sean: rewrite changelog, drop nested "support"]
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
> ---
>  arch/x86/include/asm/svm.h      |  1 +
>  arch/x86/include/uapi/asm/svm.h |  2 ++
>  arch/x86/kvm/svm/svm.c          | 11 ++++++++---
>  3 files changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
> index e2fac21471f5..12a9dde1e842 100644
> --- a/arch/x86/include/asm/svm.h
> +++ b/arch/x86/include/asm/svm.h
> @@ -116,6 +116,7 @@ enum {
>  	INTERCEPT_INVPCID,
>  	INTERCEPT_MCOMMIT,
>  	INTERCEPT_TLBSYNC,
> +	INTERCEPT_IDLE_HLT = 166,
>  };
>  
>  
> diff --git a/arch/x86/include/uapi/asm/svm.h b/arch/x86/include/uapi/asm/svm.h
> index 1814b413fd57..ec1321248dac 100644
> --- a/arch/x86/include/uapi/asm/svm.h
> +++ b/arch/x86/include/uapi/asm/svm.h
> @@ -95,6 +95,7 @@
>  #define SVM_EXIT_CR14_WRITE_TRAP		0x09e
>  #define SVM_EXIT_CR15_WRITE_TRAP		0x09f
>  #define SVM_EXIT_INVPCID       0x0a2
> +#define SVM_EXIT_IDLE_HLT      0x0a6
>  #define SVM_EXIT_NPF           0x400
>  #define SVM_EXIT_AVIC_INCOMPLETE_IPI		0x401
>  #define SVM_EXIT_AVIC_UNACCELERATED_ACCESS	0x402
> @@ -224,6 +225,7 @@
>  	{ SVM_EXIT_CR4_WRITE_TRAP,	"write_cr4_trap" }, \
>  	{ SVM_EXIT_CR8_WRITE_TRAP,	"write_cr8_trap" }, \
>  	{ SVM_EXIT_INVPCID,     "invpcid" }, \
> +	{ SVM_EXIT_IDLE_HLT,     "idle-halt" }, \
>  	{ SVM_EXIT_NPF,         "npf" }, \
>  	{ SVM_EXIT_AVIC_INCOMPLETE_IPI,		"avic_incomplete_ipi" }, \
>  	{ SVM_EXIT_AVIC_UNACCELERATED_ACCESS,   "avic_unaccelerated_access" }, \
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 7640a84e554a..37e83bde8f9f 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -1297,8 +1297,12 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
>  		svm_set_intercept(svm, INTERCEPT_MWAIT);
>  	}
>  
> -	if (!kvm_hlt_in_guest(vcpu->kvm))
> -		svm_set_intercept(svm, INTERCEPT_HLT);
> +	if (!kvm_hlt_in_guest(vcpu->kvm)) {
> +		if (cpu_feature_enabled(X86_FEATURE_IDLE_HLT))
> +			svm_set_intercept(svm, INTERCEPT_IDLE_HLT);
> +		else
> +			svm_set_intercept(svm, INTERCEPT_HLT);
> +	}
>  
>  	control->iopm_base_pa = iopm_base;
>  	control->msrpm_base_pa = __sme_set(__pa(svm->msrpm));
> @@ -3342,6 +3346,7 @@ static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>  	[SVM_EXIT_CR4_WRITE_TRAP]		= cr_trap,
>  	[SVM_EXIT_CR8_WRITE_TRAP]		= cr_trap,
>  	[SVM_EXIT_INVPCID]                      = invpcid_interception,
> +	[SVM_EXIT_IDLE_HLT]			= kvm_emulate_halt,
>  	[SVM_EXIT_NPF]				= npf_interception,
>  	[SVM_EXIT_RSM]                          = rsm_interception,
>  	[SVM_EXIT_AVIC_INCOMPLETE_IPI]		= avic_incomplete_ipi_interception,
> @@ -3504,7 +3509,7 @@ int svm_invoke_exit_handler(struct kvm_vcpu *vcpu, u64 exit_code)
>  		return interrupt_window_interception(vcpu);
>  	else if (exit_code == SVM_EXIT_INTR)
>  		return intr_interception(vcpu);
> -	else if (exit_code == SVM_EXIT_HLT)
> +	else if (exit_code == SVM_EXIT_HLT || exit_code == SVM_EXIT_IDLE_HLT)
>  		return kvm_emulate_halt(vcpu);
>  	else if (exit_code == SVM_EXIT_NPF)
>  		return npf_interception(vcpu);
> 
> base-commit: b9cd96a7ff9cc9ddf95de59d69afb174a9e90c6e