Re: [PATCH v6 2/3] KVM: SVM: Add Idle HLT intercept support

On Tue, Jan 28, 2025, Manali Shukla wrote:
> From: Manali Shukla <Manali.Shukla@xxxxxxx>
> 
> The hypervisor can intercept the HLT instruction by setting the
> HLT-Intercept Bit in VMCB, causing a VMEXIT. This can be wasteful if
> there are pending V_INTR and V_NMI events, as the hypervisor must then
> initiate a VMRUN to handle them.
> 
> If the HLT-Intercept Bit is cleared and the vCPU executes HLT while
> there are pending V_INTR and V_NMI events, the hypervisor won’t detect
> them, potentially causing indefinite suspension of the vCPU. This poses
> a problem for enlightened guests who wish to securely handle the
> events.
> 
> For Secure AVIC scenarios, if a guest does a HLT while an interrupt is
> pending (in IRR), the hypervisor does not have a way to figure out
> whether the guest needs to be re-entered, as it cannot read the guest
> backing page.  The Idle HLT intercept feature allows the hypervisor to
> intercept HLT execution only if there are no pending V_INTR and V_NMI
> events.
> 
> There are two use cases for the Idle HLT intercept feature:
> - Secure VMs that wish to handle pending events securely without exiting
>   to the hypervisor on HLT (Secure AVIC).

I honestly don't see any reason to talk about Secure AVIC.  It takes a *lot* of
background reading to understand how Secure AVIC actually works with Idle HLT.
Critically, it requires a *reduction* in acceleration relative to what the APM
calls "legacy" AVIC (lol), in that cross-vCPU IPIs are *never* accelerated.
Ignoring device posted interrupts, lack of IPI virtualization means that KVM
always has visibility into *new* interrupts.  The only blind spot is self-IPI and
interrupts that were already made pending by KVM.

> - Optimization for all the VMs to avoid a wasteful VMEXIT during HLT
>   when there are pending events.
> 
> On discovering the Idle HLT intercept feature, the KVM hypervisor
> sets the Idle HLT intercept bit (bit 6, offset 0x14) in the VMCB.
> When the Idle HLT intercept bit is set, the HLT intercept bit (bit 0,
> offset 0xF) should be cleared.
> 
> Before entering the HLT state, the HLT instruction performs checks in
> the following order:
> - The HLT intercept check: if set, it unconditionally triggers
>   SVM_EXIT_HLT (0x78).
> - The Idle HLT intercept check: if set and there are no pending V_INTR
>   or V_NMI events, it triggers SVM_EXIT_IDLE_HLT (0xA6).
> 
> Details about the Idle HLT intercept feature can be found in AMD APM [1].

This is not a useful changelog.  While I do care about AMD's motivation for adding
Idle HLT, it's far, far more important that the changelog cover (a) what is (and
isn't) being changed in KVM, (b) what edge cases are (or aren't) being handled,
and (c) why the KVM implementation is correct.

> 
> [1]: AMD64 Architecture Programmer's Manual Pub. 24593, April
>      2024, Vol 2, 15.9 Instruction Intercepts (Table 15-7: IDLE_HLT).
>      https://bugzilla.kernel.org/attachment.cgi?id=306250

*sigh*

The inscrutable reference to the APM is longer than the table entry.  Just copy
the table entry.

  When both HLT and Idle HLT intercepts are active at the same time, the HLT
  intercept takes priority. This intercept occurs only if a virtual interrupt
  is not pending (V_INTR or V_NMI).
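
FWIW, the architectural priority is trivial to model (toy code, purely
illustrative, obviously not how hardware or KVM spells any of this):

  #include <stdbool.h>

  enum hlt_action { EXIT_HLT, EXIT_IDLE_HLT, ENTER_HALT };

  /* HLT interception wins; Idle HLT fires only with no pending events. */
  static enum hlt_action hlt_check(bool hlt_icpt, bool idle_hlt_icpt,
  				   bool v_intr_pending, bool v_nmi_pending)
  {
  	if (hlt_icpt)
  		return EXIT_HLT;	/* SVM_EXIT_HLT, 0x78 */
  	if (idle_hlt_icpt && !v_intr_pending && !v_nmi_pending)
  		return EXIT_IDLE_HLT;	/* SVM_EXIT_IDLE_HLT, 0xA6 */
  	return ENTER_HALT;		/* vCPU halts, no exit to KVM */
  }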

> Signed-off-by: Manali Shukla <Manali.Shukla@xxxxxxx>
> Reviewed-by: Nikunj A Dadhania <nikunj@xxxxxxx>
> ---
>  arch/x86/include/asm/svm.h      |  1 +
>  arch/x86/include/uapi/asm/svm.h |  2 ++
>  arch/x86/kvm/svm/svm.c          | 13 ++++++++++---
>  3 files changed, 13 insertions(+), 3 deletions(-)

...

> @@ -5225,6 +5230,8 @@ static __init void svm_set_cpu_caps(void)
>  		if (vnmi)
>  			kvm_cpu_cap_set(X86_FEATURE_VNMI);
>  
> +		kvm_cpu_cap_check_and_set(X86_FEATURE_IDLE_HLT);

I am 99% certain this is wrong.  Or at the very least, severely lacking an
explanation of why it's correct.  If L1 enables Idle HLT but not HLT interception,
then it is KVM's responsibility to NOT exit to L1 if there is a pending V_IRQ or
V_NMI.

Yeah, it's buggy.  But, it's buggy in part because *existing* KVM support is buggy.
If L1 disables HLT exiting, but it's enabled in KVM, then KVM will run L2 with
HLT exiting and so it becomes KVM's responsibility to check for valid L2 wake events
prior to scheduling out the vCPU if L2 executes HLT.  E.g. nVMX handles this by
reading vmcs02.GUEST_INTERRUPT_STATUS.RVI as part of vmx_has_nested_events().  I
don't see the equivalent in nSVM.
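
E.g. a completely untested sketch of what an nSVM equivalent might look like
(svm_has_nested_events() doesn't exist today; to_svm(), is_guest_mode(),
V_IRQ_MASK and V_NMI_PENDING_MASK are the existing definitions, and a real
version would also need to honor what L1 actually intercepts):

  static bool svm_has_nested_events(struct kvm_vcpu *vcpu)
  {
  	struct vcpu_svm *svm = to_svm(vcpu);
  	u32 int_ctl;

  	if (!is_guest_mode(vcpu))
  		return false;

  	/* svm->vmcb points at vmcb02 while L2 is active. */
  	int_ctl = svm->vmcb->control.int_ctl;

  	return (int_ctl & V_IRQ_MASK) || (int_ctl & V_NMI_PENDING_MASK);
  }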

Amusingly, that means Idle HLT is actually a bug fix to some extent.  E.g. if there
is a pending V_IRQ/V_NMI in vmcb02, then running with Idle HLT will naturally do
the right thing, i.e. not hang the vCPU.

Anyways, for now, I think the easiest and best option is to simply skip full nested
support for the moment.
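
E.g. something like this in svm_set_cpu_caps() (illustrative only, and keying
off the existing "nested" module param is just one option):

  	/*
  	 * Punt on nested support: don't advertise Idle HLT to the (L1)
  	 * guest when nested virtualization is enabled, so that KVM never
  	 * has to emulate the V_INTR/V_NMI checks on behalf of L1.
  	 */
  	if (!nested)
  		kvm_cpu_cap_check_and_set(X86_FEATURE_IDLE_HLT);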

I rewrote the changelog as I was going (I didn't expect to go down so many rabbit
holes), and ended up with the below.  Please review and let me know if I missed
any wrinkles and/or got anything wrong.

---
