On 2/27/2025 8:02 PM, Sean Christopherson wrote: > On Thu, Feb 27, 2025, Manali Shukla wrote: >> On 2/26/2025 3:37 AM, Sean Christopherson wrote: >>>> @@ -5225,6 +5230,8 @@ static __init void svm_set_cpu_caps(void) >>>> if (vnmi) >>>> kvm_cpu_cap_set(X86_FEATURE_VNMI); >>>> >>>> + kvm_cpu_cap_check_and_set(X86_FEATURE_IDLE_HLT); >>> >>> I am 99% certain this is wrong. Or at the very least, severly lacking an >>> explanation of why it's correct. If L1 enables Idle HLT but not HLT interception, >>> then it is KVM's responsibility to NOT exit to L1 if there is a pending V_IRQ or >>> V_NMI. >>> >>> Yeah, it's buggy. But, it's buggy in part because *existing* KVM support is buggy. >>> If L1 disables HLT exiting, but it's enabled in KVM, then KVM will run L2 with >>> HLT exiting and so it becomes KVM's responsibility to check for valid L2 wake events >>> prior to scheduling out the vCPU if L2 executes HLT. E.g. nVMX handles this by >>> reading vmcs02.GUEST_INTERRUPT_STATUS.RVI as part of vmx_has_nested_events(). I >>> don't see the equivalent in nSVM. >>> >>> Amusingly, that means Idle HLT is actually a bug fix to some extent. E.g. if there >>> is a pending V_IRQ/V_NMI in vmcb02, then running with Idle HLT will naturally do >>> the right thing, i.e. not hang the vCPU. >>> >>> Anyways, for now, I think the easiest and best option is to simply skip full nested >>> support for the moment. >>> >> >> Got it, I see the issue you're talking about. I'll need to look into it a bit more to >> fully understand it. So yeah, we can hold off on full nested support for idle HLT >> intercept for now. >> >> Since we are planning to disable Idle HLT support on nested guests, should we do >> something like this ? >> >> @@ -167,10 +167,15 @@ void recalc_intercepts(struct vcpu_svm *svm) >> if (!nested_svm_l2_tlb_flush_enabled(&svm->vcpu)) >> vmcb_clr_intercept(c, INTERCEPT_VMMCALL); >> >> + if (!guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_IDLE_HLT)) >> + vmcb_clr_intercept(c, INTERCEPT_IDLE_HLT); >> + >> >> When recalc_intercepts copies the intercept values from vmc01 to vmcb02, it also copies >> the IDLE HLT intercept bit, which is set to 1 in vmcb01. Normally, this isn't a problem >> because the HLT intercept takes priority when it's on. But if the HLT intercept gets >> turned off for some reason, the IDLE HLT intercept will stay on, which is not what we >> want. > > Why don't we want that? The idle-HLT intercept remains '1' for the L2 guest. Now, when L2 executes HLT and there is no pending event available, it will still do idle-HLT exit, although Idle HLT was never explicitly enabled on L2 guest. I found this behavior by modifying the ipi_hlt_test a bit. static void l2_guest_code(void) { uint64_t icr_val; int i; x2apic_enable(); icr_val = (APIC_DEST_SELF | APIC_INT_ASSERT | INTR_VECTOR); for (i = 0; i < NUM_ITERATIONS; i++) { cli(); x2apic_write_reg(APIC_ICR, icr_val); safe_halt(); GUEST_ASSERT(READ_ONCE(irq_received)); WRITE_ONCE(irq_received, false); asm volatile("hlt"); } GUEST_DONE(); } static void l1_svm_code(struct svm_test_data *svm) { unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE]; struct vmcb *vmcb = svm->vmcb; generic_svm_setup(svm, l2_guest_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]); vmcb->control.intercept &= ~BIT(INTERCEPT_HLT); run_guest(vmcb, svm->vmcb_gpa); GUEST_DONE(); } Let me know if I am missing something. -Manali