Re: [PATCH v6 2/3] KVM: SVM: Add Idle HLT intercept support

Manali Shukla <manali.shukla@xxxxxxx> · Thu, 27 Feb 2025 21:35:52 +0530

On 2/27/2025 8:02 PM, Sean Christopherson wrote:
> On Thu, Feb 27, 2025, Manali Shukla wrote:
>> On 2/26/2025 3:37 AM, Sean Christopherson wrote:
>>>> @@ -5225,6 +5230,8 @@ static __init void svm_set_cpu_caps(void)
>>>>  		if (vnmi)
>>>>  			kvm_cpu_cap_set(X86_FEATURE_VNMI);
>>>>  
>>>> +		kvm_cpu_cap_check_and_set(X86_FEATURE_IDLE_HLT);
>>>
>>> I am 99% certain this is wrong.  Or at the very least, severly lacking an
>>> explanation of why it's correct.  If L1 enables Idle HLT but not HLT interception,
>>> then it is KVM's responsibility to NOT exit to L1 if there is a pending V_IRQ or
>>> V_NMI.
>>>
>>> Yeah, it's buggy.  But, it's buggy in part because *existing* KVM support is buggy.
>>> If L1 disables HLT exiting, but it's enabled in KVM, then KVM will run L2 with
>>> HLT exiting and so it becomes KVM's responsibility to check for valid L2 wake events
>>> prior to scheduling out the vCPU if L2 executes HLT.  E.g. nVMX handles this by
>>> reading vmcs02.GUEST_INTERRUPT_STATUS.RVI as part of vmx_has_nested_events().  I
>>> don't see the equivalent in nSVM.
>>>
>>> Amusingly, that means Idle HLT is actually a bug fix to some extent.  E.g. if there
>>> is a pending V_IRQ/V_NMI in vmcb02, then running with Idle HLT will naturally do
>>> the right thing, i.e. not hang the vCPU.
>>>
>>> Anyways, for now, I think the easiest and best option is to simply skip full nested
>>> support for the moment.
>>>
>>
>> Got it, I see the issue you're talking about. I'll need to look into it a bit more to
>> fully understand it. So yeah, we can hold off on full nested support for idle HLT 
>> intercept for now.
>>
>> Since we are planning to disable Idle HLT support on nested guests, should we do
>> something like this ?
>>
>> @@ -167,10 +167,15 @@ void recalc_intercepts(struct vcpu_svm *svm)
>>         if (!nested_svm_l2_tlb_flush_enabled(&svm->vcpu))
>>                 vmcb_clr_intercept(c, INTERCEPT_VMMCALL);
>>
>> +       if (!guest_cpu_cap_has(&svm->vcpu, X86_FEATURE_IDLE_HLT))
>> +               vmcb_clr_intercept(c, INTERCEPT_IDLE_HLT);
>> +
>>
>> When recalc_intercepts copies the intercept values from vmc01 to vmcb02, it also copies
>> the IDLE HLT intercept bit, which is set to 1 in vmcb01. Normally, this isn't a problem 
>> because the HLT intercept takes priority when it's on. But if the HLT intercept gets 
>> turned off for some reason, the IDLE HLT intercept will stay on, which is not what we
>> want.
> 
> Why don't we want that?

The idle-HLT intercept remains '1' for the L2 guest. Now, when L2 executes HLT and there
is no pending event available, it will still do idle-HLT exit, although Idle HLT
was never explicitly enabled on L2 guest.

I found this behavior by modifying the ipi_hlt_test a bit.

static void l2_guest_code(void)
{
        uint64_t icr_val;
        int i;

        x2apic_enable();

        icr_val = (APIC_DEST_SELF | APIC_INT_ASSERT | INTR_VECTOR);

        for (i = 0; i < NUM_ITERATIONS; i++) {
                cli();
                x2apic_write_reg(APIC_ICR, icr_val);
                safe_halt();
                GUEST_ASSERT(READ_ONCE(irq_received));
                WRITE_ONCE(irq_received, false);
                asm volatile("hlt");
        }
        GUEST_DONE();
}
static void l1_svm_code(struct svm_test_data *svm)
{
        unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
        struct vmcb *vmcb = svm->vmcb;

        generic_svm_setup(svm, l2_guest_code,
                          &l2_guest_stack[L2_GUEST_STACK_SIZE]);
        vmcb->control.intercept &= ~BIT(INTERCEPT_HLT);

        run_guest(vmcb, svm->vmcb_gpa);
        GUEST_DONE();
}  

Let me know if I am missing something.

-Manali