Re: KVM: x86: __wait_lapic_expire silently using TPAUSE C0.2

Jon Kohler <jon@xxxxxxxxxxx> · Tue, 10 Sep 2024 18:47:00 +0000

> On Sep 9, 2024, at 3:11 PM, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> 
> !-------------------------------------------------------------------|
>  CAUTION: External Email
> 
> |-------------------------------------------------------------------!
> 
> On Fri, Sep 06, 2024, Jon Kohler wrote:
>> delay_halt_fn uses __tpause() with TPAUSE_C02_STATE, which is the power
>> optimized version of tpause, which according to documentation [3] is
>> a slower wakeup latency and higher power savings, with an added benefit
>> of being more SMT yield friendly.
>> 
>> For datacenter, latency sensitive workloads, this is problematic as
>> the call to kvm_wait_lapic_expire happens directly prior to reentry
>> through vmx_vcpu_enter_exit, which is the exact wrong place for slow
>> wakeup latency.
> 
> ...
> 
>> So, with all of that said, there are a few things that could be done,
>> and I'm definitely open to ideas:
>> 1. Update delay_halt_tpause to use TPAUSE_C01_STATE unilaterally, which
>> anecdotally seems inline with the spirit of how AMD implemented
>> MWAITX, which uses the same delay_halt loop, and calls mwaitx with
>> MWAITX_DISABLE_CSTATES. 
>> 2. Provide system level configurability to delay.c to optionally use
>> C01 as a config knob, maybe a compile leve setting? That way distros
>> aiming at low energy deployments could use that, but otherwise
>> default is low latency instead?
>> 3. Provide some different delay API that KVM could call, indicating it
>> wants low wakeup latency delays, if hardware supports it?
>> 4. Pull this code into kvm code directly (boooooo?) and manage it
>> directly instead of using delay.c (boooooo?)
>> 5. Something else?
> 
> The option that would likely give the best of both worlds would be to prioritize
> lower wakeup latency for "small" delays.  That could be done in __delay() and/or
> in KVM.  E.g. delay_halt_tpause() quite clearly assumes a relatively long delay,
> which is a flawed assumption in this case.
> 
> /*
> * Hard code the deeper (C0.2) sleep state because exit latency is
> * small compared to the "microseconds" that usleep() will delay.
> */
> __tpause(TPAUSE_C02_STATE, edx, eax);
> 
> The reason I say "and/or KVM" is that even without TPAUSE in the picture, it might
> make sense for KVM to avoid __delay() for anything but long delays.  Both because
> the overhead of e.g. delay_tsc() could be higher than the delay itself, but also
> because the intent of KVM's delay is somewhat unique.
> 
> By definition, KVM _knows_ there is an IRQ that is being deliver to the vCPU, i.e.
> entering the guest and running the vCPU asap is a priority.  The _only_ reason KVM
> is waiting is to not violate the architecture.  Reducing power consumption and
> even letting an SMT sibling run are arguably non-goals, i.e. it might be best for
> KVM to avoid even regular ol' PAUSE in this specific scenario, unless the wait
> time is so high that delaying VM-Enter more than the absolute bare minimum
> becomes a worthwhile tradeoff.

Hey Sean - thanks for the sage advice as always.

How about something like this for the regular ol’ PAUSE route?

Note: the ndelay side would likely be a bit more annoying to handle to internalize
to KVM, but perhaps we could just have delay library return the amount of cycles,
and then do the loop I’ve got as a separate, KVM only func?

--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1627,16 +1627,39 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
 static inline void __wait_lapic_expire(struct kvm_vcpu *vcpu, u64 guest_cycles)
 {
        u64 timer_advance_ns = vcpu->arch.apic->lapic_timer.timer_advance_ns;
+       u64 start, end, delay_by = 0;
 
        /*
         * If the guest TSC is running at a different ratio than the host, then
-        * convert the delay to nanoseconds to achieve an accurate delay.  Note
-        * that __delay() uses delay_tsc whenever the hardware has TSC, thus
-        * always for VMX enabled hardware.
+        * convert the delay to nanoseconds to achieve an accurate delay.
+        *
+        * Note: open code delay function as KVM's use case is a bit special, as
+        * we know we need to reenter the guest at a specific time; however, the
+        * delay library may introduce architectural delays that we do not want,
+        * such as using TPAUSE. Our mission is to simply get into the guest as
+        * soon as possible without violating architectural constraints.
+        * RFC: Keep ndelay for help converting to nsec? or pull that in too?
         */
        if (vcpu->arch.tsc_scaling_ratio == kvm_caps.default_tsc_scaling_ratio) {
-               __delay(min(guest_cycles,
-                       nsec_to_cycles(vcpu, timer_advance_ns)));
+               delay_by = min(guest_cycles, 
+                                          nsec_to_cycles(vcpu, timer_advance_ns));
+
+               if (delay_by == 0) {
+                       return;
+               } else {
+                       start = rdtsc();
+
+                       for (;;) {
+                               cpu_relax();
+                               end = rdtsc();
+
+                               if (delay_by <= end - start)
+                                       break;
+
+                               delay_by -= end - start;
+                               start = end;
+                       }
+               }
        } else {
                u64 delay_ns = guest_cycles * 1000000ULL;
                do_div(delay_ns, vcpu->arch.virtual_tsc_khz);