> On Sep 9, 2024, at 3:11 PM, Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > !-------------------------------------------------------------------| > CAUTION: External Email > > |-------------------------------------------------------------------! > > On Fri, Sep 06, 2024, Jon Kohler wrote: >> delay_halt_fn uses __tpause() with TPAUSE_C02_STATE, which is the power >> optimized version of tpause, which according to documentation [3] is >> a slower wakeup latency and higher power savings, with an added benefit >> of being more SMT yield friendly. >> >> For datacenter, latency sensitive workloads, this is problematic as >> the call to kvm_wait_lapic_expire happens directly prior to reentry >> through vmx_vcpu_enter_exit, which is the exact wrong place for slow >> wakeup latency. > > ... > >> So, with all of that said, there are a few things that could be done, >> and I'm definitely open to ideas: >> 1. Update delay_halt_tpause to use TPAUSE_C01_STATE unilaterally, which >> anecdotally seems inline with the spirit of how AMD implemented >> MWAITX, which uses the same delay_halt loop, and calls mwaitx with >> MWAITX_DISABLE_CSTATES. >> 2. Provide system level configurability to delay.c to optionally use >> C01 as a config knob, maybe a compile leve setting? That way distros >> aiming at low energy deployments could use that, but otherwise >> default is low latency instead? >> 3. Provide some different delay API that KVM could call, indicating it >> wants low wakeup latency delays, if hardware supports it? >> 4. Pull this code into kvm code directly (boooooo?) and manage it >> directly instead of using delay.c (boooooo?) >> 5. Something else? > > The option that would likely give the best of both worlds would be to prioritize > lower wakeup latency for "small" delays. That could be done in __delay() and/or > in KVM. E.g. delay_halt_tpause() quite clearly assumes a relatively long delay, > which is a flawed assumption in this case. > > /* > * Hard code the deeper (C0.2) sleep state because exit latency is > * small compared to the "microseconds" that usleep() will delay. > */ > __tpause(TPAUSE_C02_STATE, edx, eax); > > The reason I say "and/or KVM" is that even without TPAUSE in the picture, it might > make sense for KVM to avoid __delay() for anything but long delays. Both because > the overhead of e.g. delay_tsc() could be higher than the delay itself, but also > because the intent of KVM's delay is somewhat unique. > > By definition, KVM _knows_ there is an IRQ that is being deliver to the vCPU, i.e. > entering the guest and running the vCPU asap is a priority. The _only_ reason KVM > is waiting is to not violate the architecture. Reducing power consumption and > even letting an SMT sibling run are arguably non-goals, i.e. it might be best for > KVM to avoid even regular ol' PAUSE in this specific scenario, unless the wait > time is so high that delaying VM-Enter more than the absolute bare minimum > becomes a worthwhile tradeoff. Hey Sean - thanks for the sage advice as always. How about something like this for the regular ol’ PAUSE route? Note: the ndelay side would likely be a bit more annoying to handle to internalize to KVM, but perhaps we could just have delay library return the amount of cycles, and then do the loop I’ve got as a separate, KVM only func? --- a/arch/x86/kvm/lapic.c +++ b/arch/x86/kvm/lapic.c @@ -1627,16 +1627,39 @@ static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu) static inline void __wait_lapic_expire(struct kvm_vcpu *vcpu, u64 guest_cycles) { u64 timer_advance_ns = vcpu->arch.apic->lapic_timer.timer_advance_ns; + u64 start, end, delay_by = 0; /* * If the guest TSC is running at a different ratio than the host, then - * convert the delay to nanoseconds to achieve an accurate delay. Note - * that __delay() uses delay_tsc whenever the hardware has TSC, thus - * always for VMX enabled hardware. + * convert the delay to nanoseconds to achieve an accurate delay. + * + * Note: open code delay function as KVM's use case is a bit special, as + * we know we need to reenter the guest at a specific time; however, the + * delay library may introduce architectural delays that we do not want, + * such as using TPAUSE. Our mission is to simply get into the guest as + * soon as possible without violating architectural constraints. + * RFC: Keep ndelay for help converting to nsec? or pull that in too? */ if (vcpu->arch.tsc_scaling_ratio == kvm_caps.default_tsc_scaling_ratio) { - __delay(min(guest_cycles, - nsec_to_cycles(vcpu, timer_advance_ns))); + delay_by = min(guest_cycles, + nsec_to_cycles(vcpu, timer_advance_ns)); + + if (delay_by == 0) { + return; + } else { + start = rdtsc(); + + for (;;) { + cpu_relax(); + end = rdtsc(); + + if (delay_by <= end - start) + break; + + delay_by -= end - start; + start = end; + } + } } else { u64 delay_ns = guest_cycles * 1000000ULL; do_div(delay_ns, vcpu->arch.virtual_tsc_khz);