On Fri, Sep 06, 2024, Jon Kohler wrote:
> delay_halt_fn uses __tpause() with TPAUSE_C02_STATE, which is the power
> optimized version of tpause, which according to documentation [3] has
> slower wakeup latency and higher power savings, with an added benefit
> of being more SMT yield friendly.
>
> For datacenter, latency sensitive workloads, this is problematic as
> the call to kvm_wait_lapic_expire happens directly prior to reentry
> through vmx_vcpu_enter_exit, which is the exact wrong place for slow
> wakeup latency.

...

> So, with all of that said, there are a few things that could be done,
> and I'm definitely open to ideas:
> 1. Update delay_halt_tpause to use TPAUSE_C01_STATE unilaterally, which
>    anecdotally seems in line with the spirit of how AMD implemented
>    MWAITX, which uses the same delay_halt loop, and calls mwaitx with
>    MWAITX_DISABLE_CSTATES.
> 2. Provide system level configurability to delay.c to optionally use
>    C01 as a config knob, maybe a compile level setting? That way
>    distros aiming at low energy deployments could use that, but
>    otherwise default is low latency instead?
> 3. Provide some different delay API that KVM could call, indicating it
>    wants low wakeup latency delays, if hardware supports it?
> 4. Pull this code into kvm code directly (boooooo?) and manage it
>    directly instead of using delay.c (boooooo?)
> 5. Something else?

The option that would likely give the best of both worlds would be to
prioritize lower wakeup latency for "small" delays.  That could be done
in __delay() and/or in KVM.  E.g. delay_halt_tpause() quite clearly
assumes a relatively long delay, which is a flawed assumption in this
case.

	/*
	 * Hard code the deeper (C0.2) sleep state because exit latency is
	 * small compared to the "microseconds" that usleep() will delay.
	 */
	__tpause(TPAUSE_C02_STATE, edx, eax);

The reason I say "and/or KVM" is that even without TPAUSE in the
picture, it might make sense for KVM to avoid __delay() for anything but
long delays, both because the overhead of e.g. delay_tsc() could be
higher than the delay itself, and because the intent of KVM's delay is
somewhat unique.

By definition, KVM _knows_ there is an IRQ that is being delivered to
the vCPU, i.e. entering the guest and running the vCPU asap is a
priority.  The _only_ reason KVM is waiting is to not violate the
architecture.  Reducing power consumption and even letting an SMT
sibling run are arguably non-goals, i.e. it might be best for KVM to
avoid even regular ol' PAUSE in this specific scenario, unless the wait
time is so high that delaying VM-Enter more than the absolute bare
minimum becomes a worthwhile tradeoff.
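
On the __delay() side, delay_halt_tpause() could select the sleep state
based on the size of the requested delay.  Completely untested sketch,
with a threshold pulled out of thin air (a real patch would need to
derive the cutoff from the platform's actual C0.2 exit latency):

	static void delay_halt_tpause(u64 start, u64 cycles)
	{
		u64 until = start + cycles;
		u32 eax = lower_32_bits(until);
		u32 edx = upper_32_bits(until);

		/*
		 * Use the shallower C0.1 state when the requested delay is
		 * short enough that C0.2's exit latency would eat a
		 * meaningful chunk of the delay itself.  The 10k cycle
		 * threshold is made up purely for illustration.
		 */
		if (cycles < 10000)
			__tpause(TPAUSE_C01_STATE, edx, eax);
		else
			__tpause(TPAUSE_C02_STATE, edx, eax);
	}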
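
And if KVM were to handle "small" waits itself, the loop could skip even
PAUSE.  Again untested, with a hypothetical kvm_wait_cycles() helper and
a made up threshold, just to show the shape of it:

	/* Hypothetical cutoff, purely for illustration. */
	#define KVM_BUSY_WAIT_CYCLES	500

	static void kvm_wait_cycles(u64 cycles)
	{
		u64 end;

		/*
		 * For long waits, being nice to an SMT sibling and to the
		 * power budget starts to pay for itself, so defer to the
		 * generic delay code.
		 */
		if (cycles > KVM_BUSY_WAIT_CYCLES) {
			__delay(cycles);
			return;
		}

		/*
		 * No PAUSE, no TPAUSE: raw spin on the TSC, as the only
		 * goal is to reach VM-Enter the instant the architecture
		 * allows.
		 */
		end = rdtsc() + cycles;
		while (rdtsc() < end)
			barrier();
	}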