On Fri, Sep 06, 2024, Jon Kohler wrote:
> delay_halt_fn uses __tpause() with TPAUSE_C02_STATE, which is the power
> optimized version of tpause, which according to documentation [3] has
> slower wakeup latency and higher power savings, with an added benefit
> of being more SMT yield friendly.
>
> For datacenter, latency sensitive workloads, this is problematic as
> the call to kvm_wait_lapic_expire happens directly prior to reentry
> through vmx_vcpu_enter_exit, which is the exact wrong place for slow
> wakeup latency.

...

> So, with all of that said, there are a few things that could be done,
> and I'm definitely open to ideas:
> 1. Update delay_halt_tpause to use TPAUSE_C01_STATE unilaterally, which
>    anecdotally seems in line with the spirit of how AMD implemented
>    MWAITX, which uses the same delay_halt loop, and calls mwaitx with
>    MWAITX_DISABLE_CSTATES.
> 2. Provide system level configurability to delay.c to optionally use
>    C01 as a config knob, maybe a compile level setting? That way
>    distros aiming at low energy deployments could use that, but
>    otherwise default is low latency instead?
> 3. Provide some different delay API that KVM could call, indicating it
>    wants low wakeup latency delays, if hardware supports it?
> 4. Pull this code into kvm code directly (boooooo?) and manage it
>    directly instead of using delay.c (boooooo?)
> 5. Something else?

The option that would likely give the best of both worlds would be to
prioritize lower wakeup latency for "small" delays.  That could be done
in __delay() and/or in KVM.  E.g. delay_halt_tpause() quite clearly
assumes a relatively long delay, which is a flawed assumption in this
case.

	/*
	 * Hard code the deeper (C0.2) sleep state because exit latency is
	 * small compared to the "microseconds" that usleep() will delay.
	 */
	__tpause(TPAUSE_C02_STATE, edx, eax);

The reason I say "and/or KVM" is that even without TPAUSE in the
picture, it might make sense for KVM to avoid __delay() for anything but
long delays, both because the overhead of e.g. delay_tsc() could be
higher than the delay itself, and because the intent of KVM's delay is
somewhat unique.

By definition, KVM _knows_ there is an IRQ that is being delivered to
the vCPU, i.e. entering the guest and running the vCPU asap is a
priority.  The _only_ reason KVM is waiting is to not violate the
architecture.  Reducing power consumption and even letting an SMT
sibling run are arguably non-goals, i.e. it might be best for KVM to
avoid even regular ol' PAUSE in this specific scenario, unless the wait
time is so high that delaying VM-Enter more than the absolute bare
minimum becomes a worthwhile tradeoff.
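
On the __delay() side, delay_halt_tpause() could select the sleep state
based on the size of the requested delay.  Completely untested sketch,
with a threshold pulled out of thin air (a real patch would need to
derive the cutoff from the platform's actual C0.2 exit latency):

	static void delay_halt_tpause(u64 start, u64 cycles)
	{
		u64 until = start + cycles;
		u32 eax = lower_32_bits(until);
		u32 edx = upper_32_bits(until);

		/*
		 * Use the shallower C0.1 state when the requested delay is
		 * short enough that C0.2's exit latency would eat a
		 * meaningful chunk of the delay itself.  The 10k cycle
		 * threshold is made up purely for illustration.
		 */
		if (cycles < 10000)
			__tpause(TPAUSE_C01_STATE, edx, eax);
		else
			__tpause(TPAUSE_C02_STATE, edx, eax);
	}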
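
And if KVM were to handle "small" waits itself, the loop could skip even
PAUSE.  Again untested, with a hypothetical kvm_wait_cycles() helper and
a made up threshold, just to show the shape of it:

	/* Hypothetical cutoff, purely for illustration. */
	#define KVM_BUSY_WAIT_CYCLES	500

	static void kvm_wait_cycles(u64 cycles)
	{
		u64 end;

		/*
		 * For long waits, being nice to an SMT sibling and to the
		 * power budget starts to pay for itself, so defer to the
		 * generic delay code.
		 */
		if (cycles > KVM_BUSY_WAIT_CYCLES) {
			__delay(cycles);
			return;
		}

		/*
		 * No PAUSE, no TPAUSE: raw spin on the TSC, as the only
		 * goal is to reach VM-Enter the instant the architecture
		 * allows.
		 */
		end = rdtsc() + cycles;
		while (rdtsc() < end)
			barrier();
	}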