KVM: x86: __wait_lapic_expire silently using TPAUSE C0.2

Jon Kohler <jon@xxxxxxxxxxx> · Fri, 6 Sep 2024 17:57:17 +0000

Reaching out to report an observation and get some advice.

Comments in __wait_lapic_expire introduced in [1] are no longer
completely accurate, as __delay() will not call delay_tsc on systems
that support WAITPKG, such as Intel Sapphire Rapids and later.
Instead, such systems will have their delay_fn configured to do
delay_halt, which calls delay_halt_fn in a loop until the requested
number of cycles has passed. This was introduced in [2].

delay_halt_fn uses __tpause() with TPAUSE_C02_STATE, the power
optimized version of tpause, which according to documentation [3] has
slower wakeup latency and higher power savings, with an added benefit
of being more SMT yield friendly.
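
To make the call path above concrete, here is a rough userspace model
of the delay selection and the halt loop. It is a sketch, not the
kernel source: every model_* name, the fake TSC, and the 1000-cycle
early-wakeup cap are made up for illustration. Only the two
TPAUSE_C0x_STATE values (as I read them in asm/mwait.h) and the
general shape of the loop are meant to track arch/x86/lib/delay.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* TPAUSE hint values, as I read them in arch/x86/include/asm/mwait.h. */
#define TPAUSE_C02_STATE 0	/* deeper state, slower wakeup */
#define TPAUSE_C01_STATE 1	/* lighter state, faster wakeup */

/* Simulated hardware: a fake TSC plus a record of what TPAUSE was asked. */
static uint64_t fake_tsc;
static int last_tpause_state = -1;
static unsigned int tpause_calls;

static uint64_t model_rdtsc(void) { return fake_tsc; }

/*
 * Stand-in for TPAUSE. The real instruction can return before the
 * deadline, so the model wakes after at most 1000 cycles (arbitrary).
 */
static void model_tpause(int state, uint64_t until)
{
	uint64_t remaining = until - fake_tsc;

	last_tpause_state = state;
	tpause_calls++;
	fake_tsc += remaining < 1000 ? remaining : 1000;
}

/* Stand-in for the TSC busy-wait: time simply passes. */
static void model_delay_tsc(uint64_t cycles) { fake_tsc += cycles; }

/* The hint is hardcoded here; callers have no way to influence it. */
static void model_delay_halt_tpause(uint64_t start, uint64_t cycles)
{
	model_tpause(TPAUSE_C02_STATE, start + cycles);
}

static void (*delay_halt_fn)(uint64_t start, uint64_t cycles);
static void (*delay_fn)(uint64_t cycles) = model_delay_tsc;

/* Re-arm the halt until the requested number of cycles has elapsed. */
static void model_delay_halt(uint64_t cycles)
{
	uint64_t start, end;

	if (!cycles)
		return;
	assert(delay_halt_fn != NULL);

	start = model_rdtsc();
	for (;;) {
		delay_halt_fn(start, cycles);
		end = model_rdtsc();
		if (cycles <= end - start)
			break;
		cycles -= end - start;
		start = end;
	}
}

/* On WAITPKG systems the TSC delay is swapped out for the halt loop. */
void model_select_delay(bool has_waitpkg)
{
	if (has_waitpkg && delay_fn == model_delay_tsc) {
		delay_halt_fn = model_delay_halt_tpause;
		delay_fn = model_delay_halt;
	}
}

/* What a caller such as __wait_lapic_expire ends up invoking. */
void model_delay(uint64_t cycles) { delay_fn(cycles); }
```

In this model, calling model_select_delay(true) and then
model_delay(2500) takes three TPAUSE calls, each one requesting C0.2,
and the caller never gets a say in the state.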

For datacenter, latency sensitive workloads, this is problematic as
the call to kvm_wait_lapic_expire happens directly prior to reentry
through vmx_vcpu_enter_exit, which is the exact wrong place for slow
wakeup latency.

Intel has a nice paper [4] that talks about TPAUSE in the context of
getting better power utilization using DPDK polling, which has a bunch
of neat measurements, facts, and figures. 

One stands out: according to figure 5 in Intel's paper, TPAUSE has
3.7 times the exit latency coming out of C0.2 when compared to C0.1,
but it only saves ~15% power when comparing these two states.

Using TPAUSE_C02_STATE seems like the wrong behavior, given that the
spirit of kvm_wait_lapic_expire seems to be to delay ever so slightly
and then jump back into the guest as soon as that delay is over. If
we're going to have TPAUSE in the critical path, I *think* it should be
using TPAUSE_C01_STATE; however, callers currently have no way to
signal that at all.

Side note:
It's worth noting also that the delay_halt call does not do the same
things that delay_tsc does, which calls preempt_{enable|disable}() in
a loop until the delay period is over. I'm not sure one way or the
other if this is the behavior we wanted in kvm_wait_lapic_expire in the
first place, so I'll reserve judgement.
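
For anyone who doesn't have delay_tsc paged in, the preempt pattern I
mean looks roughly like the userspace model below. This is a sketch
from my reading of the code, not a copy of it: the model_* helpers,
the fake TSC, and the 100-cycle pause are invented, and the CPU
migration handling that the real function does is left out.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t fake_tsc;
static int preempt_count;		/* > 0 means preemption is disabled */
static unsigned int preempt_windows;	/* times preemption was re-enabled */

static void model_preempt_disable(void) { preempt_count++; }
static void model_preempt_enable(void)  { preempt_count--; }
static uint64_t model_rdtsc(void) { return fake_tsc; }

/* PAUSE stand-in: 100 cycles (arbitrary) pass per spin. */
static void model_cpu_relax(void) { fake_tsc += 100; }

void model_delay_tsc(uint64_t cycles)
{
	uint64_t start, now;

	model_preempt_disable();
	start = model_rdtsc();
	for (;;) {
		now = model_rdtsc();
		if (now - start >= cycles)
			break;

		/* Briefly allow preemption so other tasks can run. */
		model_preempt_enable();
		preempt_windows++;
		model_cpu_relax();
		model_preempt_disable();
	}
	model_preempt_enable();

	/* Every disable must have been paired with an enable. */
	assert(preempt_count == 0);
}
```

The halt loop has no equivalent of those preemption windows, which is
the difference I'm pointing at.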

So, with all of that said, there are a few things that could be done,
and I'm definitely open to ideas:
1. Update delay_halt_tpause to use TPAUSE_C01_STATE unilaterally, which
anecdotally seems in line with the spirit of how AMD implemented
MWAITX, which uses the same delay_halt loop, and calls mwaitx with
MWAITX_DISABLE_CSTATES.
2. Provide system level configurability in delay.c to optionally use
C01 via a config knob, maybe a compile-time setting? That way distros
aiming at low energy deployments could keep C02, but otherwise the
default is low latency instead?
3. Provide some different delay API that KVM could call, indicating it
wants low wakeup latency delays, if hardware supports it?
4. Pull this code into kvm code directly (boooooo?) and manage it
directly instead of using delay.c (boooooo?)
5. Something else?
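
To give options 2 and 3 a bit more shape, here is a small userspace
sketch of what a hinted delay could look like. Everything in it is
hypothetical: enum delay_latency_hint, DELAY_HINT_DEFAULT,
MODEL_CONFIG_DELAY_LOW_LATENCY, tpause_state_for_hint() and
model_delay_hinted() do not exist in the kernel, and the wait itself
is elided. Only the two TPAUSE_C0x_STATE values are meant to match
asm/mwait.h.

```c
#include <assert.h>
#include <stdint.h>

/* TPAUSE hint values, as I read them in arch/x86/include/asm/mwait.h. */
#define TPAUSE_C02_STATE 0	/* deeper: more savings, slower wakeup */
#define TPAUSE_C01_STATE 1	/* lighter: less savings, faster wakeup */

/* Hypothetical hint type; nothing like this exists in delay.c today. */
enum delay_latency_hint {
	DELAY_HINT_POWER_SAVE,
	DELAY_HINT_LOW_LATENCY,
};

/* Hypothetical compile-time default (option 2). */
#ifdef MODEL_CONFIG_DELAY_LOW_LATENCY
#define DELAY_HINT_DEFAULT DELAY_HINT_LOW_LATENCY
#else
#define DELAY_HINT_DEFAULT DELAY_HINT_POWER_SAVE
#endif

/* Map a caller's hint to the TPAUSE state the halt loop would request. */
int tpause_state_for_hint(enum delay_latency_hint hint)
{
	switch (hint) {
	case DELAY_HINT_LOW_LATENCY:
		return TPAUSE_C01_STATE;
	case DELAY_HINT_POWER_SAVE:
	default:
		return TPAUSE_C02_STATE;
	}
}

/* Last state a modeled delay asked for, so callers can be compared. */
static int last_requested_state = -1;

/* Hypothetical hinted entry point (option 3); the wait is elided. */
void model_delay_hinted(uint64_t cycles, enum delay_latency_hint hint)
{
	assert(cycles > 0);
	last_requested_state = tpause_state_for_hint(hint);
}
```

With something of this shape, KVM would pass DELAY_HINT_LOW_LATENCY
from kvm_wait_lapic_expire, while existing udelay() users would keep
whatever the build-time default is.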

[1] b6aa57c69cb ("KVM: lapic: Convert guest TSC to host time domain if necessary") 
[2] cec5f268cd0 ("x86/delay: Introduce TPAUSE delay") 
[3] https://www.felixcloutier.com/x86/tpause
[4] https://www.intel.com/content/www/us/en/content-details/751859/power-management-user-wait-instructions-power-saving-for-dpdk-pmd-polling-workloads-technology-guide.html