Reaching out to report an observation and get some advice.

The comments in __wait_lapic_expire introduced in [1] are no longer completely accurate, as __delay() will not call delay_tsc on systems that support WAITPKG, such as Intel Sapphire Rapids and later. Instead, such systems have their delay_fn set to delay_halt, which calls delay_halt_fn in a loop until the requested number of cycles has passed. This was introduced in [2].

delay_halt_fn uses __tpause() with TPAUSE_C02_STATE, which is the power-optimized version of tpause. According to the documentation [3], that state has slower wakeup latency and higher power savings, with the added benefit of being more SMT-yield friendly.

For datacenter, latency-sensitive workloads, this is problematic, as the call to kvm_wait_lapic_expire happens directly prior to reentry through vmx_vcpu_enter_exit, which is exactly the wrong place for slow wakeup latency.

Intel has a nice paper [4] that talks about TPAUSE in the context of getting better power utilization using DPDK polling, and it has a bunch of neat measurements, facts, and figures. One stands out: according to figure 5 of Intel's paper, TPAUSE has 3.7 times the exit latency coming out of C0.2 compared to C0.1, but only saves ~15% power when comparing the two states.

Using TPAUSE_C02_STATE seems like the wrong behavior, given that the spirit of kvm_wait_lapic_expire seems to be to delay ever so slightly and then jump back into the guest as soon as that delay is over.

If we're going to have TPAUSE in the critical path, I *think* it should be using TPAUSE_C01_STATE; however, there is currently no way to signal that.

Side note: the delay_halt call also does not do the same things as delay_tsc, which calls preempt_{enable|disable}() in its loop until the delay period is over. I'm not sure one way or the other whether this is the behavior we wanted in kvm_wait_lapic_expire in the first place, so I'll reserve judgement.

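To make the concern concrete, here is a rough userspace model of the delay_halt loop described above. It is only a sketch, not the kernel code: the fake TSC, the MAX_WAIT cap, the enum names, and the exit-latency numbers are all made up for illustration (only the 3.7x ratio between them comes from the figure cited from [4]).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two TPAUSE power states. */
enum halt_state { HALT_C02 = 0, HALT_C01 = 1 };

/* Simulated TSC; the real loop reads the hardware counter instead. */
static uint64_t fake_tsc;

/* Illustrative cap on a single wait, so the loop iterates more than once. */
#define MAX_WAIT 400u

/* Made-up exit latencies in cycles; only the 3.7x ratio is from [4]. */
static uint64_t exit_latency(enum halt_state state)
{
	assert(state == HALT_C01 || state == HALT_C02);
	return state == HALT_C01 ? 10 : 37;
}

/* Models one delay_halt_fn call: wait (possibly waking early), then wake up. */
static void halt_once(uint64_t cycles, enum halt_state state)
{
	uint64_t wait = cycles < MAX_WAIT ? cycles : MAX_WAIT;

	fake_tsc += wait + exit_latency(state);
}

/*
 * Models the delay_halt loop: keep halting until the requested number of
 * cycles has passed. Returns how many cycles past the request we woke up.
 */
uint64_t delay_halt_model(uint64_t requested, enum halt_state state)
{
	uint64_t begin = fake_tsc, start = fake_tsc, end;
	uint64_t cycles = requested;

	if (!cycles)
		return 0;

	for (;;) {
		halt_once(cycles, state);
		end = fake_tsc;
		if (cycles <= end - start)
			break;
		cycles -= end - start;
		start = end;
	}
	return (fake_tsc - begin) - requested;
}
```

In this model, the wakeups in the middle of the loop are absorbed into the remaining delay, but the last one is not: delay_halt_model(1000, HALT_C01) overshoots by 10 cycles and delay_halt_model(1000, HALT_C02) by 37, i.e. the full exit latency of the chosen state lands right before guest reentry.
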
So, with all of that said, there are a few things that could be done, and I'm definitely open to ideas:

1. Update delay_halt_tpause to use TPAUSE_C01_STATE unilaterally, which anecdotally seems in line with the spirit of how AMD implemented MWAITX, which uses the same delay_halt loop and calls mwaitx with MWAITX_DISABLE_CSTATES.
2. Provide system-level configurability in delay.c to optionally use C0.2 as a config knob, maybe a compile-time setting? That way distros aiming at low-energy deployments could use that, but otherwise the default is low latency instead?
3. Provide some different delay API that KVM could call, indicating it wants low wakeup latency delays, if hardware supports it?
4. Pull this code into kvm code directly (boooooo?) and manage it directly instead of using delay.c (boooooo?)
5. Something else?

[1] b6aa57c69cb ("KVM: lapic: Convert guest TSC to host time domain if necessary")
[2] cec5f268cd0 ("x86/delay: Introduce TPAUSE delay")
[3] https://www.felixcloutier.com/x86/tpause
[4] https://www.intel.com/content/www/us/en/content-details/751859/power-management-user-wait-instructions-power-saving-for-dpdk-pmd-polling-workloads-technology-guide.html