> From: Sean Christopherson <seanjc@xxxxxxxxxx> > Sent: Wednesday, August 24, 2022 11:27 PM > To: Mi, Dapeng1 <dapeng1.mi@xxxxxxxxx> > Cc: rafael@xxxxxxxxxx; daniel.lezcano@xxxxxxxxxx; pbonzini@xxxxxxxxxx; linux- > pm@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; kvm@xxxxxxxxxxxxxxx; > zhenyuw@xxxxxxxxxxxxxxx > Subject: Re: [PATCH] KVM: x86: use TPAUSE to replace PAUSE in halt polling > > On Wed, Aug 24, 2022, Dapeng Mi wrote: > > TPAUSE is a new instruction on Intel processors which can instruct > > processor enters a power/performance optimized state. Halt polling > > uses PAUSE instruction to wait vCPU is waked up. The polling time > > could be long and cause extra power consumption in some cases. > > > > Use TPAUSE to replace the PAUSE instruction in halt polling to get a > > better power saving and performance. > > Better power savings, yes. Better performance? Not necessarily. Using TPAUSE > for a "successful" halt poll is likely to yield _worse_ performance from the > vCPU's perspective due to the increased latency. The Intel SDM says the C0.2 state which TPAUSE instruction currently enters in kernel would "Improves performance of the other SMT thread(s) on the same core." For some multi-thread workload, it could improve the performance. But considering the increased latency, I'm not sure how large the performance can improve. But we don't see obvious performance downgrade at least in our tests. > > > Signed-off-by: Dapeng Mi <dapeng1.mi@xxxxxxxxx> > > --- > > drivers/cpuidle/poll_state.c | 3 ++- > > include/linux/kvm_host.h | 20 ++++++++++++++++++++ > > virt/kvm/kvm_main.c | 2 +- > > 3 files changed, 23 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/cpuidle/poll_state.c > > b/drivers/cpuidle/poll_state.c index f7e83613ae94..51ec333cbf80 100644 > > --- a/drivers/cpuidle/poll_state.c > > +++ b/drivers/cpuidle/poll_state.c > > @@ -7,6 +7,7 @@ > > #include <linux/sched.h> > > #include <linux/sched/clock.h> > > #include <linux/sched/idle.h> > > +#include <linux/kvm_host.h> > > > > #define POLL_IDLE_RELAX_COUNT 200 > > > > @@ -25,7 +26,7 @@ static int __cpuidle poll_idle(struct cpuidle_device *dev, > > limit = cpuidle_poll_time(drv, dev); > > > > while (!need_resched()) { > > - cpu_relax(); > > + kvm_cpu_poll_pause(limit); > > poll_idle() absolutely should not be calling into KVM code. Ok, so how should we handle this case? Duplicate a piece of code? > > > if (loop_count++ < POLL_IDLE_RELAX_COUNT) > > continue; > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index > > f4519d3689e1..810e749949b7 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -35,6 +35,7 @@ > > #include <linux/interval_tree.h> > > #include <linux/rbtree.h> > > #include <linux/xarray.h> > > +#include <linux/delay.h> > > #include <asm/signal.h> > > > > #include <linux/kvm.h> > > @@ -2247,6 +2248,25 @@ static inline void > > kvm_handle_signal_exit(struct kvm_vcpu *vcpu) } #endif /* > > CONFIG_KVM_XFER_TO_GUEST_WORK */ > > > > +/* > > + * This function is intended to replace the cpu_relax function in > > + * halt polling. If TPAUSE instruction is supported, use TPAUSE > > + * instead fo PAUSE to get better power saving and performance. > > + * Selecting 1 us is a compromise between scheduling latency and > > + * power saving time. > > + */ > > +static inline void kvm_cpu_poll_pause(u64 timeout_ns) { #ifdef > > +CONFIG_X86 > > This is not preferred the way to insert arch-specific behavior into common KVM > code. > Assuming the goal is to avoid a function call, use an #ifndef here and then > #define the flag in x86's kvm_host.h, e.g. > > #ifndef CONFIG_HAVE_KVM_ARCH_HALT_POLL_PAUSE > static inline kvm_cpu_halt_poll_pause(u64 timeout_ns) { > cpu_relax(); > } > #endif > > It's not obvious that we need to avoid a call here though, in which case a > > __weak void kvm_arch_cpu_halt_poll_pause(struct kvm *kvm) > { > > } > > with an x86 implementation will suffice. Sure. Thanks. > > > > + if (static_cpu_has(X86_FEATURE_WAITPKG) && timeout_ns > 1000) > > + udelay(1); > > This is far too arbitrary. Wake events from other vCPU are not necessarily > accompanied by an IRQ, which means that delaying for 1us may really truly > delay for 1us before detecting a pending wake event. > > If this is something we want to utilize in KVM, it should be controllable by > userspace, probably via module param, and likely off by default. > > E.g. > > unsigned int halt_poll_tpause_ns; > > and then > > if (timeout_ns >= halt_poll_tpause_ns) > udelay(halt_poll_tpause_ns); > > with halt_poll_tpause_ns zeroed out during setup if TPAUSE isn't supported. I see. Thanks. > > I say "if", because I think this needs to come with performance numbers to show > the impact on guest latency so that KVM and its users can make an informed > decision. > And if it's unlikely that anyone will ever want to enable TPAUSE for halt polling, > then it's not worth the extra complexity in KVM. I ever run two scheduling related benchmarks, hackbench and schbench, I didn't see there are obvious performance impact. Here are the hackbench and schbench data on Intel ADL platform. Hackbench base TPAUSE %delta Group-1 0.056 0.052 7.14% Group-4 0.165 0.164 0.61% Group-8 0.313 0.309 1.28% Group-16 0.834 0.842 -0.96% Schbench - Latency percentiles (usec) base TPAUSE ./schbench -m 1 50.0th 15 13 99.0th 221 203 ./schbench -m 2 50.0th 26 23 99.0th 16368 16544 ./schbench -m 4 50.0th 56 60 99.0th 33984 34112 Anyway, I would run more performance tests and see whether there are obvious performance impact. > > > + else > > + cpu_relax(); > > +#else > > + cpu_relax(); > > +#endif > > +} > > + > > /* > > * This defines how many reserved entries we want to keep before we > > * kick the vcpu to the userspace to avoid dirty ring full. This > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index > > 584a5bab3af3..4afa776d21bd 100644 > > --- a/virt/kvm/kvm_main.c > > +++ b/virt/kvm/kvm_main.c > > @@ -3510,7 +3510,7 @@ void kvm_vcpu_halt(struct kvm_vcpu *vcpu) > > */ > > if (kvm_vcpu_check_block(vcpu) < 0) > > goto out; > > - cpu_relax(); > > + kvm_cpu_poll_pause(vcpu->halt_poll_ns); > > This is wrong, vcpu->halt_poll_ns is the total poll time, not the time remaining. > E.g. if the max poll time is 1001 ns, and KVM has already waited for 1000 ns, > then > udelay(1) will cause KVM to wait for ~2000ns total. There's always going to be > some amount of overrun, but overrun by a few ns is quite different than overrun > by a few thousand ns. Yes, actually I calculate the remaining time and pass it to TPAUSE in earlier change. But this would make the code become complex comparing "cpu_relax". I'm concerning the complex code may impact the performance, so just simply it. > > > poll_end = cur = ktime_get(); > > } while (kvm_vcpu_can_poll(cur, stop)); > > } > > -- > > 2.34.1 > >