> From: Sean Christopherson <seanjc@xxxxxxxxxx> > Sent: Saturday, October 8, 2022 1:52 AM > To: Mi, Dapeng1 <dapeng1.mi@xxxxxxxxx> > Cc: pbonzini@xxxxxxxxxx; tglx@xxxxxxxxxxxxx; mingo@xxxxxxxxxx; > dave.hansen@xxxxxxxxxxxxxxx; kvm@xxxxxxxxxxxxxxx; linux- > kernel@xxxxxxxxxxxxxxx; zhenyuw@xxxxxxxxxxxxxxx > Subject: Re: [PATCH] KVM: x86: disable halt polling when powersave governor is > used > > On Thu, Sep 15, 2022, Dapeng Mi wrote: > > Halt polling is enabled by default even through the CPU frequency > > governor is configured to powersave. Generally halt polling would > > consume extra power and this's not identical with the intent of > > powersave governor. > > > > disabling halt polling in powersave governor can save the precious > > power in power critical case. > > > > FIO random read test on Alder Lake platform shows halt polling > > occupies ~17% CPU utilization and consume 7% extra CPU power. > > After disabling halt polling, CPU has more chance to enter deeper > > C-states (C1E%: 25.3% -> 33.4%, C10%: 4.4% -> 17.4%). > > > > On Alder Lake platform, we don't find there are obvious performance > > downgrade after disabling halt polling on FIO and Netperf cases. > > Netperf UDP_RR case runs from two VMs locate on two different physical > > machines. > > > > FIO(MB/s) Base Disable-halt-polling Delta% > > Rand-read 432.6 436.3 0.8% > > > > Netperf Base Disable-halt-polling Delta% > > UDP_RR 509.8 508.5 -0.3% > > > > Signed-off-by: Dapeng Mi <dapeng1.mi@xxxxxxxxx> > > --- > > arch/x86/kvm/x86.c | 17 ++++++++++++++++- > > 1 file changed, 16 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index > > d7374d768296..c0eb6574cbbb 100644 > > --- a/arch/x86/kvm/x86.c > > +++ b/arch/x86/kvm/x86.c > > @@ -13015,7 +13015,22 @@ bool kvm_vector_hashing_enabled(void) > > > > bool kvm_arch_no_poll(struct kvm_vcpu *vcpu) { > > - return (vcpu->arch.msr_kvm_poll_control & 1) == 0; > > + struct cpufreq_policy *policy = cpufreq_cpu_get(vcpu->cpu); > > Preemption is not disabled at this point, which means that using vcpu->cpu is > potentially unsafe. Given that cpufreq is refcounting the returned object, I gotta > imaging get migrated to a different pCPU would be problematic. Thanks for pointing this. Per my learning, even vCPU migrates to a different pCPU in the progress of getting the cpufreq policy, the only consequence is to get an outdated (maybe inaccurate) policy value. Halt polling mechanism still can get the updated and correct cpufreq policy in next time. And even we disable preemption in the process of obtaining cpufreq policy, the vCPU is still possible to be migrated a different pCPU after enabling preemption and before calling the halt polling judging logic. > > > + bool powersave = false; > > I don't see anything in here that's x86 specific. Unless I'm missing something, > this belongs in common KVM. > Yes, this is generic. > > + > > + /* > > + * Halt polling could consume much CPU power, if CPU frequency > > + * governor is set to "powersave", disable halt polling. > > + */ > > + if (policy) { > > + if ((policy->policy == CPUFREQ_POLICY_POWERSAVE) || > > + (policy->governor && > > Indentation is messed up. Sure. Would change. > > > + !strncmp(policy->governor->name, > "powersave", > > KVM should not be comparing magic strings. If the cpufreq subsystem can't get > policy->policy right, then that needs to be fixed. Yeah, using magic strings looks a little bit strange, but this is what is cpufreq doing. Currently cpufreq mechanism supports two kinds of drivers, one is the driver which has the built-in governor, like intel_pstate driver. For this kind of driver, the cpufreq governor is saved in the policy->policy field. The other is the traditional driver which is independent with cpufreq governor and the cpufreq governor type is saved in the governor->name field. For the second kind of cpufreq driver, the policy->policy field is meaningless and we have to read the governor name. > > > + CPUFREQ_NAME_LEN))) > > + powersave = true; > > + cpufreq_cpu_put(policy); > > + } > > + return ((vcpu->arch.msr_kvm_poll_control & 1) == 0) || powersave; > > Doing all of the above work if polling is disabled is silly. Correct. Would change. > > > } > > EXPORT_SYMBOL_GPL(kvm_arch_no_poll); > > All in all, _if_ we want to do this automatically and not let userspace decide how > to manage powersave vs. halt-poll, I think this should be more like: Thanks for your sample. Would change in V2 patch. > > --- > virt/kvm/kvm_main.c | 21 ++++++++++++++++++++- > 1 file changed, 20 insertions(+), 1 deletion(-) > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index > e30f1b4ecfa5..01116859cb31 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -29,6 +29,7 @@ > #include <linux/file.h> > #include <linux/syscore_ops.h> > #include <linux/cpu.h> > +#include <linux/cpufreq.h> > #include <linux/sched/signal.h> > #include <linux/sched/mm.h> > #include <linux/sched/stat.h> > @@ -3483,6 +3484,23 @@ static inline void update_halt_poll_stats(struct > kvm_vcpu *vcpu, ktime_t start, > } > } > > +static bool kvm_cpufreq_no_halt_poll(struct kvm_vcpu *vcpu) { > + struct cpufreq_policy *policy; > + bool powersave = false; > + > + preempt_disable(); > + > + policy = cpufreq_cpu_get(vcpu->cpu); > + if (policy) { > + powersave = (policy->policy == CPUFREQ_POLICY_POWERSAVE); > + cpufreq_cpu_put(policy); > + } > + > + preempt_enable(); > + return powersave; > +} > + > /* > * Emulate a vCPU halt condition, e.g. HLT on x86, WFI on arm, etc... If halt > * polling is enabled, busy wait for a short time before blocking to avoid the > @@ -3491,7 +3509,8 @@ static inline void update_halt_poll_stats(struct > kvm_vcpu *vcpu, ktime_t start, > */ > void kvm_vcpu_halt(struct kvm_vcpu *vcpu) { > - bool halt_poll_allowed = !kvm_arch_no_poll(vcpu); > + const bool halt_poll_allowed = !kvm_arch_no_poll(vcpu) && > + !kvm_cpufreq_no_halt_poll(vcpu); > bool do_halt_poll = halt_poll_allowed && vcpu->halt_poll_ns; > ktime_t start, cur, poll_end; > bool waited = false; > > base-commit: e18d6152ff0f41b7f01f9817372022df04e0d354 > --