On Thu, Feb 05, 2015 at 09:34:06PM -0200, Marcelo Tosatti wrote:
> On Thu, Feb 05, 2015 at 05:05:25PM +0100, Paolo Bonzini wrote:
> > This patch introduces a new module parameter for the KVM module; when
> > it is present, KVM attempts a bit of polling on every HLT before
> > scheduling itself out via kvm_vcpu_block.
> > 
> > This parameter helps a lot for latency-bound workloads---in particular
> > I tested it with O_DSYNC writes with a battery-backed disk in the
> > host.  In this case, writes are fast (because the data doesn't have to
> > go all the way to the platters) but they cannot be merged by either
> > the host or the guest.  KVM's performance here is usually around 30%
> > of bare metal, or 50% if you use cache=directsync or
> > cache=writethrough (these settings keep the guest from sending
> > pointless flush requests, and at the same time they are not slow
> > because of the battery-backed cache).  The bad performance happens
> > because on every halt the host CPU decides to halt itself too.  When
> > the interrupt comes, the vCPU thread is then migrated to a new
> > physical CPU, and in general the latency is horrible because the vCPU
> > thread has to be scheduled back in.
> > 
> > With this patch performance reaches 60-65% of bare metal and, more
> > importantly, 99% of what you get if you use idle=poll in the guest.
> > This means that the tunable gets rid of this particular bottleneck,
> > and more work can be done to improve performance in the kernel or
> > QEMU.
> > 
> > Of course there is some price to pay; every time an otherwise idle
> > vCPU is interrupted by an interrupt, it will poll unnecessarily and
> > thus impose a little load on the host.  The above results were
> > obtained with a mostly arbitrary value of the parameter (2000000),
> > and the load was around 1.5-2.5% CPU usage on one of the host's cores
> > for each idle guest vCPU.
> > 
> > The patch also adds a new stat,
> > /sys/kernel/debug/kvm/halt_successful_poll, that can be used to tune
> > the parameter.  It counts how many HLT instructions received an
> > interrupt during the polling period; each successful poll avoids
> > scheduling the VCPU thread out and back in, and may also avoid a
> > likely trip to C1 and back for the physical CPU.
> > 
> > While idle, a 4-VCPU Linux VM halts around 10 times per second.  Of
> > these halts, almost all are failed polls.  During the benchmark,
> > instead, basically all halts end within the polling period, except a
> > more or less constant stream of 50 per second coming from vCPUs that
> > are not running the benchmark.  The wasted time is thus very low.
> > Things may be slightly different for Windows VMs, which have a ~10 ms
> > timer tick.
> > 
> > The effect is also visible on Marcelo's recently introduced latency
> > test for the TSC deadline timer.  Though of course a non-RT kernel
> > has awful latency bounds, the latency of the timer is around
> > 8000-10000 clock cycles compared to 20000-120000 without setting
> > halt_poll.  For the TSC deadline timer the effect is thus both a
> > smaller average latency and a smaller variance.
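A quick way to check, on a given host and workload, whether the polling
window is actually paying off is to compare halt_successful_poll against
the existing halt_exits counter.  A minimal userspace sketch (only the
two debugfs paths come from the patch; the helper itself is
hypothetical) could look like this:

/* Hypothetical tuning helper: print what fraction of HLT exits were
 * satisfied within the polling window.  Assumes the KVM debugfs stats
 * are readable as plain decimal counters. */
#include <stdio.h>

static unsigned long long read_stat(const char *path)
{
	unsigned long long val = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%llu", &val) != 1)
			val = 0;
		fclose(f);
	}
	return val;
}

int main(void)
{
	unsigned long long exits =
		read_stat("/sys/kernel/debug/kvm/halt_exits");
	unsigned long long polls =
		read_stat("/sys/kernel/debug/kvm/halt_successful_poll");

	if (exits)
		printf("successful polls: %llu/%llu (%.1f%%)\n",
		       polls, exits, 100.0 * polls / exits);
	return 0;
}

Raising halt_poll until this ratio stops improving for the workload of
interest is one crude way to pick a value without instrumenting the
kernel any further.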
> > 
> > Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  1 +
> >  arch/x86/kvm/x86.c              | 28 ++++++++++++++++++++++++----
> >  include/linux/kvm_host.h        |  1 +
> >  virt/kvm/kvm_main.c             | 22 +++++++++++++++-------
> >  4 files changed, 41 insertions(+), 11 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 848947ac6ade..a236e39cc385 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -655,6 +655,7 @@ struct kvm_vcpu_stat {
> >  	u32 irq_window_exits;
> >  	u32 nmi_window_exits;
> >  	u32 halt_exits;
> > +	u32 halt_successful_poll;
> >  	u32 halt_wakeup;
> >  	u32 request_irq_exits;
> >  	u32 irq_exits;
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 1373e04e1f19..b7b20828f01c 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -96,6 +96,9 @@ EXPORT_SYMBOL_GPL(kvm_x86_ops);
> >  static bool ignore_msrs = 0;
> >  module_param(ignore_msrs, bool, S_IRUGO | S_IWUSR);
> >  
> > +unsigned int halt_poll = 0;
> > +module_param(halt_poll, uint, S_IRUGO | S_IWUSR);
> > +
> >  unsigned int min_timer_period_us = 500;
> >  module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
> >  
> > @@ -145,6 +148,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
> >  	{ "irq_window", VCPU_STAT(irq_window_exits) },
> >  	{ "nmi_window", VCPU_STAT(nmi_window_exits) },
> >  	{ "halt_exits", VCPU_STAT(halt_exits) },
> > +	{ "halt_successful_poll", VCPU_STAT(halt_successful_poll) },
> >  	{ "halt_wakeup", VCPU_STAT(halt_wakeup) },
> >  	{ "hypercalls", VCPU_STAT(hypercalls) },
> >  	{ "request_irq", VCPU_STAT(request_irq_exits) },
> > @@ -5819,13 +5823,29 @@ void kvm_arch_exit(void)
> >  int kvm_emulate_halt(struct kvm_vcpu *vcpu)
> >  {
> >  	++vcpu->stat.halt_exits;
> > -	if (irqchip_in_kernel(vcpu->kvm)) {
> > -		vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
> > -		return 1;
> > -	} else {
> > +	if (!irqchip_in_kernel(vcpu->kvm)) {
> >  		vcpu->run->exit_reason = KVM_EXIT_HLT;
> >  		return 0;
> >  	}
> > +
> > +	vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
> > +	if (halt_poll) {
> > +		u64 start, curr;
> > +		rdtscll(start);
> > +		do {
> > +			/*
> > +			 * This sets KVM_REQ_UNHALT if an interrupt
> > +			 * arrives.
> > +			 */
> > +			if (kvm_vcpu_check_block(vcpu) < 0) {
> > +				++vcpu->stat.halt_successful_poll;
> > +				break;
> > +			}
> > +			rdtscll(curr);
> > +		} while(!need_resched() && curr - start < halt_poll);
> > +	}
> > +
> > +	return 1;
> >  }
> 
> You want at least a basic procedure to estimate a value (it's a
> function of the device, after all).
> 
> Rather than from halt_successful_poll, I suppose the optimum can be
> estimated from a dataset containing entries of the form:
> 
> interrupt time - hlt time
> 
> Then choose a given value from that table.

I meant a given percentile, with stats compiled from that table.

> You can get the same out of halt_successful_poll, but it requires
> multiple runs of the test:
> 
> Set halt_poll, run test, record halt_successful_poll.
> Set halt_poll, run test, record halt_successful_poll.
> Set halt_poll, run test, record halt_successful_poll.
> ...
> 
> A crude histogram also works, to avoid recording all "interrupt time -
> hlt time" entries and processing them in userspace.
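For what it's worth, a minimal sketch of such a crude histogram could
look like the following.  It is written as plain C so it compiles
standalone; in the kernel it would use u64 and sit next to the
rdtscll() calls in kvm_emulate_halt().  The names, the bucket layout
and the percentile helper are all hypothetical, not part of the patch:

/* Power-of-two histogram of (interrupt time - hlt time) in TSC cycles. */
#include <stdint.h>

#define HALT_HIST_BUCKETS 32

static uint64_t halt_wakeup_hist[HALT_HIST_BUCKETS];
static uint64_t halt_wakeup_total;

/* Bucket i counts wakeups whose latency falls in [2^i, 2^(i+1)) cycles;
 * bucket 0 also absorbs latencies of 0 and 1 cycles. */
static void halt_hist_account(uint64_t hlt_tsc, uint64_t wakeup_tsc)
{
	uint64_t delta = wakeup_tsc - hlt_tsc;
	int bucket = 0;

	while (delta > 1 && bucket < HALT_HIST_BUCKETS - 1) {
		delta >>= 1;
		bucket++;
	}
	halt_wakeup_hist[bucket]++;
	halt_wakeup_total++;
}

/* Return a cycle count covering roughly the requested percentile of
 * observed wakeups; this is the candidate halt_poll value. */
static uint64_t halt_hist_percentile(unsigned int pct)
{
	uint64_t target = halt_wakeup_total * pct / 100;
	uint64_t seen = 0;
	int i;

	for (i = 0; i < HALT_HIST_BUCKETS; i++) {
		seen += halt_wakeup_hist[i];
		if (seen >= target)
			return (uint64_t)1 << (i + 1);
	}
	return (uint64_t)1 << HALT_HIST_BUCKETS;
}

Setting halt_poll to, say, halt_hist_percentile(95) would then match the
percentile-based estimate described above, at the cost of a handful of
counters instead of a full latency log.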