On Thu, Feb 05, 2015 at 09:34:06PM -0200, Marcelo Tosatti wrote:
> On Thu, Feb 05, 2015 at 05:05:25PM +0100, Paolo Bonzini wrote:
> > This patch introduces a new module parameter for the KVM module; when
> > it is present, KVM attempts a bit of polling on every HLT before
> > scheduling itself out via kvm_vcpu_block.
> > 
> > This parameter helps a lot for latency-bound workloads---in particular
> > I tested it with O_DSYNC writes with a battery-backed disk in the
> > host.  In this case, writes are fast (because the data doesn't have to
> > go all the way to the platters) but they cannot be merged by either
> > the host or the guest.  KVM's performance here is usually around 30%
> > of bare metal, or 50% if you use cache=directsync or
> > cache=writethrough (these settings keep the guest from sending
> > pointless flush requests, and at the same time they are not slow
> > because of the battery-backed cache).  The bad performance happens
> > because on every halt the host CPU decides to halt itself too.  When
> > the interrupt comes, the vCPU thread is then migrated to a new
> > physical CPU, and in general the latency is horrible because the vCPU
> > thread has to be scheduled back in.
> > 
> > With this patch performance reaches 60-65% of bare metal and, more
> > importantly, 99% of what you get if you use idle=poll in the guest.
> > This means that the tunable gets rid of this particular bottleneck,
> > and more work can be done to improve performance in the kernel or
> > QEMU.
> > 
> > Of course there is some price to pay; every time an otherwise idle
> > vCPU is interrupted by an interrupt, it will poll unnecessarily and
> > thus impose a little load on the host.  The above results were
> > obtained with a mostly arbitrary value of the parameter (2000000),
> > and the load was around 1.5-2.5% CPU usage on one of the host's cores
> > for each idle guest vCPU.
> > 
> > The patch also adds a new stat,
> > /sys/kernel/debug/kvm/halt_successful_poll, that can be used to tune
> > the parameter.  It counts how many HLT instructions received an
> > interrupt during the polling period; each successful poll avoids
> > scheduling the VCPU thread out and back in, and may also avoid a
> > likely trip to C1 and back for the physical CPU.
> > 
> > While idle, a 4-VCPU Linux VM halts around 10 times per second.  Of
> > these halts, almost all are failed polls.  During the benchmark,
> > instead, basically all halts end within the polling period, except a
> > more or less constant stream of 50 per second coming from vCPUs that
> > are not running the benchmark.  The wasted time is thus very low.
> > Things may be slightly different for Windows VMs, which have a ~10 ms
> > timer tick.
> > 
> > The effect is also visible on Marcelo's recently introduced latency
> > test for the TSC deadline timer.  Though of course a non-RT kernel
> > has awful latency bounds, the latency of the timer is around
> > 8000-10000 clock cycles compared to 20000-120000 without setting
> > halt_poll.  For the TSC deadline timer the effect is thus both a
> > smaller average latency and a smaller variance.
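A quick way to check, on a given host and workload, whether the polling
window is actually paying off is to compare halt_successful_poll against
the existing halt_exits counter.  A minimal userspace sketch (only the
two debugfs paths come from the patch; the helper itself is
hypothetical) could look like this:

/* Hypothetical tuning helper: print what fraction of HLT exits were
 * satisfied within the polling window.  Assumes the KVM debugfs stats
 * are readable as plain decimal counters. */
#include <stdio.h>

static unsigned long long read_stat(const char *path)
{
	unsigned long long val = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%llu", &val) != 1)
			val = 0;
		fclose(f);
	}
	return val;
}

int main(void)
{
	unsigned long long exits =
		read_stat("/sys/kernel/debug/kvm/halt_exits");
	unsigned long long polls =
		read_stat("/sys/kernel/debug/kvm/halt_successful_poll");

	if (exits)
		printf("successful polls: %llu/%llu (%.1f%%)\n",
		       polls, exits, 100.0 * polls / exits);
	return 0;
}

Raising halt_poll until this ratio stops improving for the workload of
interest is one crude way to pick a value without instrumenting the
kernel any further.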
> > 
> > Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  1 +
> >  arch/x86/kvm/x86.c              | 28 ++++++++++++++++++++++++----
> >  include/linux/kvm_host.h        |  1 +
> >  virt/kvm/kvm_main.c             | 22 +++++++++++++++-------
> >  4 files changed, 41 insertions(+), 11 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 848947ac6ade..a236e39cc385 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -655,6 +655,7 @@ struct kvm_vcpu_stat {
> >  	u32 irq_window_exits;
> >  	u32 nmi_window_exits;
> >  	u32 halt_exits;
> > +	u32 halt_successful_poll;
> >  	u32 halt_wakeup;
> >  	u32 request_irq_exits;
> >  	u32 irq_exits;
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 1373e04e1f19..b7b20828f01c 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -96,6 +96,9 @@ EXPORT_SYMBOL_GPL(kvm_x86_ops);
> >  static bool ignore_msrs = 0;
> >  module_param(ignore_msrs, bool, S_IRUGO | S_IWUSR);
> >  
> > +unsigned int halt_poll = 0;
> > +module_param(halt_poll, uint, S_IRUGO | S_IWUSR);
> > +
> >  unsigned int min_timer_period_us = 500;
> >  module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
> >  
> > @@ -145,6 +148,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
> >  	{ "irq_window", VCPU_STAT(irq_window_exits) },
> >  	{ "nmi_window", VCPU_STAT(nmi_window_exits) },
> >  	{ "halt_exits", VCPU_STAT(halt_exits) },
> > +	{ "halt_successful_poll", VCPU_STAT(halt_successful_poll) },
> >  	{ "halt_wakeup", VCPU_STAT(halt_wakeup) },
> >  	{ "hypercalls", VCPU_STAT(hypercalls) },
> >  	{ "request_irq", VCPU_STAT(request_irq_exits) },
> > @@ -5819,13 +5823,29 @@ void kvm_arch_exit(void)
> >  int kvm_emulate_halt(struct kvm_vcpu *vcpu)
> >  {
> >  	++vcpu->stat.halt_exits;
> > -	if (irqchip_in_kernel(vcpu->kvm)) {
> > -		vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
> > -		return 1;
> > -	} else {
> > +	if (!irqchip_in_kernel(vcpu->kvm)) {
> >  		vcpu->run->exit_reason = KVM_EXIT_HLT;
> >  		return 0;
> >  	}
> > +
> > +	vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
> > +	if (halt_poll) {
> > +		u64 start, curr;
> > +		rdtscll(start);
> > +		do {
> > +			/*
> > +			 * This sets KVM_REQ_UNHALT if an interrupt
> > +			 * arrives.
> > +			 */
> > +			if (kvm_vcpu_check_block(vcpu) < 0) {
> > +				++vcpu->stat.halt_successful_poll;
> > +				break;
> > +			}
> > +			rdtscll(curr);
> > +		} while(!need_resched() && curr - start < halt_poll);
> > +	}
> > +
> > +	return 1;
> >  }
> 
> You want at least a basic procedure to estimate a value (it's a
> function of the device, after all).
> 
> Rather than from halt_successful_poll, I suppose the optimum can be
> estimated from a dataset containing entries of the form:
> 
> interrupt time - hlt time
> 
> Then choose a given value from that table.

I meant a given percentile, with stats compiled from that table.

> You can get the same out of halt_successful_poll, but it requires
> multiple runs of the test:
> 
> Set halt_poll, run test, record halt_successful_poll.
> Set halt_poll, run test, record halt_successful_poll.
> Set halt_poll, run test, record halt_successful_poll.
> ...
> 
> A crude histogram also works, to avoid recording all "interrupt time -
> hlt time" entries and processing them in userspace.
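For what it's worth, a minimal sketch of such a crude histogram could
look like the following.  It is written as plain C so it compiles
standalone; in the kernel it would use u64 and sit next to the
rdtscll() calls in kvm_emulate_halt().  The names, the bucket layout
and the percentile helper are all hypothetical, not part of the patch:

/* Power-of-two histogram of (interrupt time - hlt time) in TSC cycles. */
#include <stdint.h>

#define HALT_HIST_BUCKETS 32

static uint64_t halt_wakeup_hist[HALT_HIST_BUCKETS];
static uint64_t halt_wakeup_total;

/* Bucket i counts wakeups whose latency falls in [2^i, 2^(i+1)) cycles;
 * bucket 0 also absorbs latencies of 0 and 1 cycles. */
static void halt_hist_account(uint64_t hlt_tsc, uint64_t wakeup_tsc)
{
	uint64_t delta = wakeup_tsc - hlt_tsc;
	int bucket = 0;

	while (delta > 1 && bucket < HALT_HIST_BUCKETS - 1) {
		delta >>= 1;
		bucket++;
	}
	halt_wakeup_hist[bucket]++;
	halt_wakeup_total++;
}

/* Return a cycle count covering roughly the requested percentile of
 * observed wakeups; this is the candidate halt_poll value. */
static uint64_t halt_hist_percentile(unsigned int pct)
{
	uint64_t target = halt_wakeup_total * pct / 100;
	uint64_t seen = 0;
	int i;

	for (i = 0; i < HALT_HIST_BUCKETS; i++) {
		seen += halt_wakeup_hist[i];
		if (seen >= target)
			return (uint64_t)1 << (i + 1);
	}
	return (uint64_t)1 << HALT_HIST_BUCKETS;
}

Setting halt_poll to, say, halt_hist_percentile(95) would then match the
percentile-based estimate described above, at the cost of a handful of
counters instead of a full latency log.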