On 20.05.19 14:07, Christian Borntraeger wrote:
>
>
> On 20.05.19 13:51, Paolo Bonzini wrote:
>> On 17/05/19 19:48, Marcelo Tosatti wrote:
>>>
>>> The cpuidle_kvm driver allows the guest vcpus to poll for a specified
>>> amount of time before halting. This provides the following benefits
>>> to host side polling:
>>>
>>> 	1) The POLL flag is set while polling is performed, which allows
>>> 	   a remote vCPU to avoid sending an IPI (and the associated
>>> 	   cost of handling the IPI) when performing a wakeup.
>>>
>>> 	2) The HLT VM-exit cost can be avoided.
>>>
>>> The downside of guest side polling is that polling is performed
>>> even with other runnable tasks in the host.
>>>
>>> Results comparing halt_poll_ns and server/client application
>>> where a small packet is ping-ponged:
>>>
>>> host                                        --> 31.33
>>> halt_poll_ns=300000 / no guest busy spin    --> 33.40   (93.8%)
>>> halt_poll_ns=0 / guest_halt_poll_ns=300000  --> 32.73   (95.7%)
>>>
>>> For the SAP HANA benchmarks (where idle_spin is a parameter
>>> of the previous version of the patch, results should be the
>>> same):
>>>
>>> hpns == halt_poll_ns
>>>
>>>                            idle_spin=0/   idle_spin=800/   idle_spin=0/
>>>                            hpns=200000    hpns=0           hpns=800000
>>> DeleteC06T03 (100 thread)  1.76           1.71 (-3%)       1.78 (+1%)
>>> InsertC16T02 (100 thread)  2.14           2.07 (-3%)       2.18 (+1.8%)
>>> DeleteC00T01 (1 thread)    1.34           1.28 (-4.5%)     1.29 (-3.7%)
>>> UpdateC00T03 (1 thread)    4.72           4.18 (-12%)      4.53 (-5%)
>>
>> Hi Marcelo,
>>
>> some quick observations:
>>
>> 1) This is actually not KVM-specific, so the name and placement of the
>> docs should be adjusted.
>>
>> 2) Regarding KVM-specific code, however, we could add an MSR so that KVM
>> disables halt_poll_ns for this VM when this is active in the guest?
>
> The whole code looks pretty much architecture independent. I have also seen cases
> on s390 where this kind of code would make sense. Can we try to make this
> usable for other archs as well?

I did a quick hack (not yet for the list as it contains some ugliness),
and the code seems to run OK on s390.
So any chance to move this into drivers/cpuidle/ so that !x86 can also enable
that when appropriate?

I actually agree with Paolo that we should disable host halt polling as soon as
the guest does it. Maybe we should have some arch-specific callback (that can
be an MSR).

>
>
>>
>> 3) The spin time could use the same adaptive algorithm that KVM uses in
>> the host.
>>
>> Thanks,
>>
>> Paolo
>>
>>
>>> ---
>>>  Documentation/virtual/kvm/guest-halt-polling.txt |   39 ++++++++
>>>  arch/x86/Kconfig                                 |    9 +
>>>  arch/x86/kernel/Makefile                         |    1
>>>  arch/x86/kernel/cpuidle_kvm.c                    |  105 +++++++++++++++++++++++
>>>  arch/x86/kernel/process.c                        |    2
>>>  5 files changed, 155 insertions(+), 1 deletion(-)
>>>
>>> Index: linux-2.6.git/arch/x86/Kconfig
>>> ===================================================================
>>> --- linux-2.6.git.orig/arch/x86/Kconfig 2019-04-22 13:49:42.858303265 -0300
>>> +++ linux-2.6.git/arch/x86/Kconfig 2019-05-16 14:18:41.254852745 -0300
>>> @@ -805,6 +805,15 @@
>>>  	  underlying device model, the host provides the guest with
>>>  	  timing infrastructure such as time of day, and system time
>>>
>>> +config KVM_CPUIDLE
>>> +	tristate "KVM cpuidle driver"
>>> +	depends on KVM_GUEST
>>> +	default y
>>> +	help
>>> +	  This option enables KVM cpuidle driver, which allows to poll
>>> +	  before halting in the guest (more efficient than polling in the
>>> +	  host via halt_poll_ns for some scenarios).
>>> +
>>>  config PVH
>>>  	bool "Support for running PVH guests"
>>>  	---help---
>>> Index: linux-2.6.git/arch/x86/kernel/Makefile
>>> ===================================================================
>>> --- linux-2.6.git.orig/arch/x86/kernel/Makefile 2019-04-22 13:49:42.869303331 -0300
>>> +++ linux-2.6.git/arch/x86/kernel/Makefile 2019-05-17 12:59:51.673274881 -0300
>>> @@ -112,6 +112,7 @@
>>>  obj-$(CONFIG_DEBUG_NMI_SELFTEST) += nmi_selftest.o
>>>
>>>  obj-$(CONFIG_KVM_GUEST) += kvm.o kvmclock.o
>>> +obj-$(CONFIG_KVM_CPUIDLE) += cpuidle_kvm.o
>>>  obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_$(BITS).o
>>>  obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
>>>  obj-$(CONFIG_PARAVIRT_CLOCK) += pvclock.o
>>> Index: linux-2.6.git/arch/x86/kernel/process.c
>>> ===================================================================
>>> --- linux-2.6.git.orig/arch/x86/kernel/process.c 2019-04-22 13:49:42.876303374 -0300
>>> +++ linux-2.6.git/arch/x86/kernel/process.c 2019-05-17 13:19:18.055435117 -0300
>>> @@ -580,7 +580,7 @@
>>>  	safe_halt();
>>>  	trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
>>>  }
>>> -#ifdef CONFIG_APM_MODULE
>>> +#if defined(CONFIG_APM_MODULE) || defined(CONFIG_KVM_CPUIDLE_MODULE)
>>>  EXPORT_SYMBOL(default_idle);
>>>  #endif
>>>
>>> Index: linux-2.6.git/arch/x86/kernel/cpuidle_kvm.c
>>> ===================================================================
>>> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
>>> +++ linux-2.6.git/arch/x86/kernel/cpuidle_kvm.c 2019-05-17 13:38:02.553941356 -0300
>>> @@ -0,0 +1,105 @@
>>> +// SPDX-License-Identifier: GPL-2.0
>>> +/*
>>> + * cpuidle driver for KVM guests.
>>> + *
>>> + * Copyright 2019 Red Hat, Inc. and/or its affiliates.
>>> + *
>>> + * This work is licensed under the terms of the GNU GPL, version 2.  See
>>> + * the COPYING file in the top-level directory.
>>> + *
>>> + * Authors: Marcelo Tosatti <mtosatti@xxxxxxxxxx>
>>> + */
>>> +
>>> +#include <linux/init.h>
>>> +#include <linux/cpuidle.h>
>>> +#include <linux/module.h>
>>> +#include <linux/timekeeping.h>
>>> +#include <linux/sched/idle.h>
>>> +
>>> +unsigned int guest_halt_poll_ns;
>>> +module_param(guest_halt_poll_ns, uint, 0644);
>>> +
>>> +static int kvm_enter_idle(struct cpuidle_device *dev,
>>> +			  struct cpuidle_driver *drv, int index)
>>> +{
>>> +	int do_halt = 0;
>>> +
>>> +	/* No polling */
>>> +	if (guest_halt_poll_ns == 0) {
>>> +		if (current_clr_polling_and_test()) {
>>> +			local_irq_enable();
>>> +			return index;
>>> +		}
>>> +		default_idle();
>>> +		return index;
>>> +	}
>>> +
>>> +	local_irq_enable();
>>> +	if (!current_set_polling_and_test()) {
>>> +		ktime_t now, end_spin;
>>> +
>>> +		now = ktime_get();
>>> +		end_spin = ktime_add_ns(now, guest_halt_poll_ns);
>>> +
>>> +		while (!need_resched()) {
>>> +			cpu_relax();
>>> +			now = ktime_get();
>>> +
>>> +			if (!ktime_before(now, end_spin)) {
>>> +				do_halt = 1;
>>> +				break;
>>> +			}
>>> +		}
>>> +	}
>>> +
>>> +	if (do_halt) {
>>> +		/*
>>> +		 * No events while busy spin window passed,
>>> +		 * halt.
>>> +		 */
>>> +		local_irq_disable();
>>> +		if (current_clr_polling_and_test()) {
>>> +			local_irq_enable();
>>> +			return index;
>>> +		}
>>> +		default_idle();
>>> +	} else {
>>> +		current_clr_polling();
>>> +	}
>>> +
>>> +	return index;
>>> +}
>>> +
>>> +static struct cpuidle_driver kvm_idle_driver = {
>>> +	.name = "kvm_idle",
>>> +	.owner = THIS_MODULE,
>>> +	.states = {
>>> +		{ /* entry 0 is for polling */ },
>>> +		{
>>> +			.enter			= kvm_enter_idle,
>>> +			.exit_latency		= 0,
>>> +			.target_residency	= 0,
>>> +			.power_usage		= -1,
>>> +			.name			= "KVM",
>>> +			.desc			= "KVM idle",
>>> +		},
>>> +	},
>>> +	.safe_state_index = 0,
>>> +	.state_count = 2,
>>> +};
>>> +
>>> +static int __init kvm_cpuidle_init(void)
>>> +{
>>> +	return cpuidle_register(&kvm_idle_driver, NULL);
>>> +}
>>> +
>>> +static void __exit kvm_cpuidle_exit(void)
>>> +{
>>> +	cpuidle_unregister(&kvm_idle_driver);
>>> +}
>>> +
>>> +module_init(kvm_cpuidle_init);
>>> +module_exit(kvm_cpuidle_exit);
>>> +MODULE_LICENSE("GPL");
>>> +MODULE_AUTHOR("Marcelo Tosatti <mtosatti@xxxxxxxxxx>");
>>> +
>>> Index: linux-2.6.git/Documentation/virtual/kvm/guest-halt-polling.txt
>>> ===================================================================
>>> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
>>> +++ linux-2.6.git/Documentation/virtual/kvm/guest-halt-polling.txt 2019-05-17 13:36:39.274703710 -0300
>>> @@ -0,0 +1,39 @@
>>> +KVM guest halt polling
>>> +======================
>>> +
>>> +The cpuidle_kvm driver allows the guest vcpus to poll for a specified
>>> +amount of time before halting. This provides the following benefits
>>> +to host side polling:
>>> +
>>> +	1) The POLL flag is set while polling is performed, which allows
>>> +	   a remote vCPU to avoid sending an IPI (and the associated
>>> +	   cost of handling the IPI) when performing a wakeup.
>>> +
>>> +	2) The HLT VM-exit cost can be avoided.
>>> +
>>> +The downside of guest side polling is that polling is performed
>>> +even with other runnable tasks in the host.
>>> +
>>> +Module Parameters
>>> +=================
>>> +
>>> +The cpuidle_kvm module has 1 tuneable module parameter: guest_halt_poll_ns,
>>> +the amount of time, in nanoseconds, that polling is performed before
>>> +halting.
>>> +
>>> +This module parameter can be set from the debugfs files in:
>>> +
>>> +	/sys/module/cpuidle_kvm/parameters/
>>> +
>>> +Further Notes
>>> +=============
>>> +
>>> +- Care should be taken when setting the guest_halt_poll_ns parameter as a
>>> +large value has the potential to drive the cpu usage to 100% on a machine which
>>> +would be almost entirely idle otherwise.
>>> +
>>> +- The effective amount of time that polling is performed is the host poll
>>> +value (see halt-polling.txt) plus guest_halt_poll_ns. If all guests
>>> +on a host system support and have properly configured guest_halt_poll_ns,
>>> +then setting halt_poll_ns to 0 in the host is probably the best choice.
>>> +
>>>
>>
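
To make point 3) above (an adaptive spin time) a bit more concrete, here is a
rough userspace sketch of a grow/shrink policy for the poll window. It is not
taken from the patch and is not claimed to match the host code; the function
name poll_window_update() and all POLL_* constants are hypothetical and the
values are picked for illustration only.

```c
#include <stdint.h>

/* Hypothetical tunables; the values are for illustration only. */
#define POLL_MAX_NS		300000u	/* upper bound for the poll window */
#define POLL_GROW_START_NS	10000u	/* first non-zero window */
#define POLL_GROW		2u	/* multiply the window on grow */
#define POLL_SHRINK		2u	/* divide the window on shrink */

/*
 * Return the next poll window, given the current window and how long the
 * vCPU was idle (polling plus halted) before the wakeup arrived.
 */
static uint32_t poll_window_update(uint32_t cur_ns, uint64_t idle_ns)
{
	/* The wakeup arrived while still polling: keep the window. */
	if (idle_ns <= cur_ns)
		return cur_ns;

	/* Long sleep: polling could not have helped, so shrink. */
	if (idle_ns > POLL_MAX_NS)
		return cur_ns / POLL_SHRINK;

	/* The wakeup was missed by a little: a larger window would help. */
	if (cur_ns == 0)
		return POLL_GROW_START_NS;
	cur_ns *= POLL_GROW;
	return cur_ns > POLL_MAX_NS ? POLL_MAX_NS : cur_ns;
}
```

In a driver, the returned value would presumably be kept per CPU and used in
place of the fixed guest_halt_poll_ns when computing end_spin, with the module
parameter acting as the upper bound instead of POLL_MAX_NS.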