Hello,

This RFC patch series provides a facility to dedicate CPUs to KVM guests
and to let the guests handle interrupts from passed-through PCI devices
directly (without a VM exit and relay by the host).

With this feature, we can improve the throughput and response time of the
device, and reduce the host's CPU usage, by cutting the overhead of
interrupt handling. This is good for applications that use devices with
very high throughput or very frequent interrupts (e.g. 10GbE NIC).
CPU-intensive high performance applications and real-time applications
also benefit from the CPU isolation feature, which reduces VM exits and
scheduling delay.

The current implementation is still just a PoC and has many limitations,
but it is submitted for RFC. Any comments are appreciated.

* Overview
Intel and AMD CPUs have a feature to handle interrupts by guests without
VM Exit. However, because it cannot switch VM Exit based on IRQ vectors,
interrupts to both the host and the guest will be routed to guests.

To avoid mixing host and guest interrupts, in this patch, some of the CPUs
are cut off from the host and dedicated to the guests. In addition, the
IRQ affinity of the passed-through devices is set to the guest CPUs only.

For IPIs from the host to the guest, we use NMIs, which are the only
interrupts that have a separate VM Exit flag.

* Benefits
This feature provides the benefits of virtualization to areas where high
performance and low latency are required, such as HPC and trading.
It is also useful for consolidation in large scale systems with many CPU
cores and PCI devices that are passed through or use SR-IOV.
In the future, it may be used to keep the guests running even if the host
has crashed (but that would need additional features like memory isolation).

* Limitations
The current implementation is experimental, unstable, and has a lot of
limitations.
 - SMP guests don't work correctly
 - Only Linux guests are supported
 - Only Intel VT-x is supported
 - Only MSI and MSI-X pass-through; no ISA interrupt support
 - Non passed-through PCI devices (including virtio) are slower
 - Kernel space PIT emulation does not work
 - Needs a lot of cleanups

* How to test
 - Create a guest VM with 1 CPU and some PCI passthrough devices (which
   support MSI/MSI-X). A guest without a VGA display is preferable.
 - Apply the patch at the end of this mail to qemu-kvm.
   (This patch is just for simple testing, and the dedicated CPU ID for
   the guest is hard-coded.)
 - Run the guest once to ensure the PCI passthrough works correctly.
 - Make the specified CPU offline.
     # echo 0 > /sys/devices/system/cpu/cpu3/online
 - Launch qemu-kvm with the -no-kvm-pit option.
   The offlined CPU is booted as a slave CPU and the guest runs on that CPU.

* Performance Example
Tested on a Xeon W3520 with a 10Gb NIC (ixgbe 82599EB), using SR-IOV to
share the device between the host and a guest. Using this NIC, we measured
communication performance (throughput, latency, CPU usage) between the
host and the guest.

                 w/direct interrupts handling  w/o direct interrupts handling
Throughput(*1)   11.4 Gbits/sec                8.91 Gbits/sec
Latency   (*2)   0.054 ms                      0.069 ms

*1) measured with `iperf -s' on the host and `iperf -c' on the guest.
*2) average `ping' RTT from the host to the guest

CPU Usage (top output)

- w/direct interrupts handling
Tasks: 200 total,   1 running, 199 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 41.1%id,  0.0%wa,  0.0%hi, 58.9%si,  0.0%st
Cpu1  :  0.0%us, 55.3%sy,  0.0%ni, 44.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 99.0%id,  0.7%wa,  0.3%hi,  0.0%si,  0.0%st
Mem:   6152492k total,  1921728k used,  4230764k free,    52544k buffers
Swap:  8159228k total,        0k used,  8159228k free,   890964k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
32307 root       0 -20  165m 1088  772 S 56.5  0.0   1:33.03 iperf
 1777 root      20   0     0    0    0 S  0.3  0.0   0:00.01 kworker/2:0
 2121 sekiyama  20   0 15260 1372 1008 R  0.3  0.0   0:00.12 top
28792 qemu      20   0  820m 532m 8808 S  0.3  8.9   0:06.10 qemu-kvm.custom
    1 root      20   0 37536 4684 2016 S  0.0  0.1   0:05.61 systemd

- w/o direct interrupts handling
Tasks: 193 total,   1 running, 192 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.7%sy,  0.0%ni, 22.2%id,  0.0%wa,  0.3%hi, 76.8%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 98.3%id,  0.0%wa,  1.7%hi,  0.0%si,  0.0%st
Cpu2  :  0.3%us, 74.7%sy,  0.0%ni, 23.0%id,  0.0%wa,  2.0%hi,  0.0%si,  0.0%st
Cpu3  : 94.7%us,  4.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.7%hi,  0.0%si,  0.0%st
Mem:   6152492k total,  1586520k used,  4565972k free,    47832k buffers
Swap:  8159228k total,        0k used,  8159228k free,   644460k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1747 qemu      20   0  844m 530m 8808 S 99.2  8.8   0:23.85 qemu-kvm.custom
 1929 root       0 -20  165m 1080  772 S 70.9  0.0   0:09.96 iperf
 1804 root     -51   0     0    0    0 S  3.0  0.0   0:00.45 irq/74-kvm:0000
 1803 root     -51   0     0    0    0 S  2.6  0.0   0:00.40 irq/73-kvm:0000
 1833 sekiyama  20   0 15260 1372 1004 R  0.3  0.0   0:00.13 top

With direct interrupt handling, guest execution does not appear in top
because the dedicated CPU is offlined from the host. CPU usage by the
interrupt relay kernel threads (irq/*-kvm:0000) is also reduced.

* Patch to qemu-kvm for testing

diff -u -r qemu-kvm-0.15.1/qemu-kvm-x86.c qemu-kvm-0.15.1-test/qemu-kvm-x86.c
--- qemu-kvm-0.15.1/qemu-kvm-x86.c	2011-10-19 22:54:48.000000000 +0900
+++ qemu-kvm-0.15.1-test/qemu-kvm-x86.c	2012-06-25 21:21:15.141557256 +0900
@@ -139,12 +139,28 @@
     return kvm_vcpu_ioctl(env, KVM_TPR_ACCESS_REPORTING, &tac);
 }
 
+static int kvm_set_slave_cpu(CPUState *env)
+{
+    int r, slave = 3;
+
+    r = kvm_ioctl(env->kvm_state, KVM_CHECK_EXTENSION, KVM_CAP_SLAVE_CPU);
+    if (r <= 0) {
+        return -ENOSYS;
+    }
+    r = kvm_vcpu_ioctl(env, KVM_SET_SLAVE_CPU, slave);
+    if (r < 0)
+        perror("kvm_set_slave_cpu");
+    return r;
+}
+
 static int _kvm_arch_init_vcpu(CPUState *env)
 {
     kvm_arch_reset_vcpu(env);
 
     kvm_enable_tpr_access_reporting(env);
 
+    kvm_set_slave_cpu(env);
+
     return kvm_update_ioport_access(env);
 }
 

---

Tomoki Sekiyama (18):
      x86: request TLB flush to slave CPU using NMI
      KVM: route assigned devices' MSI/MSI-X directly to guests on slave CPUs
      KVM: add kvm_arch_vcpu_prevent_run to prevent VM ENTER when NMI is received
      KVM: vmx: Add definitions PIN_BASED_PREEMPTION_TIMER
      KVM: Directly handle interrupts by guests without VM EXIT on slave CPUs
      x86/apic: IRQ vector remapping on slave for slave CPUs
      x86/apic: Enable external interrupt routing to slave CPUs
      KVM: no exiting from guest when slave CPU halted
      KVM: proxy slab operations for slave CPUs on online CPUs
      KVM: Go back to online CPU on VM exit by external interrupt
      KVM: Add KVM_GET_SLAVE_CPU and KVM_SET_SLAVE_CPU to vCPU ioctl
      KVM: handle page faults occured in slave CPUs on online CPUs
      KVM: Add facility to run guests on slave CPUs
      KVM: Enable/Disable virtualization on slave CPUs are activated/dying
      KVM: Replace local_irq_disable/enable with local_irq_save/restore
      x86: Support hrtimer on slave CPUs
      x86: Add a facility to use offlined CPUs as slave CPUs
      x86: Split memory hotplug function from cpu_up() as cpu_memory_up()


 arch/x86/Kconfig                      |   10 +
 arch/x86/include/asm/apic.h           |    4
 arch/x86/include/asm/cpu.h            |   14 +
 arch/x86/include/asm/irq.h            |   15 +
 arch/x86/include/asm/kvm_host.h       |   56 +++++
 arch/x86/include/asm/mmu.h            |    7 +
 arch/x86/include/asm/vmx.h            |    3
 arch/x86/kernel/apic/apic_flat_64.c   |    2
 arch/x86/kernel/apic/io_apic.c        |   89 ++++++-
 arch/x86/kernel/apic/x2apic_cluster.c |    6
 arch/x86/kernel/apic/x2apic_phys.c    |    2
 arch/x86/kernel/cpu/common.c          |    3
 arch/x86/kernel/smp.c                 |    2
 arch/x86/kernel/smpboot.c             |  188 +++++++++++++++
 arch/x86/kvm/irq.c                    |  136 +++++++++++
 arch/x86/kvm/lapic.c                  |    6
 arch/x86/kvm/mmu.c                    |   83 +++++--
 arch/x86/kvm/mmu.h                    |    4
 arch/x86/kvm/trace.h                  |    1
 arch/x86/kvm/vmx.c                    |   74 ++++++
 arch/x86/kvm/x86.c                    |  407 +++++++++++++++++++++++++++++++--
 arch/x86/mm/gup.c                     |    7 -
 arch/x86/mm/tlb.c                     |   63 +++++
 drivers/iommu/intel_irq_remapping.c   |   10 +
 include/linux/cpu.h                   |    9 +
 include/linux/cpumask.h               |   26 ++
 include/linux/kvm.h                   |    4
 include/linux/kvm_host.h              |    2
 kernel/cpu.c                          |   83 +++++--
 kernel/hrtimer.c                      |   22 ++
 kernel/irq/manage.c                   |    4
 kernel/irq/migration.c                |    2
 kernel/irq/proc.c                     |    2
 kernel/smp.c                          |    9 -
 virt/kvm/assigned-dev.c               |    8 +
 virt/kvm/async_pf.c                   |   17 +
 virt/kvm/kvm_main.c                   |   40 +++
 37 files changed, 1296 insertions(+), 124 deletions(-)


Thanks,
--
Tomoki Sekiyama <tomoki.sekiyama.qu@xxxxxxxxxxx>
Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory