Double scheduling is a concern with virtualization hosts: the host
schedules vcpus without knowing what is run by the vcpu, and the guest
schedules tasks without knowing where the vcpu is physically running.
This causes issues related to latency, power consumption, resource
utilization, etc. An ideal solution would be a cooperative scheduling
framework where the guest and host share scheduling-related information
and make educated scheduling decisions to optimally handle the
workloads. As a first step, we are taking a stab at reducing latencies
for latency-sensitive workloads in the guest.

This series of patches aims to implement a framework for dynamically
managing the priority of vcpu threads based on the needs of the workload
running on the vcpu. Latency-sensitive workloads (nmi, irq, softirq,
critical sections, RT tasks, etc.) get a boost from the host so as to
minimize latency.

The host can proactively boost the vcpu threads when it has enough
information about what is going to run on the vcpu, for example when
injecting interrupts. In the remaining cases, the guest can request a
boost if the vcpu is not already boosted. The guest can subsequently
request an unboost after the latency-sensitive workload completes.

A shared memory region is used to communicate the scheduling
information. The guest shares its need for priority boosting and the
host shares the boosting status of the vcpu. The guest sets a flag when
it needs a boost and continues running. The host reads this flag on the
next VMEXIT and boosts the vcpu thread. Unboosting is done synchronously
so that host workloads can fairly compete with guests when the guest is
not running any latency-sensitive workload.

This RFC is x86 specific. It is mostly feature complete, but more work
needs to be done in the following areas:
- Use of paravirt ops framework.
- Optimizing critical paths for speed, cache efficiency, etc.
- Extensibility of this idea for sharing more scheduling information to
  make better educated scheduling decisions in guest and host.
- Prevent misuse by rogue/buggy guest kernels.

Tests
-----

A real world workload on ChromeOS shows considerable improvement. Audio
and video applications running on low end devices experience high
latencies when the system is under load. This series helps mitigate the
audio and video glitches caused by scheduling latencies.

The following are the results from the oboetester app on an Android VM
running in ChromeOS. This app tests for audio glitches.

 --------------------------------------------------------
 |             |      Noload       ||       Busy        |
 | Buffer Size |----------------------------------------|
 |             | Vanilla | Patches || Vanilla | Patches |
 --------------------------------------------------------
 | 96 (2ms)    |   20    |    4    ||  1365   |   67    |
 --------------------------------------------------------
 | 256 (4ms)   |    3    |    1    ||   524   |   23    |
 --------------------------------------------------------
 | 512 (10ms)  |    0    |    0    ||   25    |   24    |
 --------------------------------------------------------

 Noload: Tests run on idle system
 Busy: Busy system simulated by Speedometer benchmark

The test shows a considerable reduction in glitches, especially with
smaller buffer sizes.

The following data were collected from a few micro benchmark tests.
cyclictest was run on a VM to measure the latency with and without the
patches. We also took a baseline of the results with all vcpus
statically boosted to RT (chrt). This is to observe the difference
between dynamic and static boosting and its effect on the host as well.
cyclictest is run on both host and guest: on the guest to observe the
effect of the patches on the guest, and on the host to see whether the
patches affect workloads on the host.

cyclictest cmdline: "cyclictest -q -D 90s -i 500 -d $INTERVAL"
 where $INTERVAL used was 500 and 1000 us.

Host is Intel N4500 4C/4T. Guest also has 4 vcpus.
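For readers less familiar with what the cyclictest numbers mean, the
measurement can be sketched as below. This is a simplified,
single-threaded model written for illustration, not cyclictest's actual
code: it sleeps until an absolute deadline and records how late the
wakeup was. The names lat_stats, lat_stats_add and
measure_wakeup_latency are hypothetical, and the real tool additionally
handles thread priorities and per-thread intervals.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000LL

/* Running min/avg/max of wakeup latency, in nanoseconds (hypothetical). */
struct lat_stats {
	int64_t min_ns;
	int64_t max_ns;
	int64_t sum_ns;
	uint64_t samples;
};

static void lat_stats_init(struct lat_stats *s)
{
	s->min_ns = INT64_MAX;
	s->max_ns = INT64_MIN;
	s->sum_ns = 0;
	s->samples = 0;
}

static void lat_stats_add(struct lat_stats *s, int64_t lat_ns)
{
	if (lat_ns < s->min_ns)
		s->min_ns = lat_ns;
	if (lat_ns > s->max_ns)
		s->max_ns = lat_ns;
	s->sum_ns += lat_ns;
	s->samples++;
}

static int64_t lat_stats_avg(const struct lat_stats *s)
{
	return s->samples ? s->sum_ns / (int64_t)s->samples : 0;
}

static int64_t ts_to_ns(const struct timespec *ts)
{
	return (int64_t)ts->tv_sec * NSEC_PER_SEC + ts->tv_nsec;
}

/*
 * Sleep until an absolute deadline `loops` times, `interval_ns` apart,
 * and record how late each wakeup was. Returns 0 on success, -1 on error.
 */
static int measure_wakeup_latency(struct lat_stats *s, int64_t interval_ns,
				  unsigned int loops)
{
	struct timespec next, now;

	assert(interval_ns > 0 && interval_ns < NSEC_PER_SEC);
	if (clock_gettime(CLOCK_MONOTONIC, &next))
		return -1;

	for (unsigned int i = 0; i < loops; i++) {
		/* Advance the deadline by one interval, carrying into seconds. */
		next.tv_nsec += interval_ns;
		if (next.tv_nsec >= NSEC_PER_SEC) {
			next.tv_nsec -= NSEC_PER_SEC;
			next.tv_sec++;
		}
		if (clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL))
			return -1;
		if (clock_gettime(CLOCK_MONOTONIC, &now))
			return -1;
		/* Latency = actual wakeup time - intended wakeup time. */
		lat_stats_add(s, ts_to_ns(&now) - ts_to_ns(&next));
	}
	return 0;
}
```

With this model, a 500 us interval corresponds to calling
measure_wakeup_latency(&s, 500 * 1000, loops). The Avg and Max columns
in the tables correspond to lat_stats_avg() and max_ns, converted to
microseconds.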
In the following tables:
 Vanilla: baseline, vanilla kernel
 Dynamic: the patches applied
 Static: baseline, all vcpus statically boosted to RT (chrt)

Idle tests
----------
The host is idle and cyclictest is run on both host and guest.

-----------------------------------------------------------------------
|          |   Avg Latency(us): Guest   ||    Avg Latency(us): Host   |
-----------------------------------------------------------------------
| Interval | vanilla | dynamic | static || vanilla | dynamic | static |
-----------------------------------------------------------------------
|   500    |       9 |       9 |     10 ||       5 |       3 |      3 |
-----------------------------------------------------------------------
|   1000   |      34 |      35 |     35 ||       5 |       3 |      3 |
-----------------------------------------------------------------------

-----------------------------------------------------------------------
|          |   Max Latency(us): Guest   ||    Max Latency(us): Host   |
-----------------------------------------------------------------------
| Interval | vanilla | dynamic | static || vanilla | dynamic | static |
-----------------------------------------------------------------------
|   500    |    1577 |    1433 |    140 ||    1577 |    1526 |  15969 |
-----------------------------------------------------------------------
|   1000   |    6649 |     765 |    204 ||     697 |     174 |   2444 |
-----------------------------------------------------------------------

Busy Tests
----------
Here a busy host was simulated using stress-ng, and cyclictest was run
on both host and guest.

-----------------------------------------------------------------------
|          |   Avg Latency(us): Guest   ||    Avg Latency(us): Host   |
-----------------------------------------------------------------------
| Interval | vanilla | dynamic | static || vanilla | dynamic | static |
-----------------------------------------------------------------------
|   500    |     887 |      21 |     25 ||       6 |       6 |      7 |
-----------------------------------------------------------------------
|   1000   |    6335 |      45 |     38 ||      11 |      11 |     14 |
-----------------------------------------------------------------------

-----------------------------------------------------------------------
|          |   Max Latency(us): Guest   ||    Max Latency(us): Host   |
-----------------------------------------------------------------------
| Interval | vanilla | dynamic | static || vanilla | dynamic | static |
-----------------------------------------------------------------------
|   500    |  216835 |   13978 |   1728 ||    2075 |    2114 |   2447 |
-----------------------------------------------------------------------
|   1000   |  199575 |   70651 |   1537 ||    1886 |    1285 |  27104 |
-----------------------------------------------------------------------

These patches are rebased on 6.5.10.
Patches 1-4: Implementation of the core host side feature
Patch 5: A naive throttling mechanism for limiting boosted duration
  for preemption disabled state in the guest.
  This is a placeholder for the throttling mechanism for now and would
  need to be implemented differently.
Patch 6: Enable/disable tunables - global and per-vm
Patches 7-8: Implementation of the core guest side feature

---
Vineeth Pillai (Google) (8):
  kvm: x86: MSR for setting up scheduler info shared memory
  sched/core: sched_setscheduler_pi_nocheck for interrupt context usage
  kvm: x86: vcpu boosting/unboosting framework
  kvm: x86: boost vcpu threads on latency sensitive paths
  kvm: x86: upper bound for preemption based boost duration
  kvm: x86: enable/disable global/per-guest vcpu boost feature
  sched/core: boost/unboost in guest scheduler
  irq: boost/unboost in irq/nmi entry/exit and softirq

 arch/x86/Kconfig                     |  13 +++
 arch/x86/include/asm/kvm_host.h      |  69 ++++++++++++
 arch/x86/include/asm/kvm_para.h      |   7 ++
 arch/x86/include/uapi/asm/kvm_para.h |  43 ++++++++
 arch/x86/kernel/kvm.c                |  16 +++
 arch/x86/kvm/Kconfig                 |  12 +++
 arch/x86/kvm/cpuid.c                 |   2 +
 arch/x86/kvm/i8259.c                 |   2 +-
 arch/x86/kvm/lapic.c                 |   8 +-
 arch/x86/kvm/svm/svm.c               |   2 +-
 arch/x86/kvm/vmx/vmx.c               |   2 +-
 arch/x86/kvm/x86.c                   | 154 +++++++++++++++++++++++++++
 include/linux/kvm_host.h             |  56 ++++++++++
 include/linux/sched.h                |  23 ++++
 include/uapi/linux/kvm.h             |   5 +
 kernel/entry/common.c                |  39 +++++++
 kernel/sched/core.c                  | 127 +++++++++++++++++++++-
 kernel/softirq.c                     |  11 ++
 virt/kvm/kvm_main.c                  | 150 ++++++++++++++++++++++++++
 19 files changed, 730 insertions(+), 11 deletions(-)

--
2.43.0
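
As an illustration of the shared-memory handshake described in the
overview (lazy boost picked up on the next VMEXIT, synchronous unboost),
here is a minimal userspace model. It is a sketch under assumptions and
not the code in this series: the struct layout, the enum values and the
function names are all hypothetical, and the real host would change the
vcpu thread's scheduling priority where this model only flips a status
word. How the guest forces the synchronous exit is not modeled; the
unboost helper just tells the caller that one is needed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values; not the ABI defined by the patches. */
enum boost_request { BOOST_REQ_NONE, BOOST_REQ_BOOST, BOOST_REQ_UNBOOST };
enum boost_status { BOOST_STATUS_NORMAL, BOOST_STATUS_BOOSTED };

/* Hypothetical per-vcpu region shared between guest and host. */
struct vcpu_sched_shared {
	_Atomic uint32_t request;	/* written by the guest */
	_Atomic uint32_t status;	/* written by the host */
};

static void shared_init(struct vcpu_sched_shared *s)
{
	assert(s);
	atomic_init(&s->request, BOOST_REQ_NONE);
	atomic_init(&s->status, BOOST_STATUS_NORMAL);
}

/* Guest: lazily ask for a boost and keep running. */
static void guest_request_boost(struct vcpu_sched_shared *s)
{
	if (atomic_load(&s->status) == BOOST_STATUS_BOOSTED)
		return;		/* already boosted, nothing to do */
	atomic_store(&s->request, BOOST_REQ_BOOST);
}

/*
 * Guest: ask for an unboost. Returns true when the caller must exit to
 * the host right away so that the unboost takes effect synchronously.
 */
static bool guest_request_unboost(struct vcpu_sched_shared *s)
{
	if (atomic_load(&s->status) != BOOST_STATUS_BOOSTED) {
		/* Cancel a boost request the host has not seen yet. */
		atomic_store(&s->request, BOOST_REQ_NONE);
		return false;
	}
	atomic_store(&s->request, BOOST_REQ_UNBOOST);
	return true;
}

/* Host: consume the pending request on VMEXIT, return the new status. */
static uint32_t host_handle_vmexit(struct vcpu_sched_shared *s)
{
	uint32_t req = atomic_exchange(&s->request, BOOST_REQ_NONE);

	if (req == BOOST_REQ_BOOST)
		atomic_store(&s->status, BOOST_STATUS_BOOSTED);
	else if (req == BOOST_REQ_UNBOOST)
		atomic_store(&s->status, BOOST_STATUS_NORMAL);
	return atomic_load(&s->status);
}
```

In this model a boost request costs the guest a single store, and the
host pays for it only on an exit that was going to happen anyway. The
unboost path is the one that forces an exit, which matches the intent
that a guest should not hold a boost longer than it needs.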