This patchset introduces dirty-quota-based throttling, a new mechanism to throttle the rate at which memory pages can be dirtied. It works by setting a limit (the dirty quota) on the number of bytes each vCPU is allowed to dirty until it is granted additional quota.

The mechanism is exposed to userspace through a new KVM capability, KVM_CAP_DIRTY_QUOTA. When this capability is enabled, a vCPU exits to userspace (with exit reason KVM_EXIT_DIRTY_QUOTA_EXHAUSTED) as soon as its dirty quota is exhausted, i.e. as soon as it has dirtied as many bytes as the limit set for it. On such an exit, userspace can increase the vCPU's dirty quota (after optionally sleeping for an appropriate period of time) so that the vCPU can continue dirtying memory.
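For illustration, the expected userspace flow looks roughly like the sketch below. This is a sketch, not the authoritative interface: the run->dirty_quota field, the VM-scoped enable, and the stall policy are assumptions made for illustration; the actual interface is described in the Documentation/virt/kvm/api.rst change included in this series.

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Enable dirty-quota-based throttling (VM-scoped enable assumed here). */
static int enable_dirty_quota(int vm_fd)
{
        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_DIRTY_QUOTA,
        };

        return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

/*
 * Run one vCPU and top up its quota whenever it is exhausted.
 * run->dirty_quota is a placeholder for however the series publishes
 * the per-vCPU quota to KVM; see the api.rst change for the actual
 * interface.
 */
static void vcpu_run_loop(int vcpu_fd, struct kvm_run *run,
                          unsigned long quota_bytes, unsigned int stall_us)
{
        run->dirty_quota = quota_bytes;                 /* placeholder field */

        for (;;) {
                if (ioctl(vcpu_fd, KVM_RUN, 0) < 0)
                        break;

                if (run->exit_reason == KVM_EXIT_DIRTY_QUOTA_EXHAUSTED) {
                        /* Optionally stall the vCPU before granting more quota. */
                        if (stall_us)
                                usleep(stall_us);
                        run->dirty_quota += quota_bytes;  /* placeholder field */
                        continue;
                }

                /* Handle all other exit reasons as usual. */
        }
}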
Dirty-quota-based throttling is a very effective choice for live migration, for the following reasons:

1. With dirty-quota-based throttling, we can precisely set the amount of memory we can afford to dirty for the migration to converge within reasonable time. This is much more effective than the current state-of-the-art auto-converge mechanism, which implements time-based throttling (making vCPUs sleep for some time to throttle dirtying): some workloads can dirty a huge amount of memory even when their vCPUs are given only a very small interval to run, causing migrations to take longer and possibly fail to converge.

2. While the current auto-converge mechanism makes the whole VM sleep to throttle memory dirtying, dirty-quota-based throttling can throttle vCPUs selectively (i.e. only vCPUs dirtying more than a threshold are made to sleep). Furthermore, if the dirty quota is computed and enforced over very small intervals, we can achieve micro-stunning (i.e. stunning vCPUs precisely when they are dirtying memory). Together, these ensure that only the vCPUs that are dirtying memory are throttled, and only while they are dirtying it. Hence, while the current auto-converge scheme is prone to throttling reads and writes equally, dirty-quota-based throttling has minimal impact on read performance.

3. Dirty-quota-based throttling can adapt quickly to changes in network bandwidth if it is enforced over very small intervals, since the currently available network bandwidth can be taken into account when computing the dirty quota for the next interval.

The benefits of dirty-quota-based throttling are not limited to live migration. The dirty-quota mechanism can also be leveraged by other use cases that would benefit from effective throttling of memory writes. The update_dirty_quota hook in the implementation can be used outside the context of live migration, but note that such alternative uses must also write-protect the memory.

We have evaluated dirty-quota-based throttling using two key metrics:

A. Live migration performance (time to migrate)
B. Guest performance during live migration

We have used a synthetic workload that dirties memory sequentially in a loop. It is characterised by three variables m, n and l: an instance (m=x, n=y, l=z) dirties x GB of memory with y threads at a rate of z GBps. In the tables below, b is the network bandwidth configured for the live migration, t_curr is the total time to migrate with the current throttling logic, and t_dq is the total time to migrate with dirty-quota-based throttling.

A. Live migration performance:

+--------+----+----------+----------+---------------+----------+----------+
| m (GB) | n  | l (GBps) | b (MBps) | t_curr (s)    | t_dq (s) | Diff (%) |
+--------+----+----------+----------+---------------+----------+----------+
|      8 |  2 |     8.00 |      640 |         60.38 |    15.22 |     74.8 |
|     16 |  4 |     1.26 |      640 |         75.99 |    32.22 |     57.6 |
|     32 |  6 |     0.10 |      640 |         49.81 |    49.80 |      0.0 |
|     48 |  8 |     2.20 |      640 |        287.78 |   115.65 |     59.8 |
|     32 |  6 |    32.00 |      640 |        364.30 |    84.26 |     76.9 |
|      8 |  2 |     8.00 |      128 |        452.91 |    94.99 |     79.0 |
|    512 | 32 |     0.10 |      640 |        868.94 |   841.92 |      3.1 |
|     16 |  4 |     1.26 |       64 |       1538.94 |   426.21 |     72.3 |
|     32 |  6 |     1.80 |     1024 |       1406.80 |   452.82 |     67.8 |
|    512 | 32 |     7.20 |      640 |       4561.30 |   906.60 |     80.1 |
|    128 | 16 |     3.50 |      128 |       7009.98 |  1689.61 |     75.9 |
|     16 |  4 |    16.00 |       64 | "Unconverged" |   461.47 |      N/A |
|     32 |  6 |    32.00 |      128 | "Unconverged" |   454.27 |      N/A |
|    512 | 32 |   512.00 |      640 | "Unconverged" |   917.37 |      N/A |
|    128 | 16 |   128.00 |      128 | "Unconverged" |  1946.00 |      N/A |
+--------+----+----------+----------+---------------+----------+----------+

B. Guest performance:

+=====================+===================+===================+==========+
| Case                | Guest Runtime (%) | Guest Runtime (%) | Diff (%) |
|                     | (Current)         | (Dirty Quota)     |          |
+=====================+===================+===================+==========+
| Write-intensive     | 26.4              | 35.3              | 33.7     |
+---------------------+-------------------+-------------------+----------+
| Read-write-balanced | 40.6              | 70.8              | 74.4     |
+---------------------+-------------------+-------------------+----------+
| Read-intensive      | 63.1              | 81.8              | 29.6     |
+---------------------+-------------------+-------------------+----------+

Guest Runtime (%) in the above table is the percentage of time a guest vCPU is actually running, averaged across all vCPUs of the guest. For B, we ran variants of the aforementioned synthetic workload that dirty memory sequentially in a loop on some threads and only read memory sequentially on the remaining threads. We have also conducted similar experiments with more realistic benchmarks/workloads, e.g. redis, and obtained similar results.

Dirty-quota-based throttling was presented at KVM Forum 2021. Please find the details here:
https://kvmforum2021.sched.com/event/ke4A/dirty-quota-based-vm-live-migration-auto-converge-manish-mishra-shivam-kumar-nutanix-india

The current v10 patchset includes the following changes over v9:

1. Use vma_pagesize as the dirty granularity for updating the dirty quota on arm64.

2. Do not update the dirty quota for instances where the hypervisor is writing into guest memory. Charging these writes to the vCPUs' dirty quota is unfair to the vCPUs. Also, some of these instances, such as record_steal_time, frequently try to redundantly mark the same set of pages dirty again and again. To avoid these distortions, we had previously relied on checking the dirty bitmap to avoid redundantly updating quotas. Since dirty-quota-based throttling is now decoupled from the live-migration dirty-tracking path, we resolve this issue by simply not accounting these hypervisor-induced writes against the vCPUs' dirty quota. Through extensive experiments, we have verified that this approach is approximately as effective as the previous approach of checking the dirty bitmap.
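To make change (2) concrete, the quota accounting can conceptually be reduced to the sketch below. This is an illustration of the approach, not the actual patch; bytes_dirtied, dirty_quota_bytes and KVM_REQ_DIRTY_QUOTA_EXIT are placeholder names.

/*
 * Simplified, illustrative sketch of the accounting described in (2);
 * field and request names below are placeholders, not the real patch.
 */
static void update_dirty_quota(struct kvm *kvm, unsigned long page_size)
{
        struct kvm_vcpu *vcpu = kvm_get_running_vcpu();

        /*
         * Writes that do not come from a vCPU of this VM (e.g. hypervisor
         * writes such as record_steal_time) are not charged to any vCPU's
         * dirty quota.
         */
        if (!vcpu || vcpu->kvm != kvm)
                return;

        vcpu->bytes_dirtied += page_size;
        if (vcpu->bytes_dirtied >= vcpu->dirty_quota_bytes)
                kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
}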
v1: https://lore.kernel.org/kvm/20211114145721.209219-1-shivam.kumar1@xxxxxxxxxxx/
v2: https://lore.kernel.org/kvm/Ydx2EW6U3fpJoJF0@xxxxxxxxxx/T/
v3: https://lore.kernel.org/kvm/YkT1kzWidaRFdQQh@xxxxxxxxxx/T/
v4: https://lore.kernel.org/all/20220521202937.184189-1-shivam.kumar1@xxxxxxxxxxx/
v5: https://lore.kernel.org/all/202209130532.2BJwW65L-lkp@xxxxxxxxx/T/
v6: https://lore.kernel.org/all/20220915101049.187325-1-shivam.kumar1@xxxxxxxxxxx/
v7: https://lore.kernel.org/all/a64d9818-c68d-1e33-5783-414e9a9bdbd1@xxxxxxxxxxx/t/
v8: https://lore.kernel.org/all/20230225204758.17726-1-shivam.kumar1@xxxxxxxxxxx/
v9: https://lore.kernel.org/kvm/20230504144328.139462-1-shivam.kumar1@xxxxxxxxxxx/

Thanks,
Shivam

Shivam Kumar (3):
  KVM: Implement dirty quota-based throttling of vcpus
  KVM: x86: Dirty quota-based throttling of vcpus
  KVM: arm64: Dirty quota-based throttling of vcpus

 Documentation/virt/kvm/api.rst | 17 +++++++++++++++++
 arch/arm64/kvm/Kconfig         |  1 +
 arch/arm64/kvm/arm.c           |  5 +++++
 arch/arm64/kvm/mmu.c           |  1 +
 arch/x86/kvm/Kconfig           |  1 +
 arch/x86/kvm/mmu/mmu.c         |  6 +++++-
 arch/x86/kvm/mmu/spte.c        |  1 +
 arch/x86/kvm/vmx/vmx.c         |  3 +++
 arch/x86/kvm/x86.c             |  6 +++++-
 include/linux/kvm_host.h       |  9 +++++++++
 include/uapi/linux/kvm.h       |  8 ++++++++
 tools/include/uapi/linux/kvm.h |  1 +
 virt/kvm/Kconfig               |  3 +++
 virt/kvm/kvm_main.c            | 27 +++++++++++++++++++++++++++
 14 files changed, 87 insertions(+), 2 deletions(-)

--
2.22.3