This patch series adds memory tracking support for performant
checkpoint/rollback implementations. It can also be used by live
migration to improve predictability.

Introduction

Brendan Cully's Remus project white paper is one of the best papers
written on the subject of fault tolerance using checkpoint/rollback
techniques, and it is the best place to start for general background
(http://www.cs.ubc.ca/~andy/papers/remus-nsdi-final.pdf). It gives a
good outline of the basic requirements and characteristics of a
checkpointed system, including a few of the performance issues. But
Remus did not go far enough in the area of system performance for
commercial production.

This patch series addresses known bottlenecks and limitations of a
checkpointed system: the use of large bitmaps to track dirty memory,
and the lack of multi-threaded harvesting due to mmu_lock being a
spinlock. These modifications, together with further modifications to
QEMU, have allowed us to run checkpoint cycles at rates of up to 2500
per second while still allowing the VM to get useful work done.

The patch series also helps to improve the predictability of live
migration for memory-write-intensive workloads. The QEMU auto-converge
feature helps such workloads by throttling CPUs to slow down memory
writes. However, CPU throttling has an unpredictable effect on the
guest, and it is ineffective for workloads whose memory write speed
does not depend on CPU execution speed. A checkpointing mode, in which
the VM is paused and dirty memory is harvested periodically, helps in
that regard. We have implemented a checkpointing-mode live migration,
which we will put on github in the near future.

Design Goals

The patch series does not change or remove any existing KVM
functionality. It only adds new functions (ioctls) callable into KVM
from user space, and these changes coexist with the current dirty
memory logging facilities. It is possible to run multiple QEMU
instances such that some of the QEMUs perform live migration using the
existing memory logging mechanism while others migrate or run in
fault-tolerant mode using the new memory tracking functions.

Dynamic memory allocation and freeing are avoided during checkpoint
cycles in order to avoid surprises during performance-critical
operations. Allocations and frees are done only when a VM enters or
exits checkpoint mode (see the allocation sketch at the end of this
letter). Once checkpoint mode is entered, a VM will typically run in
this mode forever, where "forever" means until a fault occurs that
leads to failover to the standby host, the VM is shut down, or a
system administrator no longer wants to run in FT mode.

Modifications

All modifications affect only the KVM instance where the primary
(active) VM is running. They are not in play on the standby (passive)
host, where a VM is created that matches the primary in its
configuration but does not execute until a migration/failover event
occurs.

Patches 1-3: New memory tracking ioctls and data structures that use a
             dense list of guest frame numbers instead of a bitmap
             (sketched below).
Patch 4:     Implement a dirty-page threshold which, when exceeded,
             forces vcpus to exit (sketched below).
Patch 5:     Change mmu_lock to an rwlock_t to allow multiple threads
             to harvest and process dirty memory (sketched below).
Patch 6:     Add documentation for the new ioctls.
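
As a rough illustration of the dense-list approach from patches 1-3, a
user-space harvest loop might look like the sketch below. The ioctl
name, structure layout, and transmit helper are assumptions made for
illustration, not the uAPI defined by the patches; the point is that
one call returns only the GFNs that are actually dirty, so the cost of
a cycle scales with the number of dirty pages rather than with guest
memory size.

  #include <stdint.h>
  #include <sys/ioctl.h>

  struct gfn_list {
          uint64_t count;      /* entries harvested this cycle */
          uint64_t max_gfns;   /* capacity, fixed at checkpoint entry */
          uint64_t gfns[];     /* dense list of dirty guest frame numbers */
  };

  /* Placeholder ioctl number and transmit helper -- illustrative only. */
  #define KVM_MT_FETCH_DIRTY_LIST _IOWR('k', 0xe0, struct gfn_list)
  void send_page_to_standby(uint64_t gfn);

  static int harvest_dirty(int vm_fd, struct gfn_list *list)
  {
          /* Copy out and reset the dirty list in one call; there is no
           * bitmap scan proportional to guest memory size. */
          if (ioctl(vm_fd, KVM_MT_FETCH_DIRTY_LIST, list) < 0)
                  return -1;
          for (uint64_t i = 0; i < list->count; i++)
                  send_page_to_standby(list->gfns[i]);
          return 0;
  }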
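
The dirty-page threshold of patch 4 can be pictured as a counter
checked on the page-dirtying path. The field names and request bit
below are invented for the sketch, not taken from the series;
kvm_make_all_cpus_request() is the existing KVM primitive for kicking
vcpus.

  /* Kernel-side sketch: once the pages dirtied in the current cycle
   * cross the configured threshold, kick every vcpu out to user space
   * so the checkpoint thread can start a cycle before the dirty set
   * grows too large. */
  static void mt_account_dirty_page(struct kvm *kvm)
  {
          if (atomic64_inc_return(&kvm->mt_dirty_count) >=
              kvm->mt_dirty_threshold)
                  kvm_make_all_cpus_request(kvm, KVM_REQ_MT_EXIT);
  }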
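
The intent of the mmu_lock conversion in patch 5 is, roughly, the
pattern below: harvest threads take the lock shared so several can run
in parallel (assuming they update individual SPTEs atomically), while
paths that restructure the page tables take it exclusive, preserving
the old spinlock semantics. Function and field names are illustrative.

  void mt_harvest_worker(struct kvm *kvm)
  {
          read_lock(&kvm->mmu_lock);   /* shared: N harvesters at once */
          /* ... collect and clear dirty state for a sublist of GFNs ... */
          read_unlock(&kvm->mmu_lock);
  }

  void mmu_restructure(struct kvm *kvm)
  {
          write_lock(&kvm->mmu_lock);  /* exclusive, like the old spinlock */
          /* ... add/remove shadow pages, zap SPTEs ... */
          write_unlock(&kvm->mmu_lock);
  }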
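
Finally, the no-allocation rule from the design goals amounts to
sizing every per-cycle buffer once, at the transition into checkpoint
mode. A minimal sketch, with invented struct and field names:

  /* All memory a checkpoint cycle needs is allocated when the VM
   * enters checkpoint mode and freed only when it leaves; the cycles
   * themselves never call an allocator. */
  static int kvm_mt_enter(struct kvm *kvm, u64 max_dirty_gfns)
  {
          kvm->mt_gfn_list = vzalloc(max_dirty_gfns * sizeof(u64));
          if (!kvm->mt_gfn_list)
                  return -ENOMEM;
          kvm->mt_max_dirty_gfns = max_dirty_gfns;
          return 0;
  }

  static void kvm_mt_exit(struct kvm *kvm)
  {
          vfree(kvm->mt_gfn_list);
          kvm->mt_gfn_list = NULL;
  }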

 Documentation/virtual/kvm/api.txt |  170 +++++
 arch/x86/include/asm/kvm_host.h   |    6 +
 arch/x86/kvm/mmu.c                |  195 ++--
 arch/x86/kvm/page_track.c         |    8 +-
 arch/x86/kvm/paging_tmpl.h        |   10 +-
 arch/x86/kvm/x86.c                |   15 +-
 include/linux/kvm_host.h          |   56 +-
 include/uapi/linux/kvm.h          |   95 +++
 virt/kvm/kvm_main.c               | 1011 ++++++++++++++++++++++++++-
 9 files changed, 1485 insertions(+), 81 deletions(-)