This patch series adds memory tracking support for performant
checkpoint/rollback implementations. It can also be used by live
migration to improve predictability.

Introduction

Brendan Cully's Remus project white paper is one of the best papers
written on the subject of fault tolerance using checkpoint/rollback
techniques, and it is the best place to start for general background
(http://www.cs.ubc.ca/~andy/papers/remus-nsdi-final.pdf). It gives a
good outline of the basic requirements and characteristics of a
checkpointed system, including a few of the performance issues. But
Remus did not go far enough in the area of system performance for
commercial production.

This patch series addresses known bottlenecks and limitations of a
checkpointed system: the use of large bitmaps to track dirty memory,
and the lack of multi-threaded harvesting due to mmu_lock being a
spinlock. These modifications, together with further modifications to
QEMU, have allowed us to run checkpoint cycles at rates of up to 2500
per second while still allowing the VM to get useful work done.

The patch series also helps to improve the predictability of live
migration for memory-write-intensive workloads. The QEMU auto-converge
feature helps such workloads by throttling CPUs to slow down memory
writes. However, CPU throttling has an unpredictable effect on the
guest, and it is ineffective for workloads whose memory write speed
does not depend on CPU execution speed. A checkpointing mode, in which
the VM is paused and dirty memory is harvested periodically, helps in
that regard. We have implemented a checkpointing-mode live migration,
which we will put on github in the near future.

Design Goals

The patch series does not change or remove any existing KVM
functionality. It only adds new functions (ioctls) callable into KVM
from user space, and these changes coexist with the current dirty
memory logging facilities. It is possible to run multiple QEMU
instances such that some of the QEMUs perform live migration using the
existing memory logging mechanism while others migrate or run in
fault-tolerant mode using the new memory tracking functions.

Dynamic memory allocation and freeing are avoided during checkpoint
cycles in order to avoid surprises during performance-critical
operations. Allocations and frees are done only when a VM enters or
exits checkpoint mode (see the allocation sketch at the end of this
letter). Once checkpoint mode is entered, a VM will typically run in
this mode forever, where "forever" means until a fault occurs that
leads to failover to the standby host, the VM is shut down, or a
system administrator no longer wants to run in FT mode.

Modifications

All modifications affect only the KVM instance where the primary
(active) VM is running. They are not in play on the standby (passive)
host, where a VM is created that matches the primary in its
configuration but does not execute until a migration/failover event
occurs.

Patches 1-3: New memory tracking ioctls and data structures that use a
             dense list of guest frame numbers instead of a bitmap
             (sketched below).
Patch 4:     Implement a dirty-page threshold which, when exceeded,
             forces vcpus to exit (sketched below).
Patch 5:     Change mmu_lock to an rwlock_t to allow multiple threads
             to harvest and process dirty memory (sketched below).
Patch 6:     Add documentation for the new ioctls.
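
As a rough illustration of the dense-list approach from patches 1-3, a
user-space harvest loop might look like the sketch below. The ioctl
name, structure layout, and transmit helper are assumptions made for
illustration, not the uAPI defined by the patches; the point is that
one call returns only the GFNs that are actually dirty, so the cost of
a cycle scales with the number of dirty pages rather than with guest
memory size.

  #include <stdint.h>
  #include <sys/ioctl.h>

  struct gfn_list {
          uint64_t count;      /* entries harvested this cycle */
          uint64_t max_gfns;   /* capacity, fixed at checkpoint entry */
          uint64_t gfns[];     /* dense list of dirty guest frame numbers */
  };

  /* Placeholder ioctl number and transmit helper -- illustrative only. */
  #define KVM_MT_FETCH_DIRTY_LIST _IOWR('k', 0xe0, struct gfn_list)
  void send_page_to_standby(uint64_t gfn);

  static int harvest_dirty(int vm_fd, struct gfn_list *list)
  {
          /* Copy out and reset the dirty list in one call; there is no
           * bitmap scan proportional to guest memory size. */
          if (ioctl(vm_fd, KVM_MT_FETCH_DIRTY_LIST, list) < 0)
                  return -1;
          for (uint64_t i = 0; i < list->count; i++)
                  send_page_to_standby(list->gfns[i]);
          return 0;
  }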
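
The dirty-page threshold of patch 4 can be pictured as a counter
checked on the page-dirtying path. The field names and request bit
below are invented for the sketch, not taken from the series;
kvm_make_all_cpus_request() is the existing KVM primitive for kicking
vcpus.

  /* Kernel-side sketch: once the pages dirtied in the current cycle
   * cross the configured threshold, kick every vcpu out to user space
   * so the checkpoint thread can start a cycle before the dirty set
   * grows too large. */
  static void mt_account_dirty_page(struct kvm *kvm)
  {
          if (atomic64_inc_return(&kvm->mt_dirty_count) >=
              kvm->mt_dirty_threshold)
                  kvm_make_all_cpus_request(kvm, KVM_REQ_MT_EXIT);
  }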
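
The intent of the mmu_lock conversion in patch 5 is, roughly, the
pattern below: harvest threads take the lock shared so several can run
in parallel (assuming they update individual SPTEs atomically), while
paths that restructure the page tables take it exclusive, preserving
the old spinlock semantics. Function and field names are illustrative.

  void mt_harvest_worker(struct kvm *kvm)
  {
          read_lock(&kvm->mmu_lock);   /* shared: N harvesters at once */
          /* ... collect and clear dirty state for a sublist of GFNs ... */
          read_unlock(&kvm->mmu_lock);
  }

  void mmu_restructure(struct kvm *kvm)
  {
          write_lock(&kvm->mmu_lock);  /* exclusive, like the old spinlock */
          /* ... add/remove shadow pages, zap SPTEs ... */
          write_unlock(&kvm->mmu_lock);
  }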
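
Finally, the no-allocation rule from the design goals amounts to
sizing every per-cycle buffer once, at the transition into checkpoint
mode. A minimal sketch, with invented struct and field names:

  /* All memory a checkpoint cycle needs is allocated when the VM
   * enters checkpoint mode and freed only when it leaves; the cycles
   * themselves never call an allocator. */
  static int kvm_mt_enter(struct kvm *kvm, u64 max_dirty_gfns)
  {
          kvm->mt_gfn_list = vzalloc(max_dirty_gfns * sizeof(u64));
          if (!kvm->mt_gfn_list)
                  return -ENOMEM;
          kvm->mt_max_dirty_gfns = max_dirty_gfns;
          return 0;
  }

  static void kvm_mt_exit(struct kvm *kvm)
  {
          vfree(kvm->mt_gfn_list);
          kvm->mt_gfn_list = NULL;
  }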

 Documentation/virtual/kvm/api.txt |  170 +++++
 arch/x86/include/asm/kvm_host.h   |    6 +
 arch/x86/kvm/mmu.c                |  195 ++--
 arch/x86/kvm/page_track.c         |    8 +-
 arch/x86/kvm/paging_tmpl.h        |   10 +-
 arch/x86/kvm/x86.c                |   15 +-
 include/linux/kvm_host.h          |   56 +-
 include/uapi/linux/kvm.h          |   95 +++
 virt/kvm/kvm_main.c               | 1011 ++++++++++++++++++++++++++-
 9 files changed, 1485 insertions(+), 81 deletions(-)