Hi all,

We have been implementing a prototype of Kemari for KVM, and we're sending this message to share what we have now and our TODO lists. We would like to get early feedback to keep us going in the right direction. Although the advanced approaches in the TODO lists are fascinating, we would like to run this project step by step while absorbing comments from the community.

The current code is based on qemu-kvm.git 2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the following RFC which we posted last year.

http://www.mail-archive.com/kvm@xxxxxxxxxxxxxxx/msg25022.html

The transmission/transaction protocol and most of the control logic are implemented in QEMU. However, we needed a hack in KVM to prevent the rip from proceeding before the VMs are synchronized. Some plumbing on the kernel side may also be needed, for example to guarantee the replayability of certain events and instructions, to integrate the RAS capabilities of newer x86 hardware with the HA stack, and for optimization purposes.

Before going into details, we would like to show how Kemari looks. We have prepared a demonstration video at the following location. For those who are not interested in the code, please take a look.

The demonstration scenario is:

1. Play with a guest VM that has virtio-blk and virtio-net.
   # The guest image should be on NFS/SAN storage.
2. Start Kemari to synchronize the VM by running the following command in QEMU. Just add the "-k" option to the usual migrate command.

   migrate -d -k tcp:192.168.0.20:4444

3. Check the status by calling "info migrate".
4. Go back to the VM and play the chess animation.
5. Kill the VM. (The VNC client also disappears.)
6. Press "c" to continue the VM on the other host.
7. Bring up the VNC client. (Sorry, it pops up outside of the video capture.)
8. Confirm that the chess animation ends and the browser works fine, then shut down.

http://www.osrg.net/kemari/download/kemari-kvm-fc11.mov

The repository contains all the patches we're sending with this message. For those who want to try it, pull the following repository. When running configure, please add --enable-ft-mode. You also need to apply the patch attached at the end of this message to your KVM.

git://kemari.git.sourceforge.net/gitroot/kemari/kemari

In addition to the usual migration environment and command, add "-k" to run Kemari.

The patch set consists of the following components.

- bit-based dirty bitmap (I have posted v4 for upstream QEMU on April 20)
- writev() support for QEMUFile and FdMigrationState
- FT transaction sender/receiver
- event tap that triggers FT transactions (a rough sketch of the idea follows the diffstat below)
- virtio-blk and virtio-net support for the event tap

 Makefile.objs    |    1 +
 buffered_file.c  |    2 +-
 configure        |    8 +
 cpu-all.h        |  134 ++++++++++++++++-
 cutils.c         |   12 ++
 exec.c           |  127 +++++++++++++----
 ft_transaction.c |  423 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ft_transaction.h |   57 ++++++++
 hw/hw.h          |   25 ++++
 hw/virtio-blk.c  |    2 +
 hw/virtio-net.c  |    2 +
 migration-exec.c |    2 +-
 migration-fd.c   |    2 +-
 migration-tcp.c  |   58 +++++++-
 migration-unix.c |    2 +-
 migration.c      |  146 ++++++++++++++++++-
 migration.h      |    8 +
 osdep.c          |   13 ++
 qemu-char.c      |   25 +++-
 qemu-common.h    |   21 +++
 qemu-kvm.c       |   26 ++--
 qemu-monitor.hx  |    7 +-
 qemu_socket.h    |    4 +
 savevm.c         |  264 ++++++++++++++++++++++++++++++----
 sysemu.h         |    3 +-
 vl.c             |  221 +++++++++++++++++++++++++---
 26 files changed, 1474 insertions(+), 121 deletions(-)
 create mode 100644 ft_transaction.c
 create mode 100644 ft_transaction.h

The rest of this message describes the TODO lists, grouped by topic.
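Before going into the TODO lists, here is a rough, self-contained sketch of the event-tap idea referenced above. This is not the actual Kemari code: ft_tap_event(), ft_transaction_begin() and ft_transaction_commit() are hypothetical names, and the stubs only print what the real code would do. The point is the ordering the event tap enforces: the primary synchronizes with the secondary before an outgoing I/O event becomes visible outside the VM.

  /* Hypothetical sketch of the event-tap ordering; not Kemari code. */
  #include <stdbool.h>
  #include <stdio.h>

  typedef void (*io_handler_t)(void *opaque);

  /* Stand-ins for the real synchronization path. */
  static bool ft_enabled = true;
  static void ft_transaction_begin(void)  { puts("send dirty pages + device states"); }
  static void ft_transaction_commit(void) { puts("wait for ack from the secondary"); }

  /* The tap: called in place of the device's own notify handler. */
  static void ft_tap_event(io_handler_t handler, void *opaque)
  {
      if (ft_enabled) {
          ft_transaction_begin();   /* synchronize the VM state first...  */
          ft_transaction_commit();  /* ...so the secondary can replay it  */
      }
      handler(opaque);              /* only now let the I/O proceed       */
  }

  /* Example "device" event: an outgoing virtio-net packet. */
  static void virtio_net_tx(void *opaque)
  {
      (void)opaque;
      puts("packet leaves the primary");
  }

  int main(void)
  {
      ft_tap_event(virtio_net_tx, NULL);
      return 0;
  }

Because the tap sits between the guest's request and the host-side handler, outgoing I/O only happens after the secondary holds a state from which it can replay that same I/O, which is why the idempotence assumption in the event-tapping section below matters.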
=== event tapping ===

Event tapping is the core component of Kemari; it decides on which events the primary should synchronize with the secondary. The basic assumption here is that outgoing I/O operations are idempotent, which is usually true for disk I/O and reliable network protocols such as TCP.

As discussed in the following thread, we may need to reconsider how and when to start VM synchronization.

http://www.mail-archive.com/kvm@xxxxxxxxxxxxxxx/msg31908.html

We would like to get as much feedback as possible on the current implementation before moving on to the next approach.

TODO:
 - virtio polling
 - support for asynchronous I/O methods (eventfd)

=== sender / receiver ===

To synchronize the virtual machines, all the pages dirtied since the last synchronization point, together with the states of the VCPU and the virtual devices, are sent to the fallback node from the user-space QEMU process. (A rough sketch of one such transaction follows these TODO lists.)

TODO:
 - Asynchronous VM transfer / pipelining (needed for SMP)
 - Zero-copy VM transfer
 - VM transfer w/ RDMA

=== storage ===

Although Kemari needs some kind of shared storage, many users don't like that requirement, and they expect to use Kemari in conjunction with software storage replication.

TODO:
 - Integration with non-shared-disk cluster storage solutions such as DRBD (might need changes to guarantee storage data consistency at Kemari synchronization points).
 - Integration with QEMU's block live migration functionality for non-shared disk configurations.

=== integration with HA stack (Pacemaker/Corosync) ===

The failover process kicks in whenever a failure in the primary node is detected. For Kemari for Xen, we have already finished the RA for Heartbeat, and we are planning to integrate Kemari for KVM with the new HA stacks (Pacemaker, RHCS, etc.).

Ideally, we would like to leverage the hardware failure detection capabilities of newer x86 hardware to trigger failover, the idea being that proactively transferring control to the fallback node when a problem is detected is much faster than relying on the polling mechanisms used by most HA software.

TODO:
 - RA for Pacemaker.
 - Consider both HW failure and SW failure scenarios (failover between Kemari clusters).
 - Make the necessary changes to Pacemaker/Corosync to support event-driven (HW failure, etc.) failover.
 - Take advantage of the RAS capabilities of newer CPUs/motherboards, such as MCE, to trigger failover.
 - Detect failures in I/O devices (block I/O errors, etc.).

=== clock ===

Since synchronizing the virtual machines every time the TSC is accessed would be prohibitively expensive, the transmission of the TSC will be done lazily, i.e. delayed until a non-TSC synchronization point arrives.

TODO:
 - Synchronization of clock sources (need to intercept TSC reads, etc.).

=== usability ===

These items define how users interact with Kemari.

TODO:
 - A Kemarid daemon that takes care of the cluster management/monitoring side of things.
 - Some device emulators might need minor modifications to work well with Kemari. Use whitelisting/blacklisting to take the burden of choosing the right device model off the users.

=== optimizations ===

Although the big picture can be realized by completing the TODO lists above, we need some optimizations/enhancements to make Kemari useful in the real world. These are the items that need to be done for that.

TODO:
 - SMP (for the sake of performance, we might need to implement a synchronization protocol that can keep two or more synchronization points active at any given moment)
 - VGA (leverage VNC's sub-tiling mechanism to identify framebuffer pages that are really dirty).
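As mentioned in the sender/receiver section above, here is a minimal, self-contained sketch of what one FT transaction on the primary might look like. It is a toy model, not Kemari's actual code: put_be64(), put_buffer(), put_device_state() and wait_for_ack() are hypothetical stand-ins for the stream that the real code writes through QEMUFile, and the one-byte bitmap stands in for the bit-based dirty bitmap from this patch set.

  /* Hypothetical sketch of one FT transaction; not Kemari code. */
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define PAGE_SIZE 4096
  #define NR_PAGES  4            /* toy guest "RAM" of four pages */

  static uint8_t guest_ram[NR_PAGES][PAGE_SIZE];
  static uint8_t dirty_bitmap;   /* one bit per page, toy-sized   */

  /* Stand-ins for the QEMUFile-based stream of the real code.    */
  static void put_be64(uint64_t v)
  {
      printf("header: page index %llu\n", (unsigned long long)v);
  }
  static void put_buffer(const uint8_t *buf, size_t len)
  {
      (void)buf;  /* a real implementation would write these bytes */
      printf("payload: %zu bytes\n", len);
  }
  static void put_device_state(void) { puts("VCPU + virtual device states"); }
  static void wait_for_ack(void)     { puts("secondary acked; resume the VM"); }

  /* Send everything dirtied since the last synchronization point. */
  static void ft_transaction_send(void)
  {
      for (unsigned i = 0; i < NR_PAGES; i++) {
          if (dirty_bitmap & (1u << i)) {
              put_be64(i);                          /* which page    */
              put_buffer(guest_ram[i], PAGE_SIZE);  /* its contents  */
              dirty_bitmap &= ~(1u << i);           /* clean the bit */
          }
      }
      put_device_state();  /* VCPU registers and virtual device states   */
      wait_for_ack();      /* the transaction commits on acknowledgement */
  }

  int main(void)
  {
      memset(guest_ram[2], 0xaa, PAGE_SIZE);  /* guest "touched" page 2 */
      dirty_bitmap |= 1u << 2;
      ft_transaction_send();
      return 0;
  }

The design point the sketch tries to show is that the VM only resumes once the secondary has acknowledged the whole state; making that wait overlap with further execution is exactly the asynchronous transfer / pipelining item in the TODO list above.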
Any comments/suggestions would be greatly appreciated.

Thanks,

Yoshi

--

Kemari starts synchronizing VMs when QEMU handles I/O requests. Without this patch, the VCPU state has already proceeded past the I/O instruction before synchronization, and after failover the VM on the receiver hangs because of this.

Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@xxxxxxxxxxxxx>
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/svm.c              |   11 ++++++++---
 arch/x86/kvm/vmx.c              |   11 ++++++++---
 arch/x86/kvm/x86.c              |    4 ++++
 4 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 26c629a..7b8f514 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -227,6 +227,7 @@ struct kvm_pio_request {
 	int in;
 	int port;
 	int size;
+	bool lazy_skip;
 };

 /*
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d04c7ad..e373245 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
 {
 	struct kvm_vcpu *vcpu = &svm->vcpu;
 	u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
-	int size, in, string;
+	int size, in, string, ret;
 	unsigned port;

 	++svm->vcpu.stat.io_exits;
@@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm *svm)
 	port = io_info >> 16;
 	size = (io_info & SVM_IOIO_SIZE_MASK) >> SVM_IOIO_SIZE_SHIFT;
 	svm->next_rip = svm->vmcb->control.exit_info_2;
-	skip_emulated_instruction(&svm->vcpu);
-	return kvm_fast_pio_out(vcpu, size, port);
+	ret = kvm_fast_pio_out(vcpu, size, port);
+	if (ret)
+		skip_emulated_instruction(&svm->vcpu);
+	else
+		vcpu->arch.pio.lazy_skip = true;
+
+	return ret;
 }

 static int nmi_interception(struct vcpu_svm *svm)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 41e63bb..09052d6 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu *vcpu)
 static int handle_io(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification;
-	int size, in, string;
+	int size, in, string, ret;
 	unsigned port;

 	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
 	port = exit_qualification >> 16;
 	size = (exit_qualification & 7) + 1;

-	skip_emulated_instruction(vcpu);
-	return kvm_fast_pio_out(vcpu, size, port);
+	ret = kvm_fast_pio_out(vcpu, size, port);
+	if (ret)
+		skip_emulated_instruction(vcpu);
+	else
+		vcpu->arch.pio.lazy_skip = true;
+
+	return ret;
 }

 static void
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fd5c3d3..cc308d2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	if (!irqchip_in_kernel(vcpu->kvm))
 		kvm_set_cr8(vcpu, kvm_run->cr8);

+	if (vcpu->arch.pio.lazy_skip)
+		kvm_x86_ops->skip_emulated_instruction(vcpu);
+	vcpu->arch.pio.lazy_skip = false;
+
 	if (vcpu->arch.pio.count || vcpu->mmio_needed ||
 	    vcpu->arch.emulate_ctxt.restart) {
 		if (vcpu->mmio_needed) {
--
1.7.0.31.g1df487