Hi all,

It has been a while coming, but we have finally started work on Kemari's port to KVM. For those not familiar with it, Kemari provides the basic building block to create a virtualization-based fault tolerant machine: a virtual machine synchronization mechanism.

Traditional high availability solutions can be classified in two groups: fault tolerant servers, and software clustering.

Broadly speaking, fault tolerant servers protect us against hardware failures and, generally, rely on redundant hardware (often proprietary), and hardware failure detection to trigger fail-over.

On the other hand, software clustering, as its name indicates, takes care of software failures and usually requires a standby server whose software configuration for the part we are trying to make fault tolerant must be identical to that of the active server.

Both solutions may be applied to virtualized environments. Indeed, the current incarnation of Kemari (Xen-based) brings fault tolerant server-like capabilities to virtual machines and integration with existing HA stacks (Heartbeat, RHCS, etc) is under consideration.

After some time on the drawing board we completed the basic design of Kemari for KVM, so we are sending an RFC at this point to get early feedback and, hopefully, get things right from the start. Those already familiar with Kemari and/or fault tolerance may want to skip the "Background" and go directly to the design and implementation bits.

This is a pretty long write-up, but please bear with me.

== Background ==

We started to play around with continuous virtual machine synchronization technology about 3 years ago. As development progressed and, most importantly, we got the first Xen-based working prototypes it became clear that we needed a proper name for our toy: Kemari.

The goal of Kemari is to provide a fault tolerant platform for virtualization environments, so that in the event of a hardware failure the virtual machine fails over from compromised to properly operating hardware (a physical machine) in a way that is completely transparent to the guest operating system.

Although hardware based fault tolerant servers and HA servers (software clustering) have been around for a (long) while, they typically require specifically designed hardware and/or modifications to applications. In contrast, by abstracting hardware using virtualization, Kemari can be used on off-the-shelf hardware and no application modifications are needed.

After a period of in-house development the first version of Kemari for Xen was released in Nov 2008 as open source. However, by then it was already pretty clear that a KVM port would have several advantages. First, KVM is integrated into the Linux kernel, which means one gets support for a wide variety of hardware for free. Second, and in the same vein, KVM can also benefit from Linux's low latency networking capabilities including RDMA, which is of paramount importance for an extremely latency-sensitive functionality like Kemari. Last but not least, KVM and its community are growing rapidly, and there is increasing demand for Kemari-like functionality for KVM.

Although the basic design principles will remain the same, our plan is to write Kemari for KVM from scratch, since there does not seem to be much opportunity for sharing between Xen and KVM.

== Design outline ==

The basic premise of fault tolerant servers is that when things go awry with the hardware the running system should transparently continue execution on an alternate physical host. For this to be possible the state of the fallback host has to be identical to that of the primary.

Kemari runs paired virtual machines in an active-passive configuration and achieves whole-system replication by continuously copying the state of the system (dirty pages and the state of the virtual devices) from the active node to the passive node. An interesting implication of this is that during normal operation only the active node is actually executing code.

Another possible approach is to run a pair of systems in lock-step (à la VMware FT). Since both the primary and fallback virtual machines are active, keeping them synchronized is a complex task, which usually involves carefully injecting external events into both virtual machines so that they result in identical states. The lock-step approach is extremely architecture specific and not SMP friendly. This spurred us to try the design that became Kemari, which we believe lends itself to further optimizations.

== Implementation ==

The first step is to encapsulate the machine to be protected within a virtual machine. Then the live migration functionality is leveraged to keep the virtual machines synchronized.

During live migration, dirty pages can be sent asynchronously from the primary to the fallback server until the ratio of dirty pages is low enough to guarantee very short downtimes. Fault tolerance is different: whenever a synchronization point is reached, the changes to the virtual machine since the previous one have to be sent synchronously. Since the virtual machine has to be stopped until the data reaches and is acknowledged by the fallback server, the synchronization model is of critical importance for performance (both in terms of raw throughput and latencies). The model chosen for Kemari along with other implementation details is described below.

* Synchronization model

The synchronization points were carefully chosen to minimize the amount of traffic that goes over the wire while still keeping the FT pair consistent at all times.
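
To make the synchronous model above a bit more concrete, here is a minimal toy simulation in Python. It is only a sketch and not Kemari or qemu code: the class names (Primary, Fallback), the dictionary used as guest memory and the boolean return value standing in for the network acknowledgement are all assumptions made for illustration.

```python
class Fallback:
    """Passive node: only stores state, never executes guest code."""

    def __init__(self):
        self.pages = {}          # page number -> contents
        self.device_state = {}   # virtual device state

    def receive(self, dirty_pages, device_state):
        # Apply one synchronization epoch, then acknowledge it.
        self.pages.update(dirty_pages)
        self.device_state = dict(device_state)
        return True  # stands in for the network acknowledgement


class Primary:
    """Active node: runs the guest and tracks pages dirtied since the last sync."""

    def __init__(self, fallback):
        self.fallback = fallback
        self.pages = {}
        self.device_state = {}
        self.dirty = set()
        self.running = True

    def guest_write(self, page, data):
        self.pages[page] = data
        self.dirty.add(page)

    def synchronization_point(self):
        # The guest stays stopped until the fallback acknowledges the epoch.
        self.running = False
        delta = {p: self.pages[p] for p in self.dirty}
        acked = self.fallback.receive(delta, self.device_state)
        if not acked:
            raise RuntimeError("fallback did not acknowledge the epoch")
        self.dirty.clear()
        self.running = True
        return len(delta)  # number of pages sent in this epoch


fallback = Fallback()
primary = Primary(fallback)
primary.guest_write(0, b"boot")
primary.guest_write(7, b"data")
primary.device_state["nic"] = {"tx_head": 1}
sent_first = primary.synchronization_point()
primary.guest_write(7, b"data2")
sent_second = primary.synchronization_point()
```

Only the pages dirtied since the previous synchronization point go over the wire in each epoch, and the guest does not run while the transfer is in flight, which is why the choice of synchronization points has such a direct impact on both traffic and latency.
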
To be precise, Kemari uses events that modify externally visible state as synchronization points. This means that all outgoing I/O needs to be trapped and sent to the fallback host before the primary is resumed, so that it can be replayed in the face of hardware failure. The basic assumption here is that outgoing I/O operations are idempotent, which is usually true for disk I/O and reliable network protocols such as TCP (Kemari may trigger hidden bugs on applications that use UDP or other unreliable protocols, so those may need minor changes to ensure they work properly after failover).

The synchronization process can be broken down as follows:

- Event tapping: On KVM all I/O generates a VMEXIT that is synchronously handled by the Linux kernel monitor, i.e. KVM (it is worth noting that this applies to virtio devices too, because they use MMIO and PIO just like a regular PCI device).

- VCPU/Guest freezing: This is automatic in the UP case. On SMP environments we may need to send an IPI to stop the other VCPUs.

- Notification to qemu: Taking a page from live migration's playbook, the synchronization process is user-space driven, which means that qemu needs to be woken up at each synchronization point. That is already the case for qemu-emulated devices, but we also have in-kernel emulators. To compound the problem, even for user-space emulated devices accesses to coalesced MMIO areas cannot be detected. As a consequence we need a mechanism to communicate KVM-handled events to qemu.

  The channel for KVM-qemu communication can be easily built upon the existing infrastructure. We just need to add a new page to the kvm_run shared memory area that can be mmapped from user space and set the exit reason appropriately. Regarding in-kernel device emulators, we only need to care about writes.
  Specifically, making kvm_io_bus_write() fail when Kemari is activated and invoking the emulator again after re-entrance from user space should suffice (this is somewhat similar to what we do in kvm_arch_vcpu_ioctl_run() for MMIO reads).

  To avoid missing synchronization points one should be careful with coalesced MMIO-like optimizations. In the particular case of coalesced MMIO, the I/O operation that caused the exit to user space should act as a write barrier when it was due to an access to a non-coalesced MMIO area. This means that before proceeding to handle the exit in kvm_run() we have to make sure that all the coalesced MMIO has reached the fallback host.

- Virtual machine synchronization: All the dirty pages since the last synchronization point and the state of the virtual devices are sent to the fallback node from the user-space qemu process. For this the existing savevm infrastructure and KVM's dirty page tracking capabilities can be reused. Regarding in-kernel devices, with the likely advent of in-kernel virtio backends we need a generic way to access their state from user-space, for which, again, the kvm_run shared memory area could be used.

- Virtual machine run: Execution of the virtual machine is resumed as soon as synchronization finishes.

* Clock

Even though we do not need to worry about the clock that provides the tick (the counter resides in memory, which we keep synchronized), the same does not apply to counters such as the TSC (we certainly want to avoid a situation where counters jump back in time right after fail-over, breaking guarantees such as monotonicity).

To avoid big hiccups after migration the value of the TSC should be sent to the fallback node frequently. An access from the guest (through RDTSC, RDTSCP, RDMSR, or WRMSR) seems like the right moment to do this.
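
The monotonicity concern can be illustrated with a small worked example in Python. It relies on the fact that with hardware TSC offsetting the guest-visible TSC is the host TSC plus an offset; everything else (the function names, the concrete numbers, ignoring 64-bit wrap-around) is an assumption chosen purely for illustration and is not Kemari code.

```python
def restore_offset(last_replicated_guest_tsc, fallback_host_tsc):
    # Offset the fallback would program so that the guest TSC continues
    # from the last value it received from the primary.
    return last_replicated_guest_tsc - fallback_host_tsc


def guest_tsc(host_tsc, offset):
    # With TSC offsetting, guest-visible TSC = host TSC + offset.
    return host_tsc + offset


# Last value the guest actually read on the primary before the failure.
last_seen_by_guest = 1_000_500

# Fallback host TSC at restore time and at the guest's first read afterwards.
host_at_restore = 50_000
host_at_first_read = 50_100

# Case 1: the fallback only has a stale copy of the guest TSC.
stale_offset = restore_offset(1_000_000, host_at_restore)
stale_first_read = guest_tsc(host_at_first_read, stale_offset)

# Case 2: the guest's last TSC access was replicated before the failure.
fresh_offset = restore_offset(last_seen_by_guest, host_at_restore)
fresh_first_read = guest_tsc(host_at_first_read, fresh_offset)

went_backwards = stale_first_read < last_seen_by_guest   # True
stays_monotonic = fresh_first_read >= last_seen_by_guest  # True
```

In the stale case the first read after fail-over returns 1,000,100, which is lower than the 1,000,500 the guest had already observed, so the counter appears to jump back in time. When the value replicated to the fallback is at least as recent as the guest's last read, the counter keeps moving forward.
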
Fortunately, both VMX and SVM provide controls to intercept accesses to the TSC, so it is just a matter of setting those appropriately ("RDTSC exiting" VM-execution control, and RDTSC, RDTSCP, RDMSR, WRMSR instruction intercepts, respectively). However, since synchronizing the virtual machines every time the TSC is accessed would be prohibitive, the transmission of the TSC will be done lazily, which means delaying it until a non-TSC synchronization point arrives.

* Failover

The failover process kicks in whenever a failure in the primary node is detected. At the time of writing we just ping the virtual machine periodically to determine whether it is still alive, but in the long term we have plans to integrate Kemari with the major HA stacks (Heartbeat, RHCS, etc). Ideally, we would like to leverage the hardware failure detection capabilities of newish x86 hardware to trigger failover, the idea being that transferring control to the fallback node proactively when a problem is detected is much faster than relying on the polling mechanisms used by most HA software.

Finally, to restore the virtual machine in the fallback host the loadvm infrastructure used for live-migration is leveraged.

* Further information

Please visit the link below for additional information, including documentation and, most importantly, source code (for Xen only at the moment).

http://www.osrg.net/kemari

==

Any comments and suggestions would be greatly appreciated.

If this is the right forum and people on the KVM mailing list do not mind, we would like to use the CC'ed mailing lists for Kemari development. Having more expert eyes looking at one's code always helps.

Thanks,

Fernando