# KubeVirt and the KVM user space

This is the entry point to a series of documents which, together, detail the current status of KubeVirt and how it interacts with the KVM user space.

The intended audience is people who are familiar with the traditional virtualization stack (QEMU plus libvirt); to make the material more approachable to them, comparisons are included and little to no knowledge of KubeVirt or Kubernetes is assumed.

Each section contains a short summary as well as a link to a separate document discussing the topic in more detail, with the intention that readers will be able to quickly get a high-level understanding of the various topics by reading this overview document and then dig further into the specific areas they're interested in.

## Architecture

### Goals

* KubeVirt aims to feel completely native to Kubernetes users
* VMs should behave like containers whenever possible
* There should be no features that are limited to VMs when it would make sense to implement them for containers too
* KubeVirt also aims to support all the workloads that traditional virtualization can handle
* Windows support, device assignment, etc. are all fair game
* When these two goals clash, integration with Kubernetes usually wins

### Components

* KubeVirt is made up of various discrete components that interact with Kubernetes and the KVM user space
* The overall design is somewhat similar to that of libvirt, except with a much higher granularity and many of the tasks offloaded to Kubernetes
* Some of the components run at the cluster level or host level with very high privileges, others run at the pod level with significantly reduced privileges

Additional information: [Components][]

### Runtime environment

* QEMU expects its environment to be set up in advance, something that is typically taken care of by libvirt
* libvirtd, when not running in session mode, assumes that it has root-level access to the system and can perform pretty much any privileged operation
* In Kubernetes, the runtime environment is usually heavily locked down and many privileged operations are not permitted
* Requiring additional permissions for VMs goes against the goal, mentioned earlier, of having VMs behave the same as containers whenever possible

## Specific areas

### Hotplug

* QEMU supports hotplug (and hot-unplug) of most devices, and its use is extremely common
* Conversely, resources associated with containers such as storage volumes, network interfaces and CPU shares are allocated upfront and do not change throughout the life of the workload
* If the container needs more (or fewer) resources, the Kubernetes approach is to destroy the existing one and schedule a new one to take over its role

Additional information: [Hotplug][]

### Storage

* Handled through the same Kubernetes APIs used for containers
* QEMU / libvirt only see an image file and don't have direct access to the underlying storage implementation (see the sketch after this list)
* This makes certain scenarios that are common in the virtualization world very challenging: examples include hotplug and full VM snapshots (storage plus memory)
* It might be possible to remove some of these limitations by changing the way storage is exposed to QEMU, or even take advantage of the storage technologies that QEMU already implements and make them available to containers in addition to VMs

Additional information: [Storage][]
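To make the image-file-only view concrete, here is a minimal sketch of how a volume mounted into the pod ends up being consumed: a plain file path is inspected with `qemu-img` and then referenced in a libvirt disk definition. The mount path, file name and directory layout used below are assumptions made for illustration, not KubeVirt's actual implementation.

```python
#!/usr/bin/env python3
# Illustrative sketch only: a volume mounted into the pod is consumed as a
# plain image file. The paths and names below are assumptions for this sketch.
import json
import subprocess

# A PVC mounted into the pod typically surfaces as a directory containing a
# disk image file; the exact layout here is assumed for illustration.
pvc_mount = "/var/run/kubevirt-private/vmi-disks/rootdisk"
image_path = f"{pvc_mount}/disk.img"

# QEMU (via libvirt) only ever sees this file: inspecting it with qemu-img is
# all the "storage awareness" available at this level.
info = subprocess.run(
    ["qemu-img", "info", "--output=json", image_path],
    check=True, capture_output=True, text=True,
)
details = json.loads(info.stdout)
print(f"format: {details['format']}, virtual size: {details['virtual-size']} bytes")

# The resulting libvirt disk definition references the file and nothing else;
# the underlying storage technology (Ceph, NFS, local disk, ...) is invisible.
disk_xml = f"""
<disk type='file' device='disk'>
  <driver name='qemu' type='{details['format']}'/>
  <source file='{image_path}'/>
  <target dev='vda' bus='virtio'/>
</disk>
"""
print(disk_xml)
```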
### Networking

* Application processes running in VMs are hidden behind a network interface, as opposed to appearing as local sockets and processes running in a separate user namespace
* Service meshes proxy and monitor applications by means of socket redirection and classification based on local ports and process identifiers; we need to aim for generic compatibility with these mechanisms
* Existing solutions replicate a full TCP/IP stack to pretend that applications running in a QEMU instance are local, which leaves no chance for zero-copy transfers or for avoiding context switches
* Network connectivity is shared between the control plane and the workload itself, so addressing and port mapping need particular attention
* Linux capabilities granted to the pod might be minimal, or missing entirely. Live migration presents further challenges in terms of network addressing and port mapping

Additional information: [Networking][]

### Live migration

* QEMU supports live migration between hosts, usually coordinated by libvirt
* Kubernetes expects containers to be disposable, so the equivalent of live migration would be to simply destroy the ones running on the source node and schedule replacements on the destination node
* For KubeVirt, a hybrid approach is used: a new container is created on the target node, then the VM is migrated from the old container, running on the source node, to the newly created one

Additional information: [Live migration][]

### CPU pinning

* CPU pinning is not handled by QEMU directly, but is instead delegated to libvirt
* KubeVirt figures out which CPUs are assigned to the container after it has been started by Kubernetes, and passes that information to libvirt so that it can perform CPU pinning

Additional information: [CPU pinning][]

### NUMA pinning

* NUMA pinning is not handled by QEMU directly, but is instead delegated to libvirt
* KubeVirt doesn't implement NUMA pinning at the moment

Additional information: [NUMA pinning][]

### Isolation

* For security reasons, it's a good idea to run each QEMU process in an environment that is isolated from the host as well as other VMs
* This includes using a separate unprivileged user account, setting up namespaces and cgroups, using SELinux...
* QEMU doesn't take care of this itself and delegates it to libvirt
* Most of these techniques serve as the base for containers, so KubeVirt can rely on Kubernetes providing a similar level of isolation without further intervention

Additional information: [Isolation][]

## Other tidbits

### Upgrades

* When libvirt is upgraded, running VMs keep using the old QEMU binary: the new QEMU binary is used for newly started VMs as well as after VMs have been power cycled or migrated
* KubeVirt behaves similarly, with the old version of libvirt and QEMU remaining in use for running VMs

Additional information: [Upgrades][]

### Backpropagation

* Applications using libvirt usually don't provide all the information, e.g. a full PCI topology, and let libvirt fill in the blanks (see the sketch after this list)
* This might require a second step where the additional information is collected and stored along with the original one
* Backpropagation doesn't fit well with Kubernetes' declarative model, so KubeVirt doesn't currently perform it

Additional information: [Backpropagation][]
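As a hedged illustration of what "letting libvirt fill in the blanks" looks like, the sketch below uses the libvirt Python bindings to define a domain whose network interface carries no PCI address and then reads back the definition that libvirt completed. The domain XML is a bare-bones example invented for this sketch (it assumes a local libvirtd reachable at `qemu:///system`), not something KubeVirt generates.

```python
#!/usr/bin/env python3
# Minimal sketch of libvirt "backpropagation": define a domain without PCI
# addresses, then read back the XML that libvirt completed. The domain below
# is a bare-bones example for illustration, not a KubeVirt-generated one.
import libvirt

MINIMAL_XML = """
<domain type='kvm'>
  <name>backprop-demo</name>
  <memory unit='MiB'>256</memory>
  <vcpu>1</vcpu>
  <os><type arch='x86_64' machine='q35'>hvm</type></os>
  <devices>
    <!-- No <address> element: libvirt will pick a PCI slot itself -->
    <interface type='network'>
      <source network='default'/>
      <model type='virtio'/>
    </interface>
  </devices>
</domain>
"""

conn = libvirt.open("qemu:///system")
dom = conn.defineXML(MINIMAL_XML)

# The stored definition now contains the blanks libvirt filled in, e.g.
# <address type='pci' domain='0x0000' bus='...' slot='...' function='...'/>.
# An application that wants a stable view of the topology has to read this
# back and persist it alongside its own configuration.
print(dom.XMLDesc(0))

dom.undefine()
conn.close()
```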
## Contacts and credits

This information was collected and organized by many people at Red Hat, some of whom have agreed to serve as points of contact for follow-up discussion.

Additional information: [Contacts][]

[Backpropagation]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Backpropagation.md
[CPU pinning]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/CPU-Pinning.md
[Components]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Components.md
[Contacts]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Contacts.md
[Hotplug]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Hotplug.md
[Isolation]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Isolation.md
[Live migration]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Live-Migration.md
[NUMA pinning]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/NUMA-Pinning.md
[Networking]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
[Storage]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Storage.md
[Upgrades]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Upgrades.md