# KubeVirt and the KVM user space

This is the entry point to a series of documents which, together, detail the current status of KubeVirt and how it interacts with the KVM user space.

The intended audience is people who are familiar with the traditional virtualization stack (QEMU plus libvirt); to make the material more approachable to them, comparisons are included and little to no knowledge of KubeVirt or Kubernetes is assumed.

Each section contains a short summary as well as a link to a separate document discussing the topic in more detail, with the intention that readers will be able to quickly get a high-level understanding of the various topics by reading this overview document and then dig further into the specific areas they're interested in.

## Architecture

### Goals

* KubeVirt aims to feel completely native to Kubernetes users
* VMs should behave like containers whenever possible
* There should be no features that are limited to VMs when it would make sense to implement them for containers too
* KubeVirt also aims to support all the workloads that traditional virtualization can handle
* Windows support, device assignment, etc. are all fair game
* When these two goals clash, integration with Kubernetes usually wins

### Components

* KubeVirt is made up of various discrete components that interact with Kubernetes and the KVM user space
* The overall design is somewhat similar to that of libvirt, except with a much higher granularity and many of the tasks offloaded to Kubernetes
* Some of the components run at the cluster level or host level with very high privileges, others run at the pod level with significantly reduced privileges

Additional information: [Components][]

### Runtime environment

* QEMU expects its environment to be set up in advance, something that is typically taken care of by libvirt
* libvirtd, when not running in session mode, assumes that it has root-level access to the system and can perform pretty much any privileged operation
* In Kubernetes, the runtime environment is usually heavily locked down and many privileged operations are not permitted
* Requiring additional permissions for VMs goes against the goal, mentioned earlier, of having VMs behave the same as containers whenever possible

## Specific areas

### Hotplug

* QEMU supports hotplug (and hot-unplug) of most devices, and its use is extremely common
* Conversely, resources associated with containers such as storage volumes, network interfaces and CPU shares are allocated upfront and do not change throughout the life of the workload
* If the container needs more (or fewer) resources, the Kubernetes approach is to destroy the existing one and schedule a new one to take over its role

Additional information: [Hotplug][]

### Storage

* Handled through the same Kubernetes APIs used for containers
* QEMU / libvirt only see an image file and don't have direct access to the underlying storage implementation (see the sketch after this list)
* This makes certain scenarios that are common in the virtualization world very challenging: examples include hotplug and full VM snapshots (storage plus memory)
* It might be possible to remove some of these limitations by changing the way storage is exposed to QEMU, or even take advantage of the storage technologies that QEMU already implements and make them available to containers in addition to VMs

Additional information: [Storage][]
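To make the image-file-only view concrete, here is a minimal sketch of how a volume mounted into the pod ends up being consumed: a plain file path is inspected with `qemu-img` and then referenced in a libvirt disk definition. The mount path, file name and directory layout used below are assumptions made for illustration, not KubeVirt's actual implementation.

```python
#!/usr/bin/env python3
# Illustrative sketch only: a volume mounted into the pod is consumed as a
# plain image file. The paths and names below are assumptions for this sketch.
import json
import subprocess

# A PVC mounted into the pod typically surfaces as a directory containing a
# disk image file; the exact layout here is assumed for illustration.
pvc_mount = "/var/run/kubevirt-private/vmi-disks/rootdisk"
image_path = f"{pvc_mount}/disk.img"

# QEMU (via libvirt) only ever sees this file: inspecting it with qemu-img is
# all the "storage awareness" available at this level.
info = subprocess.run(
    ["qemu-img", "info", "--output=json", image_path],
    check=True, capture_output=True, text=True,
)
details = json.loads(info.stdout)
print(f"format: {details['format']}, virtual size: {details['virtual-size']} bytes")

# The resulting libvirt disk definition references the file and nothing else;
# the underlying storage technology (Ceph, NFS, local disk, ...) is invisible.
disk_xml = f"""
<disk type='file' device='disk'>
  <driver name='qemu' type='{details['format']}'/>
  <source file='{image_path}'/>
  <target dev='vda' bus='virtio'/>
</disk>
"""
print(disk_xml)
```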
### Networking

* Application processes running in VMs are hidden behind a network interface, as opposed to appearing as local sockets and processes running in a separate user namespace
* Service meshes proxy and monitor applications by means of socket redirection and classification based on local ports and process identifiers; we need to aim for generic compatibility with these mechanisms
* Existing solutions replicate a full TCP/IP stack to pretend that applications running in a QEMU instance are local, which leaves no chance for zero-copy transfers or for avoiding context switches
* Network connectivity is shared between the control plane and the workload itself, so addressing and port mapping need particular attention
* Linux capabilities granted to the pod might be minimal, or missing entirely. Live migration presents further challenges in terms of network addressing and port mapping

Additional information: [Networking][]

### Live migration

* QEMU supports live migration between hosts, usually coordinated by libvirt
* Kubernetes expects containers to be disposable, so the equivalent of live migration would be to simply destroy the ones running on the source node and schedule replacements on the destination node
* For KubeVirt, a hybrid approach is used: a new container is created on the target node, then the VM is migrated from the old container, running on the source node, to the newly created one

Additional information: [Live migration][]

### CPU pinning

* CPU pinning is not handled by QEMU directly, but is instead delegated to libvirt
* KubeVirt figures out which CPUs are assigned to the container after it has been started by Kubernetes, and passes that information to libvirt so that it can perform CPU pinning

Additional information: [CPU pinning][]

### NUMA pinning

* NUMA pinning is not handled by QEMU directly, but is instead delegated to libvirt
* KubeVirt doesn't implement NUMA pinning at the moment

Additional information: [NUMA pinning][]

### Isolation

* For security reasons, it's a good idea to run each QEMU process in an environment that is isolated from the host as well as other VMs
* This includes using a separate unprivileged user account, setting up namespaces and cgroups, using SELinux...
* QEMU doesn't take care of this itself and delegates it to libvirt
* Most of these techniques serve as the base for containers, so KubeVirt can rely on Kubernetes providing a similar level of isolation without further intervention

Additional information: [Isolation][]

## Other tidbits

### Upgrades

* When libvirt is upgraded, running VMs keep using the old QEMU binary: the new QEMU binary is used for newly started VMs as well as after VMs have been power cycled or migrated
* KubeVirt behaves similarly, with the old version of libvirt and QEMU remaining in use for running VMs

Additional information: [Upgrades][]

### Backpropagation

* Applications using libvirt usually don't provide all the information, e.g. a full PCI topology, and let libvirt fill in the blanks (see the sketch after this list)
* This might require a second step where the additional information is collected and stored along with the original one
* Backpropagation doesn't fit well with Kubernetes' declarative model, so KubeVirt doesn't currently perform it

Additional information: [Backpropagation][]
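As a hedged illustration of what "letting libvirt fill in the blanks" looks like, the sketch below uses the libvirt Python bindings to define a domain whose network interface carries no PCI address and then reads back the definition that libvirt completed. The domain XML is a bare-bones example invented for this sketch (it assumes a local libvirtd reachable at `qemu:///system`), not something KubeVirt generates.

```python
#!/usr/bin/env python3
# Minimal sketch of libvirt "backpropagation": define a domain without PCI
# addresses, then read back the XML that libvirt completed. The domain below
# is a bare-bones example for illustration, not a KubeVirt-generated one.
import libvirt

MINIMAL_XML = """
<domain type='kvm'>
  <name>backprop-demo</name>
  <memory unit='MiB'>256</memory>
  <vcpu>1</vcpu>
  <os><type arch='x86_64' machine='q35'>hvm</type></os>
  <devices>
    <!-- No <address> element: libvirt will pick a PCI slot itself -->
    <interface type='network'>
      <source network='default'/>
      <model type='virtio'/>
    </interface>
  </devices>
</domain>
"""

conn = libvirt.open("qemu:///system")
dom = conn.defineXML(MINIMAL_XML)

# The stored definition now contains the blanks libvirt filled in, e.g.
# <address type='pci' domain='0x0000' bus='...' slot='...' function='...'/>.
# An application that wants a stable view of the topology has to read this
# back and persist it alongside its own configuration.
print(dom.XMLDesc(0))

dom.undefine()
conn.close()
```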
## Contacts and credits

This information was collected and organized by many people at Red Hat, some of whom have agreed to serve as points of contact for follow-up discussion.

Additional information: [Contacts][]

[Backpropagation]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Backpropagation.md
[CPU pinning]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/CPU-Pinning.md
[Components]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Components.md
[Contacts]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Contacts.md
[Hotplug]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Hotplug.md
[Isolation]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Isolation.md
[Live migration]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Live-Migration.md
[NUMA pinning]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/NUMA-Pinning.md
[Networking]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Networking.md
[Storage]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Storage.md
[Upgrades]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Upgrades.md