# Storage

This document describes the known use-cases and architecture options we have for Linux virtualization storage in [KubeVirt][].

## Problem description

The main goal of KubeVirt is to leverage the storage subsystem of Kubernetes (built around [CSI][] and [Persistent Volumes][], aka PVs) in order to let both kinds of workload (VMs and containers) use the same storage. As a consequence, KubeVirt is limited in its use of the QEMU storage subsystem and its features. That means:

* Storage solutions should be implemented in k8s in a way that can be consumed by both containers and VMs.
* VMs can only consume (and provide) storage features which are available in the pod, through k8s APIs. For example, a VM will not support disk snapshots if it's attached to a storage provider that doesn't support them. Ditto for incremental backup, block jobs, encryption, etc.

## Current situation

### Storage handled outside of QEMU

In this scenario, the VM pod uses a [Persistent Volume Claim (PVC)][Persistent Volumes] to give QEMU access to a raw storage device or a file-system mount, which is provided by a [CSI][] driver. QEMU **doesn't** handle any of the storage use-cases such as thin provisioning, snapshots, change block tracking, block jobs, etc.

This is how things work today in KubeVirt.

![Storage handled outside of QEMU][Storage-Current]

Devices and interfaces:

* PVC: block or fs
* QEMU backend: raw device or raw image
* QEMU frontend: virtio-blk
  * alternative: emulated device for wider compatibility and Windows installations
    * CDROM (SATA)
    * disk (SATA)

Pros:

* Simplicity
* Sharing the same storage model with other pods/containers

Cons:

* Limited feature set (fully off-loaded to the storage provider from CSI)
  * No VM snapshots (disk + memory)
* Limited opportunities for fine-tuning and optimizations for high performance
* Hotplug is challenging, because the set of PVCs in a pod is immutable

Questions and comments:

* How to optimize this in QEMU?
  * Can we bypass the block layer for this use-case? Like having SPDK inside the VM pod?
  * Rust-based storage daemon (e.g. [vhost_user_block][]) running inside the VM pod alongside QEMU (bypassing the block layer)
  * We should be able to achieve high performance with local NVMe storage here, with multiple polling IOThreads and multi-queue.
* See [this blog post][PVC resize blog] for information about the PVC resize feature. To implement this for VMs we could have KubeVirt watch PVCs and respond to capacity changes with a corresponding call to resize the image file (if applicable) and to notify QEMU of the enlarged device (see the sketch after this list).
* Features such as incremental backup (CBT) and snapshots could be implemented through a generic CSI backend... Device mapper? Stratis? (See [Other Topics](#other-topics))
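To make the PVC-resize idea above more concrete, here is a minimal, hypothetical Go sketch (not existing KubeVirt code) of a loop that watches PVC capacity and forwards the new size to a running QEMU via the QMP `block_resize` command. The namespace `vm-namespace`, the QMP socket path `/var/run/qmp.sock` and the block node name `drive-disk0` are made-up placeholders, and real code would also have to grow the image file first for file-backed disks.

```go
// Hypothetical sketch: watch PVCs for capacity changes and propagate the new
// size to QEMU via the QMP "block_resize" command. Not KubeVirt code.
package main

import (
	"context"
	"encoding/json"
	"log"
	"net"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// qmpBlockResize connects to QEMU's QMP monitor socket, negotiates
// capabilities and asks QEMU to resize the block node to newSize bytes.
func qmpBlockResize(socketPath, nodeName string, newSize int64) error {
	conn, err := net.Dial("unix", socketPath)
	if err != nil {
		return err
	}
	defer conn.Close()

	enc := json.NewEncoder(conn)
	dec := json.NewDecoder(conn)

	// QEMU sends a greeting banner on connect; read and discard it.
	var greeting map[string]interface{}
	if err := dec.Decode(&greeting); err != nil {
		return err
	}

	send := func(cmd map[string]interface{}) error {
		if err := enc.Encode(cmd); err != nil {
			return err
		}
		var reply map[string]interface{}
		return dec.Decode(&reply)
	}

	if err := send(map[string]interface{}{"execute": "qmp_capabilities"}); err != nil {
		return err
	}
	return send(map[string]interface{}{
		"execute": "block_resize",
		"arguments": map[string]interface{}{
			"node-name": nodeName,
			"size":      newSize,
		},
	})
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Watch PVCs in the VM's namespace and react to capacity changes.
	w, err := client.CoreV1().PersistentVolumeClaims("vm-namespace").
		Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}

	for ev := range w.ResultChan() {
		pvc, ok := ev.Object.(*corev1.PersistentVolumeClaim)
		if !ok {
			continue
		}
		capacity := pvc.Status.Capacity[corev1.ResourceStorage]
		if capacity.IsZero() {
			continue
		}
		log.Printf("PVC %s now reports %d bytes", pvc.Name, capacity.Value())
		// Hypothetical socket path and node name; in KubeVirt these would be
		// wired up by virt-launcher for the disk backed by this PVC.
		if err := qmpBlockResize("/var/run/qmp.sock", "drive-disk0", capacity.Value()); err != nil {
			log.Printf("block_resize failed: %v", err)
		}
	}
}
```

In KubeVirt itself this logic would presumably live in virt-controller/virt-launcher rather than a standalone binary, but the overall flow (watch capacity, resize the image if applicable, then notify QEMU) would be the same.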
## Possible alternatives

### Storage device passthrough (highest performance)

Device passthrough via PCI VFIO, SCSI or vDPA. No storage use-cases and no CSI, as the device is passed directly to the guest.

![Storage device passthrough][Storage-Passthrough]

Devices and interfaces:

* N/A (hardware passthrough)

Pros:

* Highest possible performance (same as host)

Cons:

* No storage features anywhere outside of the guest.
* No live migration in most cases.

### File-system passthrough (virtio-fs)

File mount volumes (directories, actually) can be exposed to QEMU via [virtio-fs][] so that VMs have access to files and directories.

![File-system passthrough (virtio-fs)][Storage-Virtiofs]

Devices and interfaces:

* PVC: file-system

Pros:

* Simplicity from the user perspective
* Flexibility
* Great for heterogeneous workloads that share data between containers and VMs (e.g. OpenShift Pipelines)

Cons:

* Performance when compared to block device passthrough

Questions and comments:

* The feature is still quite new (the Windows driver is fresh out of the oven).

### QEMU storage daemon in CSI for local storage

The qemu-storage-daemon is a user-space daemon that exposes QEMU's block layer to external users. It's similar to [SPDK][], but includes the implementation of QEMU block layer features such as snapshots and bitmap tracking for incremental backup (CBT). It also allows splitting a single NVMe device, so that multiple QEMU VMs can share one NVMe disk.

In this architecture, the storage daemon runs as part of CSI (control plane), with the data plane being either a vhost-user-blk interface for QEMU or a fs-mount export for containers.

![QEMU storage daemon in CSI for local storage][Storage-QSD]

Devices and interfaces:

* CSI:
  * fs mount with a vhost-user-blk socket for QEMU to open
  * (OR) fs mount via NBD or FUSE with the actual file-system contents
* qemu-storage-daemon backend: local NVMe device with raw or qcow2
  * alternative: any driver supported by QEMU, such as file-posix
* QEMU frontend: virtio-blk
  * alternative: any emulated device (CDROM, virtio-scsi, etc.)
    * In this case QEMU itself would be consuming vhost-user-blk and emulating the device for the guest

Pros:

* The NVMe driver from the storage daemon can support partitioning one NVMe device into multiple blk devices, each shared via a vhost-user-blk connection.
* Rich feature set, exposing features already implemented in the QEMU block layer to regular pods/containers:
  * Snapshots and thin provisioning (qcow2)
  * Incremental Backup (CBT)
* Compatibility with use-cases from other projects (oVirt, OpenStack, etc.):
  * Snapshots, thin provisioning, CBT and block jobs via QEMU

Cons:

* Complexity due to cascading and splitting of components.
* Depends on the evolution of CSI APIs to provide the right use-cases.

Questions and comments:

* Locality restriction: QEMU and qemu-storage-daemon must be running on the same host (for vhost-user-blk shared memory to work).
* Need to cascade CSI providers for volume management (resize, creation, etc.)
* How to share a partitioned NVMe device (from one storage daemon) with multiple pods?
* See also: [kubevirt/kubevirt#3208][] (similar idea for vhost-user-net).
* We could do hotplugging under the hood with the storage daemon:
  * To expose a new PV, a new qemu-storage-daemon pod can be created with a corresponding PVC. Conversely, on unplug, the pod can be deleted. Ideally, we might have a 1:1 relationship between PVs and storage daemon pods (though 1:n for attaching multiple guests to a single daemon).
  * This requires that we can create a new unix socket connection from new storage daemon pods to the VMs. The exact way to achieve this is still to be figured out. According to Adam Litke, the naive way would require elevated privileges for both pods.
  * Once the socket (either the file or a file descriptor) is available in the VM pod, QEMU can connect to it.
* To avoid a mix of block devices that have a PVC in the VM pod and others where we just pass in a unix socket, we could avoid the PVC case in the VM pod entirely:
  * For exposing a PV to QEMU, we would always go through the storage daemon (i.e. the PVC moves from the VM pod to the storage daemon pod), so the VM pod always only gets a unix socket connection, unifying the two cases.
  * Using vhost-user-blk from the storage daemon pod performs the same as having a PVC directly in the VM pod (or potentially better, if it allows for polling that we wouldn't have done otherwise), so while it looks like an indirection, the actual I/O path would be comparable.
  * This architecture would also allow using the native Gluster/Ceph/NBD/… block drivers in the QEMU process without making them special (because they wouldn't use a PVC either), unifying even more cases.
* Kubernetes has fairly low per-node Pod limits by default, so we may need to be careful about a 1:1 Pod/PVC mapping. We may want to support aggregation of multiple storage connections into a single q-s-d Pod.
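As a rough illustration of the data plane and control plane described above, the sketch below shows what a hypothetical storage daemon pod (e.g. started by a CSI node plugin) might run: a qcow2 image living on the PVC is exported over vhost-user-blk for the VM pod, while a QMP monitor socket stays available for snapshots, dirty bitmaps and block jobs. All paths, node names and export IDs are placeholders, and the exact qemu-storage-daemon options should be verified against the version being deployed.

```go
// Rough sketch (not an existing KubeVirt or CSI component): run
// qemu-storage-daemon to export a qcow2 image from the PVC as a
// vhost-user-blk device for the VM pod. All names and paths are placeholders.
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Assumptions: the PVC is mounted at /pvc inside this pod, and /run/vhost
	// is a volume shared with the VM pod so QEMU can reach the unix sockets.
	cmd := exec.Command("qemu-storage-daemon",
		// Protocol layer: the image file sitting on the PVC.
		"--blockdev", "driver=file,node-name=file0,filename=/pvc/disk.qcow2",
		// Format layer: qcow2 provides thin provisioning, snapshots and
		// dirty bitmaps (CBT) even if the underlying storage has none.
		"--blockdev", "driver=qcow2,node-name=disk0,file=file0",
		// Data plane: vhost-user-blk export on a unix socket for QEMU.
		"--export", "type=vhost-user-blk,id=export0,node-name=disk0,"+
			"addr.type=unix,addr.path=/run/vhost/disk0.sock,writable=on",
		// Control plane: QMP monitor for snapshots, bitmaps and block jobs.
		"--chardev", "socket,id=qmp0,path=/run/vhost/qmp.sock,server=on,wait=off",
		"--monitor", "chardev=qmp0",
	)
	cmd.Stdout, cmd.Stderr = log.Writer(), log.Writer()
	if err := cmd.Run(); err != nil {
		log.Fatalf("qemu-storage-daemon exited: %v", err)
	}
}
```

On the VM pod side, QEMU would then connect to `/run/vhost/disk0.sock` with a `vhost-user-blk-pci` device backed by shareable guest memory (for example a memfd memory backend with `share=on`), which is also why the daemon and QEMU have to run on the same host.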
## Other topics

### Device Mapper

Another possibility is to leverage the Linux device mapper to provide features such as snapshots and even incremental backup. For example, [dm-era][] seems to provide the basic primitives for bitmap tracking. This could be part of scenario number 1 (storage handled outside of QEMU), or cascaded with other PVs somewhere else.

Is this already being used? For example, [cybozu-go/topolvm][] is a CSI LVM plugin for k8s.

### Stratis

[Stratis][] seems to be an interesting project to leverage in the world of Kubernetes.

### vhost-user-blk in other CSI backends

Would it make sense for other CSI backends to implement support for vhost-user-blk?

[CSI]: https://kubernetes.io/blog/2019/01/15/container-storage-interface-ga/
[KubeVirt]: https://kubevirt.io/
[PVC resize blog]: https://kubernetes.io/blog/2018/07/12/resizing-persistent-volumes-using-kubernetes/
[Persistent Volumes]: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
[SPDK]: https://spdk.io/
[Storage-Current]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-Current.png
[Storage-Passthrough]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-Passthrough.png
[Storage-QSD]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-QSD.png
[Storage-Virtiofs]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-Virtiofs.png
[Stratis]: https://stratis-storage.github.io/
[cybozu-go/topolvm]: https://github.com/cybozu-go/topolvm
[dm-era]: https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/era.html
[kubevirt/kubevirt#3208]: https://github.com/kubevirt/kubevirt/pull/3208
[vhost_user_block]: https://github.com/cloud-hypervisor/cloud-hypervisor/tree/master/vhost_user_block
[virtio-fs]: https://virtio-fs.gitlab.io/