# Storage

This document describes the known use-cases and architecture options we have for Linux virtualization storage in [KubeVirt][].

## Problem description

The main goal of KubeVirt is to leverage the storage subsystem of Kubernetes (built around [CSI][] and [Persistent Volumes][], aka PVs) in order to let both kinds of workload (VMs and containers) use the same storage. As a consequence, KubeVirt is limited in its use of the QEMU storage subsystem and its features. That means:

* Storage solutions should be implemented in k8s in a way that can be consumed by both containers and VMs.
* VMs can only consume (and provide) storage features which are available in the pod, through k8s APIs. For example, a VM will not support disk snapshots if it's attached to a storage provider that doesn't support them. Ditto for incremental backup, block jobs, encryption, etc.

## Current situation

### Storage handled outside of QEMU

In this scenario, the VM pod uses a [Persistent Volume Claim (PVC)][Persistent Volumes] to give QEMU access to a raw storage device or a file-system mount, which is provided by a [CSI][] driver. QEMU **doesn't** handle any of the storage use-cases such as thin provisioning, snapshots, change block tracking, block jobs, etc.

This is how things work today in KubeVirt.

![Storage handled outside of QEMU][Storage-Current]

Devices and interfaces:

* PVC: block or fs
* QEMU backend: raw device or raw image
* QEMU frontend: virtio-blk
  * alternative: emulated device for wider compatibility and Windows installations
    * CDROM (SATA)
    * disk (SATA)

Pros:

* Simplicity
* Sharing the same storage model with other pods/containers

Cons:

* Limited feature set (fully off-loaded to the storage provider from CSI)
  * No VM snapshots (disk + memory)
* Limited opportunities for fine-tuning and optimizations for high performance
* Hotplug is challenging, because the set of PVCs in a pod is immutable

Questions and comments:

* How to optimize this in QEMU?
  * Can we bypass the block layer for this use-case? Like having SPDK inside the VM pod?
  * Rust-based storage daemon (e.g. [vhost_user_block][]) running inside the VM pod alongside QEMU (bypassing the block layer)
  * We should be able to achieve high performance with local NVMe storage here, with multiple polling IOThreads and multi-queue.
* See [this blog post][PVC resize blog] for information about the PVC resize feature. To implement this for VMs we could have KubeVirt watch PVCs and respond to capacity changes with a corresponding call to resize the image file (if applicable) and to notify QEMU of the enlarged device (see the sketch after this list).
* Features such as incremental backup (CBT) and snapshots could be implemented through a generic CSI backend... Device mapper? Stratis? (See [Other Topics](#other-topics))
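To make the PVC-resize idea above more concrete, here is a minimal, hypothetical Go sketch (not existing KubeVirt code) of a loop that watches PVC capacity and forwards the new size to a running QEMU via the QMP `block_resize` command. The namespace `vm-namespace`, the QMP socket path `/var/run/qmp.sock` and the block node name `drive-disk0` are made-up placeholders, and real code would also have to grow the image file first for file-backed disks.

```go
// Hypothetical sketch: watch PVCs for capacity changes and propagate the new
// size to QEMU via the QMP "block_resize" command. Not KubeVirt code.
package main

import (
	"context"
	"encoding/json"
	"log"
	"net"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// qmpBlockResize connects to QEMU's QMP monitor socket, negotiates
// capabilities and asks QEMU to resize the block node to newSize bytes.
func qmpBlockResize(socketPath, nodeName string, newSize int64) error {
	conn, err := net.Dial("unix", socketPath)
	if err != nil {
		return err
	}
	defer conn.Close()

	enc := json.NewEncoder(conn)
	dec := json.NewDecoder(conn)

	// QEMU sends a greeting banner on connect; read and discard it.
	var greeting map[string]interface{}
	if err := dec.Decode(&greeting); err != nil {
		return err
	}

	send := func(cmd map[string]interface{}) error {
		if err := enc.Encode(cmd); err != nil {
			return err
		}
		var reply map[string]interface{}
		return dec.Decode(&reply)
	}

	if err := send(map[string]interface{}{"execute": "qmp_capabilities"}); err != nil {
		return err
	}
	return send(map[string]interface{}{
		"execute": "block_resize",
		"arguments": map[string]interface{}{
			"node-name": nodeName,
			"size":      newSize,
		},
	})
}

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// Watch PVCs in the VM's namespace and react to capacity changes.
	w, err := client.CoreV1().PersistentVolumeClaims("vm-namespace").
		Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}

	for ev := range w.ResultChan() {
		pvc, ok := ev.Object.(*corev1.PersistentVolumeClaim)
		if !ok {
			continue
		}
		capacity := pvc.Status.Capacity[corev1.ResourceStorage]
		if capacity.IsZero() {
			continue
		}
		log.Printf("PVC %s now reports %d bytes", pvc.Name, capacity.Value())
		// Hypothetical socket path and node name; in KubeVirt these would be
		// wired up by virt-launcher for the disk backed by this PVC.
		if err := qmpBlockResize("/var/run/qmp.sock", "drive-disk0", capacity.Value()); err != nil {
			log.Printf("block_resize failed: %v", err)
		}
	}
}
```

In KubeVirt itself this logic would presumably live in virt-controller/virt-launcher rather than a standalone binary, but the overall flow (watch capacity, resize the image if applicable, then notify QEMU) would be the same.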
## Possible alternatives

### Storage device passthrough (highest performance)

Device passthrough via PCI VFIO, SCSI or vDPA. No storage use-cases and no CSI, as the device is passed directly to the guest.

![Storage device passthrough][Storage-Passthrough]

Devices and interfaces:

* N/A (hardware passthrough)

Pros:

* Highest possible performance (same as host)

Cons:

* No storage features anywhere outside of the guest.
* No live migration in most cases.

### File-system passthrough (virtio-fs)

File mount volumes (directories, actually) can be exposed to QEMU via [virtio-fs][] so that VMs have access to files and directories.

![File-system passthrough (virtio-fs)][Storage-Virtiofs]

Devices and interfaces:

* PVC: file-system

Pros:

* Simplicity from the user perspective
* Flexibility
* Great for heterogeneous workloads that share data between containers and VMs (e.g. OpenShift Pipelines)

Cons:

* Performance when compared to block device passthrough

Questions and comments:

* The feature is still quite new (the Windows driver is fresh out of the oven).

### QEMU storage daemon in CSI for local storage

The qemu-storage-daemon is a user-space daemon that exposes QEMU's block layer to external users. It's similar to [SPDK][], but includes the implementation of QEMU block layer features such as snapshots and bitmap tracking for incremental backup (CBT). It also allows splitting a single NVMe device, so that multiple QEMU VMs can share one NVMe disk.

In this architecture, the storage daemon runs as part of CSI (control plane), with the data plane being either a vhost-user-blk interface for QEMU or a fs-mount export for containers.

![QEMU storage daemon in CSI for local storage][Storage-QSD]

Devices and interfaces:

* CSI:
  * fs mount with a vhost-user-blk socket for QEMU to open
  * (OR) fs mount via NBD or FUSE with the actual file-system contents
* qemu-storage-daemon backend: local NVMe device with raw or qcow2
  * alternative: any driver supported by QEMU, such as file-posix
* QEMU frontend: virtio-blk
  * alternative: any emulated device (CDROM, virtio-scsi, etc.)
    * In this case QEMU itself would be consuming vhost-user-blk and emulating the device for the guest

Pros:

* The NVMe driver from the storage daemon can support partitioning one NVMe device into multiple blk devices, each shared via a vhost-user-blk connection.
* Rich feature set, exposing features already implemented in the QEMU block layer to regular pods/containers:
  * Snapshots and thin provisioning (qcow2)
  * Incremental Backup (CBT)
* Compatibility with use-cases from other projects (oVirt, OpenStack, etc.):
  * Snapshots, thin provisioning, CBT and block jobs via QEMU

Cons:

* Complexity due to cascading and splitting of components.
* Depends on the evolution of CSI APIs to provide the right use-cases.

Questions and comments:

* Locality restriction: QEMU and qemu-storage-daemon must be running on the same host (for vhost-user-blk shared memory to work).
* Need to cascade CSI providers for volume management (resize, creation, etc.)
* How to share a partitioned NVMe device (from one storage daemon) with multiple pods?
* See also: [kubevirt/kubevirt#3208][] (similar idea for vhost-user-net).
* We could do hotplugging under the hood with the storage daemon:
  * To expose a new PV, a new qemu-storage-daemon pod can be created with a corresponding PVC. Conversely, on unplug, the pod can be deleted. Ideally, we might have a 1:1 relationship between PVs and storage daemon pods (though 1:n for attaching multiple guests to a single daemon).
  * This requires that we can create a new unix socket connection from new storage daemon pods to the VMs. The exact way to achieve this is still to be figured out. According to Adam Litke, the naive way would require elevated privileges for both pods.
  * Once the socket (either the file or a file descriptor) is available in the VM pod, QEMU can connect to it.
* To avoid a mix of block devices that have a PVC in the VM pod and others where we just pass in a unix socket, we could avoid the PVC case in the VM pod entirely:
  * For exposing a PV to QEMU, we would always go through the storage daemon (i.e. the PVC moves from the VM pod to the storage daemon pod), so the VM pod always only gets a unix socket connection, unifying the two cases.
  * Using vhost-user-blk from the storage daemon pod performs the same as having a PVC directly in the VM pod (or potentially better, if it allows for polling that we wouldn't have done otherwise), so while it looks like an indirection, the actual I/O path would be comparable.
  * This architecture would also allow using the native Gluster/Ceph/NBD/… block drivers in the QEMU process without making them special (because they wouldn't use a PVC either), unifying even more cases.
* Kubernetes has fairly low per-node Pod limits by default, so we may need to be careful about a 1:1 Pod/PVC mapping. We may want to support aggregation of multiple storage connections into a single q-s-d Pod.
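As a rough illustration of the data plane and control plane described above, the sketch below shows what a hypothetical storage daemon pod (e.g. started by a CSI node plugin) might run: a qcow2 image living on the PVC is exported over vhost-user-blk for the VM pod, while a QMP monitor socket stays available for snapshots, dirty bitmaps and block jobs. All paths, node names and export IDs are placeholders, and the exact qemu-storage-daemon options should be verified against the version being deployed.

```go
// Rough sketch (not an existing KubeVirt or CSI component): run
// qemu-storage-daemon to export a qcow2 image from the PVC as a
// vhost-user-blk device for the VM pod. All names and paths are placeholders.
package main

import (
	"log"
	"os/exec"
)

func main() {
	// Assumptions: the PVC is mounted at /pvc inside this pod, and /run/vhost
	// is a volume shared with the VM pod so QEMU can reach the unix sockets.
	cmd := exec.Command("qemu-storage-daemon",
		// Protocol layer: the image file sitting on the PVC.
		"--blockdev", "driver=file,node-name=file0,filename=/pvc/disk.qcow2",
		// Format layer: qcow2 provides thin provisioning, snapshots and
		// dirty bitmaps (CBT) even if the underlying storage has none.
		"--blockdev", "driver=qcow2,node-name=disk0,file=file0",
		// Data plane: vhost-user-blk export on a unix socket for QEMU.
		"--export", "type=vhost-user-blk,id=export0,node-name=disk0,"+
			"addr.type=unix,addr.path=/run/vhost/disk0.sock,writable=on",
		// Control plane: QMP monitor for snapshots, bitmaps and block jobs.
		"--chardev", "socket,id=qmp0,path=/run/vhost/qmp.sock,server=on,wait=off",
		"--monitor", "chardev=qmp0",
	)
	cmd.Stdout, cmd.Stderr = log.Writer(), log.Writer()
	if err := cmd.Run(); err != nil {
		log.Fatalf("qemu-storage-daemon exited: %v", err)
	}
}
```

On the VM pod side, QEMU would then connect to `/run/vhost/disk0.sock` with a `vhost-user-blk-pci` device backed by shareable guest memory (for example a memfd memory backend with `share=on`), which is also why the daemon and QEMU have to run on the same host.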
## Other topics

### Device Mapper

Another possibility is to leverage the Linux device mapper to provide features such as snapshots and even incremental backup. For example, [dm-era][] seems to provide the basic primitives for bitmap tracking. This could be part of scenario number 1 (storage handled outside of QEMU), or cascaded with other PVs somewhere else.

Is this already being used? For example, [cybozu-go/topolvm][] is a CSI LVM plugin for k8s.

### Stratis

[Stratis][] seems to be an interesting project to leverage in the world of Kubernetes.

### vhost-user-blk in other CSI backends

Would it make sense for other CSI backends to implement support for vhost-user-blk?

[CSI]: https://kubernetes.io/blog/2019/01/15/container-storage-interface-ga/
[KubeVirt]: https://kubevirt.io/
[PVC resize blog]: https://kubernetes.io/blog/2018/07/12/resizing-persistent-volumes-using-kubernetes/
[Persistent Volumes]: https://kubernetes.io/docs/concepts/storage/persistent-volumes/
[SPDK]: https://spdk.io/
[Storage-Current]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-Current.png
[Storage-Passthrough]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-Passthrough.png
[Storage-QSD]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-QSD.png
[Storage-Virtiofs]: https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Storage-Virtiofs.png
[Stratis]: https://stratis-storage.github.io/
[cybozu-go/topolvm]: https://github.com/cybozu-go/topolvm
[dm-era]: https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/era.html
[kubevirt/kubevirt#3208]: https://github.com/kubevirt/kubevirt/pull/3208
[vhost_user_block]: https://github.com/cloud-hypervisor/cloud-hypervisor/tree/master/vhost_user_block
[virtio-fs]: https://virtio-fs.gitlab.io/