Oops, I placed the subject in the wrong place.

Best regards,
Maxim Levitsky

On Tue, 2019-03-19 at 16:41 +0200, Maxim Levitsky wrote:
> Date: Tue, 19 Mar 2019 14:45:45 +0200
> Subject: [PATCH 0/9] RFC: NVME VFIO mediated device
>
> Hi everyone!
>
> In this patch series, I would like to introduce my take on the problem of
> virtualizing storage as fast as possible, with emphasis on low latency.
>
> In this patch series I implemented a kernel vfio based, mediated device that
> allows the user to pass through a partition and/or whole namespace to a guest.
>
> The idea behind this driver is based on the paper you can find at
> https://www.usenix.org/conference/atc18/presentation/peng
>
> Note, however, that I started the development prior to reading this paper,
> independently.
>
> In addition to that, the implementation is not based on the code used in the
> paper, as I wasn't able at that time to get access to the source.
>
> ***Key points about the implementation:***
>
> * A polling kernel thread is used. The polling is stopped after a
> predefined timeout (1/2 sec by default).
> Support for a fully interrupt driven mode is planned, and it shows promising
> results.
>
> * The guest sees a standard NVME device - this allows running guests with
> unmodified drivers, for example Windows guests.
>
> * The NVMe device is shared between host and guest.
> That means that even a single namespace can be split between host
> and guest based on different partitions.
>
> * Simple configuration
>
> *** Performance ***
>
> Performance was tested on an Intel DC P3700 with a Xeon E5-2620 v2,
> and both latency and throughput are very similar to SPDK.
>
> Soon I will test this on a better server and nvme device and provide
> more formal performance numbers.
>
> Latency numbers:
> ~80ms - spdk with fio plugin on the host.
> ~84ms - nvme driver on the host
> ~87ms - mdev-nvme + nvme driver in the guest
>
> Throughput followed a similar pattern as well.
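To put the latency figures quoted above in relative terms, here is a small sketch that computes the percentage overhead between two of them. It uses the numbers exactly as written in the letter; the helper name overhead_pct is made up for this illustration and is not part of the patch series.

```shell
#!/bin/sh
# Sketch only: relative overhead between two latency figures from the letter.
# overhead_pct is a hypothetical helper name, not something the series ships.

# overhead_pct BASE VALUE
# Prints by how many percent VALUE exceeds BASE, with one decimal place.
overhead_pct() {
    awk -v base="$1" -v val="$2" \
        'BEGIN { printf "%.1f\n", (val - base) * 100 / base }'
}

# mdev-nvme in the guest (87) vs. the nvme driver on the host (84)
overhead_pct 84 87

# mdev-nvme in the guest (87) vs. spdk with the fio plugin on the host (80)
overhead_pct 80 87
```

The first call compares the guest against the host kernel driver, which is the overhead attributable to the mediated device itself; the second compares it against the SPDK baseline.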
>
> * Configuration example
> $ modprobe nvme mdev_queues=4
> $ modprobe nvme-mdev
>
> $ UUID=$(uuidgen)
> $ DEVICE='device pci address'
> $ echo $UUID > /sys/bus/pci/devices/$DEVICE/mdev_supported_types/nvme-2Q_V1/create
> $ echo n1p3 > /sys/bus/mdev/devices/$UUID/namespaces/add_namespace # attach host namespace 1 partition 3
> $ echo 11 > /sys/bus/mdev/devices/$UUID/settings/iothread_cpu # pin the io thread to cpu 11
>
> Afterward boot qemu with
> -device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID
>
> Zero configuration on the guest.
>
> *** FAQ ***
>
> * Why do this in the kernel? Why is this better than SPDK?
>
> -> Reuse the existing nvme kernel driver in the host. No new drivers in the
> guest.
>
> -> Share the NVMe device between host and guest.
> Even in fully virtualized configurations,
> some partitions of the nvme device could be used by guests as block devices
> while others are passed through with nvme-mdev, to achieve a balance between
> all the features of full IO stack emulation and performance.
>
> -> NVME-MDEV is a bit faster due to the fact that the in-kernel driver
> can send interrupts to the guest directly, without a context
> switch that can be expensive due to meltdown mitigation.
>
> -> Is able to utilize interrupts to get reasonable performance.
> This is only implemented
> as a proof of concept and not included in the patches,
> but the interrupt driven mode shows reasonable performance.
>
> -> This is a framework that later can be used to support NVMe devices
> with more of the IO virtualization built-in
> (IOMMU with PASID support coupled with a device that supports it)
>
> * Why attach directly to the nvme-pci driver and not use block layer IO?
> -> The direct attachment allows for better performance, but I will
> check the possibility of using block IO, especially for fabrics drivers.
>
> *** Implementation notes ***
>
> * All guest memory is mapped into the physical nvme device,
> but not 1:1 as vfio-pci would do this.
> This allows very efficient DMA.
> To support this, patch 2 adds the ability for an mdev device to listen on
> the guest's memory map events.
> Any such memory is immediately pinned and then DMA mapped.
> (Support for fabric drivers where this is not possible exists too,
> in which case the fabric driver will do its own DMA mapping)
>
> * The nvme core driver is modified to announce the appearance
> and disappearance of nvme controllers and namespaces,
> to which the nvme-mdev driver is subscribed.
>
> * The nvme-pci driver is modified to expose a raw interface for attaching to
> and sending/polling the IO queues.
> This allows the mdev driver to submit/poll for the IO very efficiently.
> By default one host queue is used per mediated device.
> (support for other fabric based host drivers is planned)
>
> * The nvme-mdev doesn't assume the presence of KVM, thus any VFIO user,
> including SPDK, a qemu running with TCG, ... can use this virtual device.
>
> *** Testing ***
>
> The device was tested with stock QEMU 3.0 on the host,
> with the host running a 5.0 kernel with nvme-mdev added, and the following
> hardware:
> * QEMU nvme virtual device (with nested guest)
> * Intel DC P3700 on Xeon E5-2620 v2 server
> * Samsung SM981 (in a Thunderbolt enclosure, with my laptop)
> * Lenovo NVME device found in my laptop
>
> The guest was tested with kernels 4.16, 4.18, 4.20 and
> the same custom compiled kernel 5.0.
> A Windows 10 guest was tested too, with both Microsoft's inbox driver and
> the open source community NVME driver
> (https://lists.openfabrics.org/pipermail/nvmewin/2016-December/001420.html)
>
> Testing was mostly done on x86_64, but a 32 bit host/guest combination
> was lightly tested too.
>
> In addition to that, the virtual device was tested with nested guest,
> by passing the virtual device to it,
> using pci passthrough, qemu userspace nvme driver, and spdk
>
>
> PS: I used to contribute to the kernel as a hobby using the
> maximlevitsky@xxxxxxxxx address
>
> Maxim Levitsky (9):
>   vfio/mdev: add .request callback
>   nvme/core: add some more values from the spec
>   nvme/core: add NVME_CTRL_SUSPENDED controller state
>   nvme/pci: use the NVME_CTRL_SUSPENDED state
>   nvme/pci: add known admin effects to augument admin effects log page
>   nvme/pci: init shadow doorbell after each reset
>   nvme/core: add mdev interfaces
>   nvme/core: add nvme-mdev core driver
>   nvme/pci: implement the mdev external queue allocation interface
>
>  MAINTAINERS                   |   5 +
>  drivers/nvme/Kconfig          |   1 +
>  drivers/nvme/Makefile         |   1 +
>  drivers/nvme/host/core.c      | 149 +++++-
>  drivers/nvme/host/nvme.h      |  55 ++-
>  drivers/nvme/host/pci.c       | 385 ++++++++++++++-
>  drivers/nvme/mdev/Kconfig     |  16 +
>  drivers/nvme/mdev/Makefile    |   5 +
>  drivers/nvme/mdev/adm.c       | 873 ++++++++++++++++++++++++++++++++++
>  drivers/nvme/mdev/events.c    | 142 ++++++
>  drivers/nvme/mdev/host.c      | 491 +++++++++++++++++++
>  drivers/nvme/mdev/instance.c  | 802 +++++++++++++++++++++++++++++++
>  drivers/nvme/mdev/io.c        | 563 ++++++++++++++++++++++
>  drivers/nvme/mdev/irq.c       | 264 ++++++++++
>  drivers/nvme/mdev/mdev.h      |  56 +++
>  drivers/nvme/mdev/mmio.c      | 591 +++++++++++++++++++++++
>  drivers/nvme/mdev/pci.c       | 247 ++++++++++
>  drivers/nvme/mdev/priv.h      | 700 +++++++++++++++++++++++++++
>  drivers/nvme/mdev/udata.c     | 390 +++++++++++++++
>  drivers/nvme/mdev/vcq.c       | 207 ++++++++
>  drivers/nvme/mdev/vctrl.c     | 514 ++++++++++++++++++++
>  drivers/nvme/mdev/viommu.c    | 322 +++++++++++++
>  drivers/nvme/mdev/vns.c       | 356 ++++++++++++++
>  drivers/nvme/mdev/vsq.c       | 178 +++++++
>  drivers/vfio/mdev/vfio_mdev.c |  11 +
>  include/linux/mdev.h          |   4 +
>  include/linux/nvme.h          |  88 +++-
>  27 files changed, 7375 insertions(+), 41 deletions(-)
>  create mode 100644 drivers/nvme/mdev/Kconfig
>  create mode 100644 drivers/nvme/mdev/Makefile
>  create mode 100644 drivers/nvme/mdev/adm.c
>  create mode 100644 drivers/nvme/mdev/events.c
>  create mode 100644 drivers/nvme/mdev/host.c
>  create mode 100644 drivers/nvme/mdev/instance.c
>  create mode 100644 drivers/nvme/mdev/io.c
>  create mode 100644 drivers/nvme/mdev/irq.c
>  create mode 100644 drivers/nvme/mdev/mdev.h
>  create mode 100644 drivers/nvme/mdev/mmio.c
>  create mode 100644 drivers/nvme/mdev/pci.c
>  create mode 100644 drivers/nvme/mdev/priv.h
>  create mode 100644 drivers/nvme/mdev/udata.c
>  create mode 100644 drivers/nvme/mdev/vcq.c
>  create mode 100644 drivers/nvme/mdev/vctrl.c
>  create mode 100644 drivers/nvme/mdev/viommu.c
>  create mode 100644 drivers/nvme/mdev/vns.c
>  create mode 100644 drivers/nvme/mdev/vsq.c
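The sysfs steps from the configuration example in the letter could be wrapped in a small helper along the following lines. This is only a sketch: the sysfs paths and the nvme-2Q_V1 type name are copied from the letter, while the function names, the argument order and the DRY_RUN switch are assumptions made for this illustration. It also assumes the nvme and nvme-mdev modules have already been loaded as shown in the letter.

```shell
#!/bin/sh
# Sketch only: wraps the sysfs writes quoted in the cover letter.
# write_sysfs, create_nvme_mdev and DRY_RUN are hypothetical names;
# only the sysfs paths and the nvme-2Q_V1 type come from the letter.

# write_sysfs VALUE PATH
# Writes VALUE to PATH, or only prints the equivalent command when
# DRY_RUN=1, so the sequence can be reviewed without touching sysfs.
write_sysfs() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'echo %s > %s\n' "$1" "$2"
    else
        printf '%s\n' "$1" > "$2"
    fi
}

# create_nvme_mdev PCI_ADDR UUID NAMESPACE CPU
# Creates the mediated device, attaches one host namespace/partition,
# pins the IO thread, and prints the matching qemu -device option.
create_nvme_mdev() {
    pci_addr=$1
    uuid=$2
    ns=$3
    cpu=$4
    mdev=/sys/bus/mdev/devices/$uuid

    write_sysfs "$uuid" \
        "/sys/bus/pci/devices/$pci_addr/mdev_supported_types/nvme-2Q_V1/create" &&
    write_sysfs "$ns" "$mdev/namespaces/add_namespace" &&
    write_sysfs "$cpu" "$mdev/settings/iothread_cpu" &&
    printf '%s\n' "-device vfio-pci,sysfsdev=$mdev"
}

# Example (the PCI address is a placeholder, not taken from the letter):
#   DRY_RUN=1
#   create_nvme_mdev 0000:01:00.0 "$(uuidgen)" n1p3 11
```

With DRY_RUN=1 the helper prints the three echo commands and the qemu option instead of writing anything, which makes it easy to check the sequence before running it as root on a host that has the patched modules.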