On Tue, Apr 10, 2018 at 10:52:52AM +0800, Jason Wang wrote:
> On 2018-04-02 23:23, Tiwei Bie wrote:
> > This patch introduces an mdev (mediated device) based hardware
> > vhost backend. This backend is an abstraction of the various
> > hardware vhost accelerators (potentially any device that uses
> > virtio ring can be used as a vhost accelerator). Some generic
> > mdev parent ops are provided for accelerator drivers to support
> > generating mdev instances.
> >
> > What's this
> > ===========
> >
> > The idea is that we can set up a virtio ring compatible device
> > with the messages available at the vhost-backend. Originally,
> > these messages are used to implement a software vhost backend,
> > but now we will use these messages to set up a virtio ring
> > compatible hardware device. Then the hardware device will be
> > able to work with the guest virtio driver in the VM just like
> > what the software backend does. That is to say, we can implement
> > a hardware based vhost backend in QEMU, and any virtio ring
> > compatible devices potentially can be used with this backend.
> > (We also call it vDPA -- vhost Data Path Acceleration).
> >
> > One problem is that different virtio ring compatible devices
> > may have different device interfaces. That is to say, we will
> > need different drivers in QEMU. It could be troublesome. And
> > that's what this patch is trying to fix. The idea behind this
> > patch is very simple: mdev is a standard way to emulate a device
> > in the kernel.
>
> So you just move the abstraction layer from qemu to kernel, and you still
> need different drivers in kernel for different device interfaces of
> accelerators. This looks even more complex than leaving it in qemu. As you
> said, another idea is to implement userspace vhost backend for accelerators
> which seems easier and could co-work with other parts of qemu without
> inventing a new type of messages.

I'm not quite sure.
Do you think it's acceptable to add various vendor specific
hardware drivers in QEMU?

>
> This needs careful thought to find the best solution.

Yeah, definitely! :)

And your opinions would be very helpful!

>
> > So we defined a standard device based on mdev, which
> > is able to accept vhost messages. When the mdev emulation code
> > (i.e. the generic mdev parent ops provided by this patch) gets
> > vhost messages, it will parse and deliver them to accelerator
> > drivers. Drivers can use these messages to set up accelerators.
> >
> > That is to say, the generic mdev parent ops (e.g. read()/write()/
> > ioctl()/...) will be provided for accelerator drivers to register
> > accelerators as mdev parent devices. And each accelerator device
> > will support generating standard mdev instance(s).
> >
> > With this standard device interface, we will be able to just
> > develop one userspace driver to implement the hardware based
> > vhost backend in QEMU.
> >
> > Difference between vDPA and PCI passthru
> > ========================================
> >
> > The key difference between vDPA and PCI passthru is that in
> > vDPA only the data path of the device (e.g. DMA ring, notify
> > region and queue interrupt) is passed through to the VM; the
> > device control path (e.g. PCI configuration space and MMIO
> > regions) is still defined and emulated by QEMU.
> >
> > The benefits of keeping virtio device emulation in QEMU compared
> > with virtio device PCI passthru include (but are not limited to):
> >
> > - consistent device interface for guest OS in the VM;
> > - max flexibility on the hardware design, especially the
> >   accelerator for each vhost backend doesn't have to be a
> >   full PCI device;
> > - leveraging the existing virtio live-migration framework;
> >
> > The interface of this mdev based device
> > =======================================
> >
> > 1. BAR0
> >
> > The MMIO region described by BAR0 is the main control
> > interface.
> > Messages will be written to or read from
> > this region.
> >
> > The message type is determined by the `request` field
> > in the message header. The message size is encoded in the
> > message header too. The message format looks like this:
> >
> > struct vhost_vfio_op {
> > 	__u64 request;
> > 	__u32 flags;
> > 	/* Flag values: */
> > #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
> > 	__u32 size;
> > 	union {
> > 		__u64 u64;
> > 		struct vhost_vring_state state;
> > 		struct vhost_vring_addr addr;
> > 		struct vhost_memory memory;
> > 	} payload;
> > };
> >
> > The existing vhost-kernel ioctl cmds are reused as
> > the message requests in the above structure.
> >
> > Each message will be written to or read from this
> > region at offset 0:
> >
> > int vhost_vfio_write(struct vhost_dev *dev, struct vhost_vfio_op *op)
> > {
> > 	int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
> > 	struct vhost_vfio *vfio = dev->opaque;
> > 	int ret;
> >
> > 	ret = pwrite64(vfio->device_fd, op, count, vfio->bar0_offset);
> > 	if (ret != count)
> > 		return -1;
> >
> > 	return 0;
> > }
> >
> > int vhost_vfio_read(struct vhost_dev *dev, struct vhost_vfio_op *op)
> > {
> > 	int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
> > 	struct vhost_vfio *vfio = dev->opaque;
> > 	uint64_t request = op->request;
> > 	int ret;
> >
> > 	ret = pread64(vfio->device_fd, op, count, vfio->bar0_offset);
> > 	if (ret != count || request != op->request)
> > 		return -1;
> >
> > 	return 0;
> > }
> >
> > It's quite straightforward to set things to the device.
> > We just need to write the message to the device directly:
> >
> > int vhost_vfio_set_features(struct vhost_dev *dev, uint64_t features)
> > {
> > 	struct vhost_vfio_op op;
> >
> > 	op.request = VHOST_SET_FEATURES;
> > 	op.flags = 0;
> > 	op.size = sizeof(features);
> > 	op.payload.u64 = features;
> >
> > 	return vhost_vfio_write(dev, &op);
> > }
> >
> > To get things from the device, two steps are needed.
> > Take VHOST_GET_FEATURES as an example:
> >
> > int vhost_vfio_get_features(struct vhost_dev *dev, uint64_t *features)
> > {
> > 	struct vhost_vfio_op op;
> > 	int ret;
> >
> > 	op.request = VHOST_GET_FEATURES;
> > 	op.flags = VHOST_VFIO_NEED_REPLY;
> > 	op.size = 0;
> >
> > 	/* Just need to write the header */
> > 	ret = vhost_vfio_write(dev, &op);
> > 	if (ret != 0)
> > 		goto out;
> >
> > 	/* `op` wasn't changed during write */
> > 	op.flags = 0;
> > 	op.size = sizeof(*features);
> >
> > 	ret = vhost_vfio_read(dev, &op);
> > 	if (ret != 0)
> > 		goto out;
> >
> > 	*features = op.payload.u64;
> > out:
> > 	return ret;
> > }
> >
> > 2. BAR1 (mmap-able)
> >
> > The MMIO region described by BAR1 will be used to notify the
> > device.
> >
> > Each queue will have a page for notification, which can be
> > mapped to the VM (if the hardware also supports it), so the
> > virtio driver in the VM will be able to notify the device directly.
> >
> > The MMIO region described by BAR1 is also write-able. If the
> > accelerator's notification register(s) cannot be mapped to the
> > VM, write() can also be used to notify the device. Something
> > like this:
> >
> > void notify_relay(void *opaque)
> > {
> > 	......
> > 	offset = 0x1000 * queue_idx; /* XXX assume page size is 4K here. */
> >
> > 	ret = pwrite64(vfio->device_fd, &queue_idx, sizeof(queue_idx),
> > 			vfio->bar1_offset + offset);
> > 	......
> > }
> >
> > Other BARs are reserved.
> >
> > 3. VFIO interrupt ioctl API
> >
> > The VFIO interrupt ioctl API is used to set up device interrupts.
> > IRQ-bypass will also be supported.
> >
> > Currently, only VFIO_PCI_MSIX_IRQ_INDEX is supported.
> >
> > The API for drivers to provide mdev instances
> > =============================================
> >
> > The read()/write()/ioctl()/mmap()/open()/release() mdev
> > parent ops have been provided for accelerators' drivers
> > to provide mdev instances.
> >
> > ssize_t vdpa_read(struct mdev_device *mdev, char __user *buf,
> > 		  size_t count, loff_t *ppos);
> > ssize_t vdpa_write(struct mdev_device *mdev, const char __user *buf,
> > 		   size_t count, loff_t *ppos);
> > long vdpa_ioctl(struct mdev_device *mdev, unsigned int cmd, unsigned long arg);
> > int vdpa_mmap(struct mdev_device *mdev, struct vm_area_struct *vma);
> > int vdpa_open(struct mdev_device *mdev);
> > void vdpa_close(struct mdev_device *mdev);
> >
> > Each accelerator driver just needs to implement its own
> > create()/remove() ops, and provide vdpa device ops,
> > which will be called by the generic mdev emulation code.
> >
> > Currently, the vdpa device ops are defined as:
> >
> > typedef int (*vdpa_start_device_t)(struct vdpa_dev *vdpa);
> > typedef int (*vdpa_stop_device_t)(struct vdpa_dev *vdpa);
> > typedef int (*vdpa_dma_map_t)(struct vdpa_dev *vdpa);
> > typedef int (*vdpa_dma_unmap_t)(struct vdpa_dev *vdpa);
> > typedef int (*vdpa_set_eventfd_t)(struct vdpa_dev *vdpa, int vector, int fd);
> > typedef u64 (*vdpa_supported_features_t)(struct vdpa_dev *vdpa);
> > typedef void (*vdpa_notify_device_t)(struct vdpa_dev *vdpa, int qid);
> > typedef u64 (*vdpa_get_notify_addr_t)(struct vdpa_dev *vdpa, int qid);
> >
> > struct vdpa_device_ops {
> > 	vdpa_start_device_t		start;
> > 	vdpa_stop_device_t		stop;
> > 	vdpa_dma_map_t			dma_map;
> > 	vdpa_dma_unmap_t		dma_unmap;
> > 	vdpa_set_eventfd_t		set_eventfd;
> > 	vdpa_supported_features_t	supported_features;
> > 	vdpa_notify_device_t		notify;
> > 	vdpa_get_notify_addr_t		get_notify_addr;
> > };
> >
> > struct vdpa_dev {
> > 	struct mdev_device *mdev;
> > 	struct mutex ops_lock;
> > 	u8 vconfig[VDPA_CONFIG_SIZE];
> > 	int nr_vring;
> > 	u64 features;
> > 	u64 state;
> > 	struct vhost_memory *mem_table;
> > 	bool pending_reply;
> > 	struct vhost_vfio_op pending;
> > 	const struct vdpa_device_ops *ops;
> > 	void *private;
> > 	int max_vrings;
> > 	struct vdpa_vring_info vring_info[0];
> > };
> >
> > struct vdpa_dev *vdpa_alloc(struct mdev_device *mdev, void *private,
> > 			    int max_vrings);
> > void vdpa_free(struct vdpa_dev *vdpa);
> >
> > A simple example
> > ================
> >
> > # Query the number of available mdev instances
> > $ cat /sys/class/mdev_bus/0000:06:00.2/mdev_supported_types/ifcvf_vdpa-vdpa_virtio/available_instances
> >
> > # Create an mdev instance
> > $ echo $UUID > /sys/class/mdev_bus/0000:06:00.2/mdev_supported_types/ifcvf_vdpa-vdpa_virtio/create
> >
> > # Launch QEMU with a virtio-net device
> > $ qemu \
> > 	...... \
> > 	-netdev type=vhost-vfio,sysfsdev=/sys/bus/mdev/devices/$UUID,id=$ID \
> > 	-device virtio-net-pci,netdev=$ID
> >
> > -------- END --------
> >
> > Most of the above text will be refined and moved to a doc in
> > the formal patch. In this RFC, all introductions and code
> > are gathered in this patch; the idea is to make it easier
> > to find all the relevant information. Anyone who wants to
> > comment could use inline comments and just keep the relevant
> > parts. Sorry for the big RFC patch..
> >
> > This patch is just an RFC for now, and some things are still
> > missing or need to be refined. But it's never too early
> > to hear the thoughts from the community. So any comments
> > would be appreciated! Thanks! :-)
>
> I don't see vhost_vfio_write() and other above functions in the patch. Looks
> like some part of the patch is missing, it would be better to post a complete
> series with an example driver (vDPA) to get a full picture.

No problem. We will send out the QEMU changes soon!

Thanks!

>
> Thanks
> [...]
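
Until those changes are posted, the framing of one BAR0 message from
the quoted description can be sketched on its own. This is only a
minimal sketch and not code from the patch: the sketch_* names are
hypothetical, the payload union is cut down to its u64 member, and
treating the header size as offsetof(payload) is an assumption about
what VHOST_VFIO_OP_HDR_SIZE means.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical, cut-down copy of the quoted message layout. The real
 * payload union also carries the vring state, vring address and
 * memory table structures.
 */
struct sketch_op {
	uint64_t request;
	uint32_t flags;
	uint32_t size;		/* number of payload bytes */
	union {
		uint64_t u64;
	} payload;
};

#define SKETCH_NEED_REPLY	0x1

/* Assumption: the header is everything in front of the payload. */
#define SKETCH_OP_HDR_SIZE	offsetof(struct sketch_op, payload)

/* Bytes to transfer at offset 0 of BAR0 for this message. */
static size_t sketch_op_len(const struct sketch_op *op)
{
	return SKETCH_OP_HDR_SIZE + op->size;
}

/* Step 1 of a "get": a header-only request that asks for a reply. */
static void sketch_prepare_get(struct sketch_op *op, uint64_t request)
{
	assert(op != NULL);
	memset(op, 0, sizeof(*op));
	op->request = request;
	op->flags = SKETCH_NEED_REPLY;
}

/* Step 2 of a "get": same request, now sized to receive a u64 reply. */
static void sketch_prepare_reply_u64(struct sketch_op *op)
{
	assert(op != NULL);
	op->flags = 0;
	op->size = sizeof(op->payload.u64);
}

/* A "set" carries its u64 value in the same message, no reply needed. */
static void sketch_prepare_set_u64(struct sketch_op *op, uint64_t request,
				   uint64_t value)
{
	assert(op != NULL);
	memset(op, 0, sizeof(*op));
	op->request = request;
	op->size = sizeof(value);
	op->payload.u64 = value;
}
```

Under these assumptions, a "get" is a header-only write followed by a
read of header plus payload, while a "set" is a single write of header
plus payload. The length passed to pwrite64()/pread64() is whatever
sketch_op_len() returns for the prepared message.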