On Thu, Dec 24, 2020 at 11:01 AM Jason Wang <jasowang@xxxxxxxxxx> wrote: > > > On 2020/12/23 10:17 PM, Yongji Xie wrote: > > On Wed, Dec 23, 2020 at 4:08 PM Jason Wang <jasowang@xxxxxxxxxx> wrote: > >> > >> On 2020/12/22 10:52 PM, Xie Yongji wrote: > >>> This VDUSE driver enables implementing vDPA devices in userspace. > >>> Both the control path and the data path of vDPA devices can be > >>> handled in userspace. > >>> > >>> In the control path, the VDUSE driver makes use of a message > >>> mechanism to forward the config operations from the vdpa bus driver > >>> to userspace. Userspace can use read()/write() to receive/reply to > >>> those control messages. > >>> > >>> In the data path, the VDUSE driver implements an MMU-based on-chip > >>> IOMMU driver which supports dynamically mapping kernel dma buffers > >>> into a userspace iova region. Userspace can access those > >>> iova regions via mmap(). Besides, the eventfd mechanism is used to > >>> trigger interrupt callbacks and receive virtqueue kicks in userspace. > >>> > >>> For now, only the virtio-vdpa bus driver is supported with this patch applied. > >>> > >>> Signed-off-by: Xie Yongji <xieyongji@xxxxxxxxxxxxx> > >>> --- > >>> Documentation/driver-api/vduse.rst | 74 ++ > >>> Documentation/userspace-api/ioctl/ioctl-number.rst | 1 + > >>> drivers/vdpa/Kconfig | 8 + > >>> drivers/vdpa/Makefile | 1 + > >>> drivers/vdpa/vdpa_user/Makefile | 5 + > >>> drivers/vdpa/vdpa_user/eventfd.c | 221 ++++ > >>> drivers/vdpa/vdpa_user/eventfd.h | 48 + > >>> drivers/vdpa/vdpa_user/iova_domain.c | 442 ++++++++ > >>> drivers/vdpa/vdpa_user/iova_domain.h | 93 ++ > >>> drivers/vdpa/vdpa_user/vduse.h | 59 ++ > >>> drivers/vdpa/vdpa_user/vduse_dev.c | 1121 ++++++++++++++++++++ > >>> include/uapi/linux/vdpa.h | 1 + > >>> include/uapi/linux/vduse.h | 99 ++ > >>> 13 files changed, 2173 insertions(+) > >>> create mode 100644 Documentation/driver-api/vduse.rst > >>> create mode 100644 drivers/vdpa/vdpa_user/Makefile > >>> create mode 100644 drivers/vdpa/vdpa_user/eventfd.c > >>> create mode 100644 drivers/vdpa/vdpa_user/eventfd.h > >>> create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c > >>> create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h > >>> create mode 100644 drivers/vdpa/vdpa_user/vduse.h > >>> create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c > >>> create mode 100644 include/uapi/linux/vduse.h > >>> > >>> diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst > >>> new file mode 100644 > >>> index 000000000000..da9b3040f20a > >>> --- /dev/null > >>> +++ b/Documentation/driver-api/vduse.rst > >>> @@ -0,0 +1,74 @@ > >>> +================================== > >>> +VDUSE - "vDPA Device in Userspace" > >>> +================================== > >>> + > >>> +A vDPA (virtio data path acceleration) device is a device whose > >>> +datapath complies with the virtio specification, but whose control > >>> +path is vendor specific. vDPA devices can either be physically located on > >>> +the hardware or emulated by software. VDUSE is a framework that makes it > >>> +possible to implement software-emulated vDPA devices in userspace. > >>> + > >>> +How VDUSE works > >>> +--------------- > >>> +Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on > >>> +the VDUSE character device (/dev/vduse). Then a file descriptor pointing > >>> +to the new resources will be returned, which can be used to implement the > >>> +userspace vDPA device's control path and data path.
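As a concrete illustration of this creation flow, a minimal, untested userspace sketch follows. It is not part of the patch: only /dev/vduse and VDUSE_CREATE_DEV come from the documentation above, and the struct vduse_dev_config layout shown is an assumption about the uapi header.

    #include <fcntl.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/vduse.h>        /* added by this patch */

    int create_vduse_device(void)
    {
            struct vduse_dev_config config; /* hypothetical layout */
            int ctrl_fd, dev_fd;

            ctrl_fd = open("/dev/vduse", O_RDWR);
            if (ctrl_fd < 0)
                    return -1;

            memset(&config, 0, sizeof(config));
            /* Fill in name, device id, vq number, etc. per the uapi header. */

            /* On success the ioctl returns a new fd which is then used for
             * both the control path (read/write) and the data path (mmap). */
            dev_fd = ioctl(ctrl_fd, VDUSE_CREATE_DEV, &config);
            close(ctrl_fd);
            return dev_fd;
    }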
> >>> + > >>> +To implement the control path, read/write operations on the file descriptor > >>> +are used to receive/reply to the control messages from/to the VDUSE driver. > >>> +Those control messages are based on the vdpa_config_ops which defines a > >>> +unified interface to control different types of vDPA devices. > >>> + > >>> +The following types of messages are currently provided by the VDUSE framework: > >>> + > >>> +- VDUSE_SET_VQ_ADDR: Set the addresses of the different parts of the virtqueue > >>> + (descriptor table, available ring and used ring) > >>> + > >>> +- VDUSE_SET_VQ_NUM: Set the size of the virtqueue > >>> + > >>> +- VDUSE_SET_VQ_READY: Set the ready status of the virtqueue > >>> + > >>> +- VDUSE_GET_VQ_READY: Get the ready status of the virtqueue > >>> + > >>> +- VDUSE_SET_FEATURES: Set the virtio features supported by the driver > >>> + > >>> +- VDUSE_GET_FEATURES: Get the virtio features supported by the device > >>> + > >>> +- VDUSE_SET_STATUS: Set the device status > >>> + > >>> +- VDUSE_GET_STATUS: Get the device status > >>> + > >>> +- VDUSE_SET_CONFIG: Write to the device-specific configuration space > >>> + > >>> +- VDUSE_GET_CONFIG: Read from the device-specific configuration space > >>> + > >>> +Please see include/linux/vdpa.h for details. > >>> + > >>> +In the data path, the VDUSE framework implements an MMU-based on-chip IOMMU > >>> +driver which supports dynamically mapping kernel dma buffers into a userspace > >>> +iova region. The userspace iova region can be created by passing > >>> +the userspace vDPA device fd to mmap(2). > >>> + > >>> +Besides, the eventfd mechanism is used to trigger interrupt callbacks and > >>> +receive virtqueue kicks in userspace. The following ioctls on the userspace > >>> +vDPA device fd are provided to support that: > >>> + > >>> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for a virtqueue; this eventfd is used > >>> + by the VDUSE driver to notify userspace to consume the vring. > >>> + > >>> +- VDUSE_VQ_SETUP_IRQFD: set the irqfd for a virtqueue; this eventfd is used > >>> + by userspace to notify the VDUSE driver to trigger interrupt callbacks. > >>> + > >>> +MMU-based IOMMU Driver > >>> +---------------------- > >>> +The basic idea behind this IOMMU driver is to treat the MMU (VA->PA) as an > >>> +IOMMU (IOVA->PA). The driver sets up an MMU mapping instead of an IOMMU mapping > >>> +for DMA transfers so that the userspace process is able to use its virtual > >>> +address to access the dma buffer in the kernel. > >>> + > >>> +To avoid security issues, a bounce-buffering mechanism is introduced to > >>> +prevent userspace from directly accessing the original buffer, which may > >>> +contain other kernel data. During mapping and unmapping, the driver copies > >>> +the data from the original buffer to the bounce buffer and back, depending on > >>> +the direction of the transfer. The bounce-buffer addresses are mapped into > >>> +the user address space instead of the original ones. > >>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst > >>> index a4c75a28c839..71722e6f8f23 100644 > >>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst > >>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst > >>> @@ -300,6 +300,7 @@ Code Seq# Include File Comments > >>> 'z' 10-4F drivers/s390/crypto/zcrypt_api.h conflict!
'|' 00-7F linux/media.h > >>> 0x80 00-1F linux/fb.h > >>> +0x81 00-1F linux/vduse.h > >>> 0x89 00-06 arch/x86/include/asm/sockios.h > >>> 0x89 0B-DF linux/sockios.h > >>> 0x89 E0-EF linux/sockios.h SIOCPROTOPRIVATE range > >>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig > >>> index 4be7be39be26..211cc449cbd3 100644 > >>> --- a/drivers/vdpa/Kconfig > >>> +++ b/drivers/vdpa/Kconfig > >>> @@ -21,6 +21,14 @@ config VDPA_SIM > >>> to RX. This device is used for testing, prototyping and > >>> development of vDPA. > >>> > >>> +config VDPA_USER > >>> + tristate "VDUSE (vDPA Device in Userspace) support" > >>> + depends on EVENTFD && MMU && HAS_DMA > >>> + default n > >> > >> The "default n" is not necessary. > >> > > OK. > >>> + help > >>> + With VDUSE it is possible to emulate a vDPA Device > >>> + in a userspace program. > >>> + > >>> config IFCVF > >>> tristate "Intel IFC VF vDPA driver" > >>> depends on PCI_MSI > >>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile > >>> index d160e9b63a66..66e97778ad03 100644 > >>> --- a/drivers/vdpa/Makefile > >>> +++ b/drivers/vdpa/Makefile > >>> @@ -1,5 +1,6 @@ > >>> # SPDX-License-Identifier: GPL-2.0 > >>> obj-$(CONFIG_VDPA) += vdpa.o > >>> obj-$(CONFIG_VDPA_SIM) += vdpa_sim/ > >>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/ > >>> obj-$(CONFIG_IFCVF) += ifcvf/ > >>> obj-$(CONFIG_MLX5_VDPA) += mlx5/ > >>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile > >>> new file mode 100644 > >>> index 000000000000..b7645e36992b > >>> --- /dev/null > >>> +++ b/drivers/vdpa/vdpa_user/Makefile > >>> @@ -0,0 +1,5 @@ > >>> +# SPDX-License-Identifier: GPL-2.0 > >>> + > >>> +vduse-y := vduse_dev.o iova_domain.o eventfd.o > >> > >> Do we really need eventfd.o here considering we've selected it? > >> > > Do you mean the file "drivers/vdpa/vdpa_user/eventfd.c"? > > > My bad, I confused this with the common eventfd. So the code is fine here. > > > > > >>> + > >>> +obj-$(CONFIG_VDPA_USER) += vduse.o > >>> diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c > >>> new file mode 100644 > >>> index 000000000000..dbffddb08908 > >>> --- /dev/null > >>> +++ b/drivers/vdpa/vdpa_user/eventfd.c > >>> @@ -0,0 +1,221 @@ > >>> +// SPDX-License-Identifier: GPL-2.0-only > >>> +/* > >>> + * Eventfd support for VDUSE > >>> + * > >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
> >>> + * > >>> + * Author: Xie Yongji <xieyongji@xxxxxxxxxxxxx> > >>> + * > >>> + */ > >>> + > >>> +#include <linux/eventfd.h> > >>> +#include <linux/poll.h> > >>> +#include <linux/wait.h> > >>> +#include <linux/slab.h> > >>> +#include <linux/file.h> > >>> +#include <uapi/linux/vduse.h> > >>> + > >>> +#include "eventfd.h" > >>> + > >>> +static struct workqueue_struct *vduse_irqfd_cleanup_wq; > >>> + > >>> +static void vduse_virqfd_shutdown(struct work_struct *work) > >>> +{ > >>> + u64 cnt; > >>> + struct vduse_virqfd *virqfd = container_of(work, > >>> + struct vduse_virqfd, shutdown); > >>> + > >>> + eventfd_ctx_remove_wait_queue(virqfd->ctx, &virqfd->wait, &cnt); > >>> + flush_work(&virqfd->inject); > >>> + eventfd_ctx_put(virqfd->ctx); > >>> + kfree(virqfd); > >>> +} > >>> + > >>> +static void vduse_virqfd_inject(struct work_struct *work) > >>> +{ > >>> + struct vduse_virqfd *virqfd = container_of(work, > >>> + struct vduse_virqfd, inject); > >>> + struct vduse_virtqueue *vq = virqfd->vq; > >>> + > >>> + spin_lock_irq(&vq->irq_lock); > >>> + if (vq->ready && vq->cb) > >>> + vq->cb(vq->private); > >>> + spin_unlock_irq(&vq->irq_lock); > >>> +} > >>> + > >>> +static void virqfd_deactivate(struct vduse_virqfd *virqfd) > >>> +{ > >>> + queue_work(vduse_irqfd_cleanup_wq, &virqfd->shutdown); > >>> +} > >>> + > >>> +static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode, > >>> + int sync, void *key) > >>> +{ > >>> + struct vduse_virqfd *virqfd = container_of(wait, struct vduse_virqfd, wait); > >>> + struct vduse_virtqueue *vq = virqfd->vq; > >>> + > >>> + __poll_t flags = key_to_poll(key); > >>> + > >>> + if (flags & EPOLLIN) > >>> + schedule_work(&virqfd->inject); > >>> + > >>> + if (flags & EPOLLHUP) { > >>> + spin_lock(&vq->irq_lock); > >>> + if (vq->virqfd == virqfd) { > >>> + vq->virqfd = NULL; > >>> + virqfd_deactivate(virqfd); > >>> + } > >>> + spin_unlock(&vq->irq_lock); > >>> + } > >>> + > >>> + return 0; > >>> +} > >>> + > >>> +static void vduse_virqfd_ptable_queue_proc(struct file *file, > >>> + wait_queue_head_t *wqh, poll_table *pt) > >>> +{ > >>> + struct vduse_virqfd *virqfd = container_of(pt, struct vduse_virqfd, pt); > >>> + > >>> + add_wait_queue(wqh, &virqfd->wait); > >>> +} > >>> + > >>> +int vduse_virqfd_setup(struct vduse_dev *dev, > >>> + struct vduse_vq_eventfd *eventfd) > >>> +{ > >>> + struct vduse_virqfd *virqfd; > >>> + struct fd irqfd; > >>> + struct eventfd_ctx *ctx; > >>> + struct vduse_virtqueue *vq; > >>> + __poll_t events; > >>> + int ret; > >>> + > >>> + if (eventfd->index >= dev->vq_num) > >>> + return -EINVAL; > >>> + > >>> + vq = &dev->vqs[eventfd->index]; > >>> + virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL); > >>> + if (!virqfd) > >>> + return -ENOMEM; > >>> + > >>> + INIT_WORK(&virqfd->shutdown, vduse_virqfd_shutdown); > >>> + INIT_WORK(&virqfd->inject, vduse_virqfd_inject); > >> > >> Any reason that a workqueue is a must here? > >> > > Mainly for performance considerations. It makes sure the push() and pop() > > for the used vring can be asynchronous. > > > I see.
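For completeness, the userspace half of this eventfd plumbing would look roughly like the untested sketch below. The vduse_vq_eventfd fields and the two ioctl names come from this patch; the rest (including the omitted error cleanup) is illustrative.

    #include <stdint.h>
    #include <sys/eventfd.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/vduse.h>

    static int setup_vq_eventfds(int dev_fd, uint32_t vq_index,
                                 int *kick_fd, int *irq_fd)
    {
            struct vduse_vq_eventfd eventfd;

            /* Kick direction: kernel signals, userspace polls and processes. */
            *kick_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
            /* Irq direction: userspace signals, kernel runs the vq callback. */
            *irq_fd = eventfd(0, EFD_CLOEXEC);
            if (*kick_fd < 0 || *irq_fd < 0)
                    return -1;

            eventfd.index = vq_index;
            eventfd.fd = *kick_fd;
            if (ioctl(dev_fd, VDUSE_VQ_SETUP_KICKFD, &eventfd))
                    return -1;

            eventfd.fd = *irq_fd;
            return ioctl(dev_fd, VDUSE_VQ_SETUP_IRQFD, &eventfd);
    }

    /* After pushing used descriptors, userspace signals the irqfd; the
     * wakeup handler above then schedules the inject work, which invokes
     * vq->cb() outside of the eventfd wakeup context. */
    static void notify_used(int irq_fd)
    {
            uint64_t one = 1;

            (void)write(irq_fd, &one, sizeof(one));
    }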
> > > > > >>> + > >>> + ret = -EBADF; > >>> + irqfd = fdget(eventfd->fd); > >>> + if (!irqfd.file) > >>> + goto err_fd; > >>> + > >>> + ctx = eventfd_ctx_fileget(irqfd.file); > >>> + if (IS_ERR(ctx)) { > >>> + ret = PTR_ERR(ctx); > >>> + goto err_ctx; > >>> + } > >>> + > >>> + virqfd->vq = vq; > >>> + virqfd->ctx = ctx; > >>> + spin_lock(&vq->irq_lock); > >>> + if (vq->virqfd) > >>> + virqfd_deactivate(virqfd); > >>> + vq->virqfd = virqfd; > >>> + spin_unlock(&vq->irq_lock); > >>> + > >>> + init_waitqueue_func_entry(&virqfd->wait, vduse_virqfd_wakeup); > >>> + init_poll_funcptr(&virqfd->pt, vduse_virqfd_ptable_queue_proc); > >>> + > >>> + events = vfs_poll(irqfd.file, &virqfd->pt); > >>> + > >>> + /* > >>> + * Check if there was an event already pending on the eventfd > >>> + * before we registered and trigger it as if we didn't miss it. > >>> + */ > >>> + if (events & EPOLLIN) > >>> + schedule_work(&virqfd->inject); > >>> + > >>> + fdput(irqfd); > >>> + > >>> + return 0; > >>> +err_ctx: > >>> + fdput(irqfd); > >>> +err_fd: > >>> + kfree(virqfd); > >>> + return ret; > >>> +} > >>> + > >>> +void vduse_virqfd_release(struct vduse_dev *dev) > >>> +{ > >>> + int i; > >>> + > >>> + for (i = 0; i < dev->vq_num; i++) { > >>> + struct vduse_virtqueue *vq = &dev->vqs[i]; > >>> + > >>> + spin_lock(&vq->irq_lock); > >>> + if (vq->virqfd) { > >>> + virqfd_deactivate(vq->virqfd); > >>> + vq->virqfd = NULL; > >>> + } > >>> + spin_unlock(&vq->irq_lock); > >>> + } > >>> + flush_workqueue(vduse_irqfd_cleanup_wq); > >>> +} > >>> + > >>> +int vduse_virqfd_init(void) > >>> +{ > >>> + vduse_irqfd_cleanup_wq = alloc_workqueue("vduse-irqfd-cleanup", > >>> + WQ_UNBOUND, 0); > >>> + if (!vduse_irqfd_cleanup_wq) > >>> + return -ENOMEM; > >>> + > >>> + return 0; > >>> +} > >>> + > >>> +void vduse_virqfd_exit(void) > >>> +{ > >>> + destroy_workqueue(vduse_irqfd_cleanup_wq); > >>> +} > >>> + > >>> +void vduse_vq_kick(struct vduse_virtqueue *vq) > >>> +{ > >>> + spin_lock(&vq->kick_lock); > >>> + if (vq->ready && vq->kickfd) > >>> + eventfd_signal(vq->kickfd, 1); > >>> + spin_unlock(&vq->kick_lock); > >>> +} > >>> + > >>> +int vduse_kickfd_setup(struct vduse_dev *dev, > >>> + struct vduse_vq_eventfd *eventfd) > >>> +{ > >>> + struct eventfd_ctx *ctx; > >>> + struct vduse_virtqueue *vq; > >>> + > >>> + if (eventfd->index >= dev->vq_num) > >>> + return -EINVAL; > >>> + > >>> + vq = &dev->vqs[eventfd->index]; > >>> + ctx = eventfd_ctx_fdget(eventfd->fd); > >>> + if (IS_ERR(ctx)) > >>> + return PTR_ERR(ctx); > >>> + > >>> + spin_lock(&vq->kick_lock); > >>> + if (vq->kickfd) > >>> + eventfd_ctx_put(vq->kickfd); > >>> + vq->kickfd = ctx; > >>> + spin_unlock(&vq->kick_lock); > >>> + > >>> + return 0; > >>> +} > >>> + > >>> +void vduse_kickfd_release(struct vduse_dev *dev) > >>> +{ > >>> + int i; > >>> + > >>> + for (i = 0; i < dev->vq_num; i++) { > >>> + struct vduse_virtqueue *vq = &dev->vqs[i]; > >>> + > >>> + spin_lock(&vq->kick_lock); > >>> + if (vq->kickfd) { > >>> + eventfd_ctx_put(vq->kickfd); > >>> + vq->kickfd = NULL; > >>> + } > >>> + spin_unlock(&vq->kick_lock); > >>> + } > >>> +} > >>> diff --git a/drivers/vdpa/vdpa_user/eventfd.h b/drivers/vdpa/vdpa_user/eventfd.h > >>> new file mode 100644 > >>> index 000000000000..14269ff27f47 > >>> --- /dev/null > >>> +++ b/drivers/vdpa/vdpa_user/eventfd.h > >>> @@ -0,0 +1,48 @@ > >>> +/* SPDX-License-Identifier: GPL-2.0-only */ > >>> +/* > >>> + * Eventfd support for VDUSE > >>> + * > >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved. 
> >>> + * > >>> + * Author: Xie Yongji <xieyongji@xxxxxxxxxxxxx> > >>> + * > >>> + */ > >>> + > >>> +#ifndef _VDUSE_EVENTFD_H > >>> +#define _VDUSE_EVENTFD_H > >>> + > >>> +#include <linux/eventfd.h> > >>> +#include <linux/poll.h> > >>> +#include <linux/wait.h> > >>> +#include <uapi/linux/vduse.h> > >>> + > >>> +#include "vduse.h" > >>> + > >>> +struct vduse_dev; > >>> + > >>> +struct vduse_virqfd { > >>> + struct eventfd_ctx *ctx; > >>> + struct vduse_virtqueue *vq; > >>> + struct work_struct inject; > >>> + struct work_struct shutdown; > >>> + wait_queue_entry_t wait; > >>> + poll_table pt; > >>> +}; > >>> + > >>> +int vduse_virqfd_setup(struct vduse_dev *dev, > >>> + struct vduse_vq_eventfd *eventfd); > >>> + > >>> +void vduse_virqfd_release(struct vduse_dev *dev); > >>> + > >>> +int vduse_virqfd_init(void); > >>> + > >>> +void vduse_virqfd_exit(void); > >>> + > >>> +void vduse_vq_kick(struct vduse_virtqueue *vq); > >>> + > >>> +int vduse_kickfd_setup(struct vduse_dev *dev, > >>> + struct vduse_vq_eventfd *eventfd); > >>> + > >>> +void vduse_kickfd_release(struct vduse_dev *dev); > >>> + > >>> +#endif /* _VDUSE_EVENTFD_H */ > >>> diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c > >>> new file mode 100644 > >>> index 000000000000..27022157abc6 > >>> --- /dev/null > >>> +++ b/drivers/vdpa/vdpa_user/iova_domain.c > >>> @@ -0,0 +1,442 @@ > >>> +// SPDX-License-Identifier: GPL-2.0-only > >>> +/* > >>> + * MMU-based IOMMU implementation > >>> + * > >>> + * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved. > >>> + * > >>> + * Author: Xie Yongji <xieyongji@xxxxxxxxxxxxx> > >>> + * > >>> + */ > >>> + > >>> +#include <linux/wait.h> > >>> +#include <linux/slab.h> > >>> +#include <linux/genalloc.h> > >>> +#include <linux/dma-mapping.h> > >>> + > >>> +#include "iova_domain.h" > >>> + > >>> +#define IOVA_CHUNK_SHIFT 26 > >>> +#define IOVA_CHUNK_SIZE (_AC(1, UL) << IOVA_CHUNK_SHIFT) > >>> +#define IOVA_CHUNK_MASK (~(IOVA_CHUNK_SIZE - 1)) > >>> + > >>> +#define IOVA_MIN_SIZE (IOVA_CHUNK_SIZE << 1) > >>> + > >>> +#define IOVA_ALLOC_ORDER 12 > >>> +#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER) > >>> + > >>> +struct vduse_mmap_vma { > >>> + struct vm_area_struct *vma; > >>> + struct list_head list; > >>> +}; > >>> + > >>> +static inline struct page * > >>> +vduse_domain_get_bounce_page(struct vduse_iova_domain *domain, > >>> + unsigned long iova) > >>> +{ > >>> + unsigned long index = iova >> IOVA_CHUNK_SHIFT; > >>> + unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK; > >>> + unsigned long pgindex = chunkoff >> PAGE_SHIFT; > >>> + > >>> + return domain->chunks[index].bounce_pages[pgindex]; > >>> +} > >>> + > >>> +static inline void > >>> +vduse_domain_set_bounce_page(struct vduse_iova_domain *domain, > >>> + unsigned long iova, struct page *page) > >>> +{ > >>> + unsigned long index = iova >> IOVA_CHUNK_SHIFT; > >>> + unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK; > >>> + unsigned long pgindex = chunkoff >> PAGE_SHIFT; > >>> + > >>> + domain->chunks[index].bounce_pages[pgindex] = page; > >>> +} > >>> + > >>> +static inline struct vduse_iova_map * > >>> +vduse_domain_get_iova_map(struct vduse_iova_domain *domain, > >>> + unsigned long iova) > >>> +{ > >>> + unsigned long index = iova >> IOVA_CHUNK_SHIFT; > >>> + unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK; > >>> + unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER; > >>> + > >>> + return domain->chunks[index].iova_map[mapindex]; > >>> +} > >>> + > >>> +static inline void > >>> 
+vduse_domain_set_iova_map(struct vduse_iova_domain *domain, > >>> + unsigned long iova, struct vduse_iova_map *map) > >>> +{ > >>> + unsigned long index = iova >> IOVA_CHUNK_SHIFT; > >>> + unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK; > >>> + unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER; > >>> + > >>> + domain->chunks[index].iova_map[mapindex] = map; > >>> +} > >>> + > >>> +static int > >>> +vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain, > >>> + unsigned long iova, size_t size) > >>> +{ > >>> + struct page *page; > >>> + size_t walk_sz = 0; > >>> + int frees = 0; > >>> + > >>> + while (walk_sz < size) { > >>> + page = vduse_domain_get_bounce_page(domain, iova); > >>> + if (page) { > >>> + vduse_domain_set_bounce_page(domain, iova, NULL); > >>> + put_page(page); > >>> + frees++; > >>> + } > >>> + iova += PAGE_SIZE; > >>> + walk_sz += PAGE_SIZE; > >>> + } > >>> + > >>> + return frees; > >>> +} > >>> + > >>> +int vduse_domain_add_vma(struct vduse_iova_domain *domain, > >>> + struct vm_area_struct *vma) > >>> +{ > >>> + unsigned long size = vma->vm_end - vma->vm_start; > >>> + struct vduse_mmap_vma *mmap_vma; > >>> + > >>> + if (WARN_ON(size != domain->size)) > >>> + return -EINVAL; > >>> + > >>> + mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL); > >>> + if (!mmap_vma) > >>> + return -ENOMEM; > >>> + > >>> + mmap_vma->vma = vma; > >>> + mutex_lock(&domain->vma_lock); > >>> + list_add(&mmap_vma->list, &domain->vma_list); > >>> + mutex_unlock(&domain->vma_lock); > >>> + > >>> + return 0; > >>> +} > >>> + > >>> +void vduse_domain_remove_vma(struct vduse_iova_domain *domain, > >>> + struct vm_area_struct *vma) > >>> +{ > >>> + struct vduse_mmap_vma *mmap_vma; > >>> + > >>> + mutex_lock(&domain->vma_lock); > >>> + list_for_each_entry(mmap_vma, &domain->vma_list, list) { > >>> + if (mmap_vma->vma == vma) { > >>> + list_del(&mmap_vma->list); > >>> + kfree(mmap_vma); > >>> + break; > >>> + } > >>> + } > >>> + mutex_unlock(&domain->vma_lock); > >>> +} > >>> + > >>> +int vduse_domain_add_mapping(struct vduse_iova_domain *domain, > >>> + unsigned long iova, unsigned long orig, > >>> + size_t size, enum dma_data_direction dir) > >>> +{ > >>> + struct vduse_iova_map *map; > >>> + unsigned long last = iova + size; > >>> + > >>> + map = kzalloc(sizeof(struct vduse_iova_map), GFP_ATOMIC); > >>> + if (!map) > >>> + return -ENOMEM; > >>> + > >>> + map->iova = iova; > >>> + map->orig = orig; > >>> + map->size = size; > >>> + map->dir = dir; > >>> + > >>> + while (iova < last) { > >>> + vduse_domain_set_iova_map(domain, iova, map); > >>> + iova += IOVA_ALLOC_SIZE; > >>> + } > >>> + > >>> + return 0; > >>> +} > >>> + > >>> +struct vduse_iova_map * > >>> +vduse_domain_get_mapping(struct vduse_iova_domain *domain, > >>> + unsigned long iova) > >>> +{ > >>> + return vduse_domain_get_iova_map(domain, iova); > >>> +} > >>> + > >>> +void vduse_domain_remove_mapping(struct vduse_iova_domain *domain, > >>> + struct vduse_iova_map *map) > >>> +{ > >>> + unsigned long iova = map->iova; > >>> + unsigned long last = iova + map->size; > >>> + > >>> + while (iova < last) { > >>> + vduse_domain_set_iova_map(domain, iova, NULL); > >>> + iova += IOVA_ALLOC_SIZE; > >>> + } > >>> +} > >>> + > >>> +void vduse_domain_unmap(struct vduse_iova_domain *domain, > >>> + unsigned long iova, size_t size) > >>> +{ > >>> + struct vduse_mmap_vma *mmap_vma; > >>> + unsigned long uaddr; > >>> + > >>> + mutex_lock(&domain->vma_lock); > >>> + list_for_each_entry(mmap_vma, &domain->vma_list, list) { > >>> + 
mmap_read_lock(mmap_vma->vma->vm_mm); > >>> + uaddr = iova + mmap_vma->vma->vm_start; > >>> + zap_page_range(mmap_vma->vma, uaddr, size); > >>> + mmap_read_unlock(mmap_vma->vma->vm_mm); > >>> + } > >>> + mutex_unlock(&domain->vma_lock); > >>> +} > >>> + > >>> +int vduse_domain_direct_map(struct vduse_iova_domain *domain, > >>> + struct vm_area_struct *vma, unsigned long iova) > >>> +{ > >>> + unsigned long uaddr = iova + vma->vm_start; > >>> + unsigned long start = iova & PAGE_MASK; > >>> + unsigned long last = start + PAGE_SIZE - 1; > >>> + unsigned long offset; > >>> + struct vduse_iova_map *map; > >>> + struct page *page = NULL; > >>> + > >>> + map = vduse_domain_get_iova_map(domain, iova); > >>> + if (map) { > >>> + offset = last - map->iova; > >>> + page = virt_to_page(map->orig + offset); > >>> + } > >>> + > >>> + return page ? vm_insert_page(vma, uaddr, page) : -EFAULT; > >>> +} > >> > >> So as we discussed before, we need to find a way to make vhost work. And > >> it's better to make vhost transparent to VDUSE. One idea is to implement > >> a shadow virtqueue here, that is, instead of trying to insert the pages into > >> VDUSE userspace, we use the shadow virtqueue to relay the descriptors to > >> userspace. With this, we don't need stuff like shmfd, etc. > >> > > Good idea! The disadvantage is that performance will go down (one more > > thread-switch overhead, and the vhost-like kworker will become a bottleneck > > without multi-thread support). > > > Yes, the disadvantage is the performance. But it should be simpler (I > guess) and we know it can succeed. > Yes, another advantage is that we can support the VM using anonymous memory. > > > > I think I can try this in v3. And the > > MMU-based IOMMU implementation can be a future optimization in the > > virtio-vdpa case. What's your opinion? > > > Maybe I was wrong, but I think we can try what has been proposed here > first and use the shadow virtqueue as a backup plan if we fail. > OK, I will continue to work on this proposal. > > > > >>> + > >>> +void vduse_domain_bounce(struct vduse_iova_domain *domain, > >>> + unsigned long iova, unsigned long orig, > >>> + size_t size, enum dma_data_direction dir) > >>> +{ > >>> + unsigned int offset = offset_in_page(iova); > >>> + > >>> + while (size) { > >>> + struct page *p = vduse_domain_get_bounce_page(domain, iova); > >>> + size_t copy_len = min_t(size_t, PAGE_SIZE - offset, size); > >>> + void *addr; > >>> + > >>> + if (p) { > >>> + addr = page_address(p) + offset; > >>> + if (dir == DMA_TO_DEVICE) > >>> + memcpy(addr, (void *)orig, copy_len); > >>> + else if (dir == DMA_FROM_DEVICE) > >>> + memcpy((void *)orig, addr, copy_len); > >>> + } > >> > >> I think I'm missing something: for DMA_FROM_DEVICE, if p doesn't exist, how is > >> it expected to work? Or do we need to warn here in this case? > >> > > Yes, I think we need a WARN_ON here. > > > Ok.
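To make that concrete: with the agreed WARN_ON, the bounce helper could end up looking like the sketch below. It is the function from the hunk above, unchanged except for the new else branch, and untested.

    void vduse_domain_bounce(struct vduse_iova_domain *domain,
                             unsigned long iova, unsigned long orig,
                             size_t size, enum dma_data_direction dir)
    {
            unsigned int offset = offset_in_page(iova);

            while (size) {
                    struct page *p = vduse_domain_get_bounce_page(domain, iova);
                    size_t copy_len = min_t(size_t, PAGE_SIZE - offset, size);
                    void *addr;

                    if (p) {
                            addr = page_address(p) + offset;
                            if (dir == DMA_TO_DEVICE)
                                    memcpy(addr, (void *)orig, copy_len);
                            else if (dir == DMA_FROM_DEVICE)
                                    memcpy((void *)orig, addr, copy_len);
                    } else {
                            /* A mapped IOVA with no bounce page is unexpected
                             * when data must be copied back from the device. */
                            WARN_ON(dir == DMA_FROM_DEVICE);
                    }
                    size -= copy_len;
                    orig += copy_len;
                    iova += copy_len;
                    offset = 0;
            }
    }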
> > > > > > > >>> + size -= copy_len; > >>> + orig += copy_len; > >>> + iova += copy_len; > >>> + offset = 0; > >>> + } > >>> +} > >>> + > >>> +int vduse_domain_bounce_map(struct vduse_iova_domain *domain, > >>> + struct vm_area_struct *vma, unsigned long iova) > >>> +{ > >>> + unsigned long uaddr = iova + vma->vm_start; > >>> + unsigned long start = iova & PAGE_MASK; > >>> + unsigned long offset = 0; > >>> + bool found = false; > >>> + struct vduse_iova_map *map; > >>> + struct page *page; > >>> + > >>> + mutex_lock(&domain->map_lock); > >>> + > >>> + page = vduse_domain_get_bounce_page(domain, iova); > >>> + if (page) > >>> + goto unlock; > >>> + > >>> + page = alloc_page(GFP_KERNEL); > >>> + if (!page) > >>> + goto unlock; > >>> + > >>> + while (offset < PAGE_SIZE) { > >>> + unsigned int src_offset = 0, dst_offset = 0; > >>> + void *src, *dst; > >>> + size_t copy_len; > >>> + > >>> + map = vduse_domain_get_iova_map(domain, start + offset); > >>> + if (!map) { > >>> + offset += IOVA_ALLOC_SIZE; > >>> + continue; > >>> + } > >>> + > >>> + found = true; > >>> + offset += map->size; > >>> + if (map->dir == DMA_FROM_DEVICE) > >>> + continue; > >>> + > >>> + if (start > map->iova) > >>> + src_offset = start - map->iova; > >>> + else > >>> + dst_offset = map->iova - start; > >>> + > >>> + src = (void *)(map->orig + src_offset); > >>> + dst = page_address(page) + dst_offset; > >>> + copy_len = min_t(size_t, map->size - src_offset, > >>> + PAGE_SIZE - dst_offset); > >>> + memcpy(dst, src, copy_len); > >>> + } > >>> + if (!found) { > >>> + put_page(page); > >>> + page = NULL; > >>> + } > >>> + vduse_domain_set_bounce_page(domain, iova, page); > >>> +unlock: > >>> + mutex_unlock(&domain->map_lock); > >>> + > >>> + return page ? vm_insert_page(vma, uaddr, page) : -EFAULT; > >>> +} > >>> + > >>> +bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain, > >>> + unsigned long iova) > >>> +{ > >>> + unsigned long index = iova >> IOVA_CHUNK_SHIFT; > >>> + struct vduse_iova_chunk *chunk = &domain->chunks[index]; > >>> + > >>> + return atomic_read(&chunk->map_type) == TYPE_DIRECT_MAP; > >>> +} > >>> + > >>> +unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain, > >>> + size_t size, enum iova_map_type type) > >>> +{ > >>> + struct vduse_iova_chunk *chunk; > >>> + unsigned long iova = 0; > >>> + int align = (type == TYPE_DIRECT_MAP) ? PAGE_SIZE : IOVA_ALLOC_SIZE; > >>> + struct genpool_data_align data = { .align = align }; > >>> + int i; > >>> + > >>> + for (i = 0; i < domain->chunk_num; i++) { > >>> + chunk = &domain->chunks[i]; > >>> + if (unlikely(atomic_read(&chunk->map_type) == TYPE_NONE)) > >>> + atomic_cmpxchg(&chunk->map_type, TYPE_NONE, type); > >>> + > >>> + if (atomic_read(&chunk->map_type) != type) > >>> + continue; > >>> + > >>> + iova = gen_pool_alloc_algo(chunk->pool, size, > >>> + gen_pool_first_fit_align, &data); > >>> + if (iova) > >>> + break; > >>> + } > >>> + > >>> + return iova; > >> > >> I wonder why not just reuse the iova domain implements in > >> driver/iommu/iova.c > >> > > The iova domain in driver/iommu/iova.c is only an iova allocator which > > is implemented by the genpool memory allocator in our case. The other > > part in our iova domain is chunk management and iova_map management. > > We need different chunks to distinguish different dma mapping types: > > consistent mapping or streaming mapping. We can only use > > bouncing-mechanism in the streaming mapping case. > > > To differ dma mappings, you can use two iova domains with different > ranges. 
It looks simpler than the gen_pool. (AFAIK most IOMMU drivers are using the iova domain.) > OK, I see. Will do it in v3. Thanks, Yongji
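P.S. To check that I understand the suggestion, below is a rough sketch of the two-domain layout. The range split, names and helpers are made up for illustration; only the init_iova_domain()/alloc_iova() API from driver/iommu/iova.c is real.

    #include <linux/iova.h>
    #include <linux/sizes.h>

    /* Illustrative split: IOVAs below 1G are streaming (bounce-buffered),
     * IOVAs in [1G, 2G) are consistent (directly mapped). */
    #define VDUSE_STREAM_PFN_START          1
    #define VDUSE_STREAM_PFN_LIMIT          (SZ_1G >> PAGE_SHIFT)
    #define VDUSE_COHERENT_PFN_LIMIT        (SZ_2G >> PAGE_SHIFT)

    struct vduse_iova_domains {
            struct iova_domain stream;      /* streaming DMA, bounced */
            struct iova_domain coherent;    /* coherent DMA, direct map */
    };

    static int vduse_iova_domains_init(struct vduse_iova_domains *doms)
    {
            int ret = iova_cache_get();

            if (ret)
                    return ret;

            /* Same granule for both; the ranges are disjoint, so the mapping
             * type can be derived from the IOVA alone, replacing the per-chunk
             * atomic map_type. */
            init_iova_domain(&doms->stream, PAGE_SIZE, VDUSE_STREAM_PFN_START);
            init_iova_domain(&doms->coherent, PAGE_SIZE, VDUSE_STREAM_PFN_LIMIT);
            return 0;
    }

    static dma_addr_t vduse_alloc_iova(struct iova_domain *iovad, size_t size,
                                       unsigned long limit_pfn)
    {
            unsigned long pages = iova_align(iovad, size) >> iova_shift(iovad);
            /* e.g. limit_pfn = VDUSE_STREAM_PFN_LIMIT - 1 for streaming DMA */
            struct iova *iova = alloc_iova(iovad, pages, limit_pfn, true);

            return iova ? iova_dma_addr(iovad, iova) : 0;
    }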