Re: [RFC v2 06/13] vduse: Introduce VDUSE - vDPA Device in Userspace

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2020/12/23 下午10:17, Yongji Xie wrote:
On Wed, Dec 23, 2020 at 4:08 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:

On 2020/12/22 下午10:52, Xie Yongji wrote:
This VDUSE driver enables implementing vDPA devices in userspace.
Both control path and data path of vDPA devices will be able to
be handled in userspace.

In the control path, the VDUSE driver will make use of message
mechnism to forward the config operation from vdpa bus driver
to userspace. Userspace can use read()/write() to receive/reply
those control messages.

In the data path, the VDUSE driver implements a MMU-based on-chip
IOMMU driver which supports mapping the kernel dma buffer to a
userspace iova region dynamically. Userspace can access those
iova region via mmap(). Besides, the eventfd mechanism is used to
trigger interrupt callbacks and receive virtqueue kicks in userspace

Now we only support virtio-vdpa bus driver with this patch applied.

Signed-off-by: Xie Yongji <xieyongji@xxxxxxxxxxxxx>
---
   Documentation/driver-api/vduse.rst                 |   74 ++
   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
   drivers/vdpa/Kconfig                               |    8 +
   drivers/vdpa/Makefile                              |    1 +
   drivers/vdpa/vdpa_user/Makefile                    |    5 +
   drivers/vdpa/vdpa_user/eventfd.c                   |  221 ++++
   drivers/vdpa/vdpa_user/eventfd.h                   |   48 +
   drivers/vdpa/vdpa_user/iova_domain.c               |  442 ++++++++
   drivers/vdpa/vdpa_user/iova_domain.h               |   93 ++
   drivers/vdpa/vdpa_user/vduse.h                     |   59 ++
   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1121 ++++++++++++++++++++
   include/uapi/linux/vdpa.h                          |    1 +
   include/uapi/linux/vduse.h                         |   99 ++
   13 files changed, 2173 insertions(+)
   create mode 100644 Documentation/driver-api/vduse.rst
   create mode 100644 drivers/vdpa/vdpa_user/Makefile
   create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
   create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
   create mode 100644 drivers/vdpa/vdpa_user/vduse.h
   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
   create mode 100644 include/uapi/linux/vduse.h

diff --git a/Documentation/driver-api/vduse.rst b/Documentation/driver-api/vduse.rst
new file mode 100644
index 000000000000..da9b3040f20a
--- /dev/null
+++ b/Documentation/driver-api/vduse.rst
@@ -0,0 +1,74 @@
+==================================
+VDUSE - "vDPA Device in Userspace"
+==================================
+
+vDPA (virtio data path acceleration) device is a device that uses a
+datapath which complies with the virtio specifications with vendor
+specific control path. vDPA devices can be both physically located on
+the hardware or emulated by software. VDUSE is a framework that makes it
+possible to implement software-emulated vDPA devices in userspace.
+
+How VDUSE works
+------------
+Each userspace vDPA device is created by the VDUSE_CREATE_DEV ioctl on
+the VDUSE character device (/dev/vduse). Then a file descriptor pointing
+to the new resources will be returned, which can be used to implement the
+userspace vDPA device's control path and data path.
+
+To implement control path, the read/write operations to the file descriptor
+will be used to receive/reply the control messages from/to VDUSE driver.
+Those control messages are based on the vdpa_config_ops which defines a
+unified interface to control different types of vDPA device.
+
+The following types of messages are provided by the VDUSE framework now:
+
+- VDUSE_SET_VQ_ADDR: Set the addresses of the different aspects of virtqueue.
+
+- VDUSE_SET_VQ_NUM: Set the size of virtqueue
+
+- VDUSE_SET_VQ_READY: Set ready status of virtqueue
+
+- VDUSE_GET_VQ_READY: Get ready status of virtqueue
+
+- VDUSE_SET_FEATURES: Set virtio features supported by the driver
+
+- VDUSE_GET_FEATURES: Get virtio features supported by the device
+
+- VDUSE_SET_STATUS: Set the device status
+
+- VDUSE_GET_STATUS: Get the device status
+
+- VDUSE_SET_CONFIG: Write to device specific configuration space
+
+- VDUSE_GET_CONFIG: Read from device specific configuration space
+
+Please see include/linux/vdpa.h for details.
+
+In the data path, VDUSE framework implements a MMU-based on-chip IOMMU
+driver which supports mapping the kernel dma buffer to a userspace iova
+region dynamically. The userspace iova region can be created by passing
+the userspace vDPA device fd to mmap(2).
+
+Besides, the eventfd mechanism is used to trigger interrupt callbacks and
+receive virtqueue kicks in userspace. The following ioctls on the userspace
+vDPA device fd are provided to support that:
+
+- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
+  by VDUSE driver to notify userspace to consume the vring.
+
+- VDUSE_VQ_SETUP_IRQFD: set the irqfd for virtqueue, this eventfd is used
+  by userspace to notify VDUSE driver to trigger interrupt callbacks.
+
+MMU-based IOMMU Driver
+----------------------
+The basic idea behind the IOMMU driver is treating MMU (VA->PA) as
+IOMMU (IOVA->PA). This driver will set up MMU mapping instead of IOMMU mapping
+for the DMA transfer so that the userspace process is able to use its virtual
+address to access the dma buffer in kernel.
+
+And to avoid security issue, a bounce-buffering mechanism is introduced to
+prevent userspace accessing the original buffer directly which may contain other
+kernel data. During the mapping, unmapping, the driver will copy the data from
+the original buffer to the bounce buffer and back, depending on the direction of
+the transfer. And the bounce-buffer addresses will be mapped into the user address
+space instead of the original one.
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index a4c75a28c839..71722e6f8f23 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
   '|'   00-7F  linux/media.h
   0x80  00-1F  linux/fb.h
+0x81  00-1F  linux/vduse.h
   0x89  00-06  arch/x86/include/asm/sockios.h
   0x89  0B-DF  linux/sockios.h
   0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
index 4be7be39be26..211cc449cbd3 100644
--- a/drivers/vdpa/Kconfig
+++ b/drivers/vdpa/Kconfig
@@ -21,6 +21,14 @@ config VDPA_SIM
         to RX. This device is used for testing, prototyping and
         development of vDPA.

+config VDPA_USER
+     tristate "VDUSE (vDPA Device in Userspace) support"
+     depends on EVENTFD && MMU && HAS_DMA
+     default n

The "default n" is not necessary.

OK.
+     help
+       With VDUSE it is possible to emulate a vDPA Device
+       in a userspace program.
+
   config IFCVF
       tristate "Intel IFC VF vDPA driver"
       depends on PCI_MSI
diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
index d160e9b63a66..66e97778ad03 100644
--- a/drivers/vdpa/Makefile
+++ b/drivers/vdpa/Makefile
@@ -1,5 +1,6 @@
   # SPDX-License-Identifier: GPL-2.0
   obj-$(CONFIG_VDPA) += vdpa.o
   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
+obj-$(CONFIG_VDPA_USER) += vdpa_user/
   obj-$(CONFIG_IFCVF)    += ifcvf/
   obj-$(CONFIG_MLX5_VDPA) += mlx5/
diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
new file mode 100644
index 000000000000..b7645e36992b
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+
+vduse-y := vduse_dev.o iova_domain.o eventfd.o

Do we really need eventfd.o here consider we've selected it.

Do you mean the file "drivers/vdpa/vdpa_user/eventfd.c"?


My bad, I confuse this with the common eventfd. So the code is fine here.



+
+obj-$(CONFIG_VDPA_USER) += vduse.o
diff --git a/drivers/vdpa/vdpa_user/eventfd.c b/drivers/vdpa/vdpa_user/eventfd.c
new file mode 100644
index 000000000000..dbffddb08908
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/eventfd.c
@@ -0,0 +1,221 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Eventfd support for VDUSE
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@xxxxxxxxxxxxx>
+ *
+ */
+
+#include <linux/eventfd.h>
+#include <linux/poll.h>
+#include <linux/wait.h>
+#include <linux/slab.h>
+#include <linux/file.h>
+#include <uapi/linux/vduse.h>
+
+#include "eventfd.h"
+
+static struct workqueue_struct *vduse_irqfd_cleanup_wq;
+
+static void vduse_virqfd_shutdown(struct work_struct *work)
+{
+     u64 cnt;
+     struct vduse_virqfd *virqfd = container_of(work,
+                                     struct vduse_virqfd, shutdown);
+
+     eventfd_ctx_remove_wait_queue(virqfd->ctx, &virqfd->wait, &cnt);
+     flush_work(&virqfd->inject);
+     eventfd_ctx_put(virqfd->ctx);
+     kfree(virqfd);
+}
+
+static void vduse_virqfd_inject(struct work_struct *work)
+{
+     struct vduse_virqfd *virqfd = container_of(work,
+                                     struct vduse_virqfd, inject);
+     struct vduse_virtqueue *vq = virqfd->vq;
+
+     spin_lock_irq(&vq->irq_lock);
+     if (vq->ready && vq->cb)
+             vq->cb(vq->private);
+     spin_unlock_irq(&vq->irq_lock);
+}
+
+static void virqfd_deactivate(struct vduse_virqfd *virqfd)
+{
+     queue_work(vduse_irqfd_cleanup_wq, &virqfd->shutdown);
+}
+
+static int vduse_virqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
+                             int sync, void *key)
+{
+     struct vduse_virqfd *virqfd = container_of(wait, struct vduse_virqfd, wait);
+     struct vduse_virtqueue *vq = virqfd->vq;
+
+     __poll_t flags = key_to_poll(key);
+
+     if (flags & EPOLLIN)
+             schedule_work(&virqfd->inject);
+
+     if (flags & EPOLLHUP) {
+             spin_lock(&vq->irq_lock);
+             if (vq->virqfd == virqfd) {
+                     vq->virqfd = NULL;
+                     virqfd_deactivate(virqfd);
+             }
+             spin_unlock(&vq->irq_lock);
+     }
+
+     return 0;
+}
+
+static void vduse_virqfd_ptable_queue_proc(struct file *file,
+                     wait_queue_head_t *wqh, poll_table *pt)
+{
+     struct vduse_virqfd *virqfd = container_of(pt, struct vduse_virqfd, pt);
+
+     add_wait_queue(wqh, &virqfd->wait);
+}
+
+int vduse_virqfd_setup(struct vduse_dev *dev,
+                     struct vduse_vq_eventfd *eventfd)
+{
+     struct vduse_virqfd *virqfd;
+     struct fd irqfd;
+     struct eventfd_ctx *ctx;
+     struct vduse_virtqueue *vq;
+     __poll_t events;
+     int ret;
+
+     if (eventfd->index >= dev->vq_num)
+             return -EINVAL;
+
+     vq = &dev->vqs[eventfd->index];
+     virqfd = kzalloc(sizeof(*virqfd), GFP_KERNEL);
+     if (!virqfd)
+             return -ENOMEM;
+
+     INIT_WORK(&virqfd->shutdown, vduse_virqfd_shutdown);
+     INIT_WORK(&virqfd->inject, vduse_virqfd_inject);

Any reason that a workqueue is must here?

Mainly for performance considerations. Make sure the push() and pop()
for used vring can be asynchronous.


I see.



+
+     ret = -EBADF;
+     irqfd = fdget(eventfd->fd);
+     if (!irqfd.file)
+             goto err_fd;
+
+     ctx = eventfd_ctx_fileget(irqfd.file);
+     if (IS_ERR(ctx)) {
+             ret = PTR_ERR(ctx);
+             goto err_ctx;
+     }
+
+     virqfd->vq = vq;
+     virqfd->ctx = ctx;
+     spin_lock(&vq->irq_lock);
+     if (vq->virqfd)
+             virqfd_deactivate(virqfd);
+     vq->virqfd = virqfd;
+     spin_unlock(&vq->irq_lock);
+
+     init_waitqueue_func_entry(&virqfd->wait, vduse_virqfd_wakeup);
+     init_poll_funcptr(&virqfd->pt, vduse_virqfd_ptable_queue_proc);
+
+     events = vfs_poll(irqfd.file, &virqfd->pt);
+
+     /*
+      * Check if there was an event already pending on the eventfd
+      * before we registered and trigger it as if we didn't miss it.
+      */
+     if (events & EPOLLIN)
+             schedule_work(&virqfd->inject);
+
+     fdput(irqfd);
+
+     return 0;
+err_ctx:
+     fdput(irqfd);
+err_fd:
+     kfree(virqfd);
+     return ret;
+}
+
+void vduse_virqfd_release(struct vduse_dev *dev)
+{
+     int i;
+
+     for (i = 0; i < dev->vq_num; i++) {
+             struct vduse_virtqueue *vq = &dev->vqs[i];
+
+             spin_lock(&vq->irq_lock);
+             if (vq->virqfd) {
+                     virqfd_deactivate(vq->virqfd);
+                     vq->virqfd = NULL;
+             }
+             spin_unlock(&vq->irq_lock);
+     }
+     flush_workqueue(vduse_irqfd_cleanup_wq);
+}
+
+int vduse_virqfd_init(void)
+{
+     vduse_irqfd_cleanup_wq = alloc_workqueue("vduse-irqfd-cleanup",
+                                             WQ_UNBOUND, 0);
+     if (!vduse_irqfd_cleanup_wq)
+             return -ENOMEM;
+
+     return 0;
+}
+
+void vduse_virqfd_exit(void)
+{
+     destroy_workqueue(vduse_irqfd_cleanup_wq);
+}
+
+void vduse_vq_kick(struct vduse_virtqueue *vq)
+{
+     spin_lock(&vq->kick_lock);
+     if (vq->ready && vq->kickfd)
+             eventfd_signal(vq->kickfd, 1);
+     spin_unlock(&vq->kick_lock);
+}
+
+int vduse_kickfd_setup(struct vduse_dev *dev,
+                     struct vduse_vq_eventfd *eventfd)
+{
+     struct eventfd_ctx *ctx;
+     struct vduse_virtqueue *vq;
+
+     if (eventfd->index >= dev->vq_num)
+             return -EINVAL;
+
+     vq = &dev->vqs[eventfd->index];
+     ctx = eventfd_ctx_fdget(eventfd->fd);
+     if (IS_ERR(ctx))
+             return PTR_ERR(ctx);
+
+     spin_lock(&vq->kick_lock);
+     if (vq->kickfd)
+             eventfd_ctx_put(vq->kickfd);
+     vq->kickfd = ctx;
+     spin_unlock(&vq->kick_lock);
+
+     return 0;
+}
+
+void vduse_kickfd_release(struct vduse_dev *dev)
+{
+     int i;
+
+     for (i = 0; i < dev->vq_num; i++) {
+             struct vduse_virtqueue *vq = &dev->vqs[i];
+
+             spin_lock(&vq->kick_lock);
+             if (vq->kickfd) {
+                     eventfd_ctx_put(vq->kickfd);
+                     vq->kickfd = NULL;
+             }
+             spin_unlock(&vq->kick_lock);
+     }
+}
diff --git a/drivers/vdpa/vdpa_user/eventfd.h b/drivers/vdpa/vdpa_user/eventfd.h
new file mode 100644
index 000000000000..14269ff27f47
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/eventfd.h
@@ -0,0 +1,48 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Eventfd support for VDUSE
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@xxxxxxxxxxxxx>
+ *
+ */
+
+#ifndef _VDUSE_EVENTFD_H
+#define _VDUSE_EVENTFD_H
+
+#include <linux/eventfd.h>
+#include <linux/poll.h>
+#include <linux/wait.h>
+#include <uapi/linux/vduse.h>
+
+#include "vduse.h"
+
+struct vduse_dev;
+
+struct vduse_virqfd {
+     struct eventfd_ctx *ctx;
+     struct vduse_virtqueue *vq;
+     struct work_struct inject;
+     struct work_struct shutdown;
+     wait_queue_entry_t wait;
+     poll_table pt;
+};
+
+int vduse_virqfd_setup(struct vduse_dev *dev,
+                     struct vduse_vq_eventfd *eventfd);
+
+void vduse_virqfd_release(struct vduse_dev *dev);
+
+int vduse_virqfd_init(void);
+
+void vduse_virqfd_exit(void);
+
+void vduse_vq_kick(struct vduse_virtqueue *vq);
+
+int vduse_kickfd_setup(struct vduse_dev *dev,
+                     struct vduse_vq_eventfd *eventfd);
+
+void vduse_kickfd_release(struct vduse_dev *dev);
+
+#endif /* _VDUSE_EVENTFD_H */
diff --git a/drivers/vdpa/vdpa_user/iova_domain.c b/drivers/vdpa/vdpa_user/iova_domain.c
new file mode 100644
index 000000000000..27022157abc6
--- /dev/null
+++ b/drivers/vdpa/vdpa_user/iova_domain.c
@@ -0,0 +1,442 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * MMU-based IOMMU implementation
+ *
+ * Copyright (C) 2020 Bytedance Inc. and/or its affiliates. All rights reserved.
+ *
+ * Author: Xie Yongji <xieyongji@xxxxxxxxxxxxx>
+ *
+ */
+
+#include <linux/wait.h>
+#include <linux/slab.h>
+#include <linux/genalloc.h>
+#include <linux/dma-mapping.h>
+
+#include "iova_domain.h"
+
+#define IOVA_CHUNK_SHIFT 26
+#define IOVA_CHUNK_SIZE (_AC(1, UL) << IOVA_CHUNK_SHIFT)
+#define IOVA_CHUNK_MASK (~(IOVA_CHUNK_SIZE - 1))
+
+#define IOVA_MIN_SIZE (IOVA_CHUNK_SIZE << 1)
+
+#define IOVA_ALLOC_ORDER 12
+#define IOVA_ALLOC_SIZE (1 << IOVA_ALLOC_ORDER)
+
+struct vduse_mmap_vma {
+     struct vm_area_struct *vma;
+     struct list_head list;
+};
+
+static inline struct page *
+vduse_domain_get_bounce_page(struct vduse_iova_domain *domain,
+                             unsigned long iova)
+{
+     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
+     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
+     unsigned long pgindex = chunkoff >> PAGE_SHIFT;
+
+     return domain->chunks[index].bounce_pages[pgindex];
+}
+
+static inline void
+vduse_domain_set_bounce_page(struct vduse_iova_domain *domain,
+                             unsigned long iova, struct page *page)
+{
+     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
+     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
+     unsigned long pgindex = chunkoff >> PAGE_SHIFT;
+
+     domain->chunks[index].bounce_pages[pgindex] = page;
+}
+
+static inline struct vduse_iova_map *
+vduse_domain_get_iova_map(struct vduse_iova_domain *domain,
+                             unsigned long iova)
+{
+     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
+     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
+     unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
+
+     return domain->chunks[index].iova_map[mapindex];
+}
+
+static inline void
+vduse_domain_set_iova_map(struct vduse_iova_domain *domain,
+                     unsigned long iova, struct vduse_iova_map *map)
+{
+     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
+     unsigned long chunkoff = iova & ~IOVA_CHUNK_MASK;
+     unsigned long mapindex = chunkoff >> IOVA_ALLOC_ORDER;
+
+     domain->chunks[index].iova_map[mapindex] = map;
+}
+
+static int
+vduse_domain_free_bounce_pages(struct vduse_iova_domain *domain,
+                             unsigned long iova, size_t size)
+{
+     struct page *page;
+     size_t walk_sz = 0;
+     int frees = 0;
+
+     while (walk_sz < size) {
+             page = vduse_domain_get_bounce_page(domain, iova);
+             if (page) {
+                     vduse_domain_set_bounce_page(domain, iova, NULL);
+                     put_page(page);
+                     frees++;
+             }
+             iova += PAGE_SIZE;
+             walk_sz += PAGE_SIZE;
+     }
+
+     return frees;
+}
+
+int vduse_domain_add_vma(struct vduse_iova_domain *domain,
+                             struct vm_area_struct *vma)
+{
+     unsigned long size = vma->vm_end - vma->vm_start;
+     struct vduse_mmap_vma *mmap_vma;
+
+     if (WARN_ON(size != domain->size))
+             return -EINVAL;
+
+     mmap_vma = kmalloc(sizeof(*mmap_vma), GFP_KERNEL);
+     if (!mmap_vma)
+             return -ENOMEM;
+
+     mmap_vma->vma = vma;
+     mutex_lock(&domain->vma_lock);
+     list_add(&mmap_vma->list, &domain->vma_list);
+     mutex_unlock(&domain->vma_lock);
+
+     return 0;
+}
+
+void vduse_domain_remove_vma(struct vduse_iova_domain *domain,
+                             struct vm_area_struct *vma)
+{
+     struct vduse_mmap_vma *mmap_vma;
+
+     mutex_lock(&domain->vma_lock);
+     list_for_each_entry(mmap_vma, &domain->vma_list, list) {
+             if (mmap_vma->vma == vma) {
+                     list_del(&mmap_vma->list);
+                     kfree(mmap_vma);
+                     break;
+             }
+     }
+     mutex_unlock(&domain->vma_lock);
+}
+
+int vduse_domain_add_mapping(struct vduse_iova_domain *domain,
+                             unsigned long iova, unsigned long orig,
+                             size_t size, enum dma_data_direction dir)
+{
+     struct vduse_iova_map *map;
+     unsigned long last = iova + size;
+
+     map = kzalloc(sizeof(struct vduse_iova_map), GFP_ATOMIC);
+     if (!map)
+             return -ENOMEM;
+
+     map->iova = iova;
+     map->orig = orig;
+     map->size = size;
+     map->dir = dir;
+
+     while (iova < last) {
+             vduse_domain_set_iova_map(domain, iova, map);
+             iova += IOVA_ALLOC_SIZE;
+     }
+
+     return 0;
+}
+
+struct vduse_iova_map *
+vduse_domain_get_mapping(struct vduse_iova_domain *domain,
+                     unsigned long iova)
+{
+     return vduse_domain_get_iova_map(domain, iova);
+}
+
+void vduse_domain_remove_mapping(struct vduse_iova_domain *domain,
+                             struct vduse_iova_map *map)
+{
+     unsigned long iova = map->iova;
+     unsigned long last = iova + map->size;
+
+     while (iova < last) {
+             vduse_domain_set_iova_map(domain, iova, NULL);
+             iova += IOVA_ALLOC_SIZE;
+     }
+}
+
+void vduse_domain_unmap(struct vduse_iova_domain *domain,
+                     unsigned long iova, size_t size)
+{
+     struct vduse_mmap_vma *mmap_vma;
+     unsigned long uaddr;
+
+     mutex_lock(&domain->vma_lock);
+     list_for_each_entry(mmap_vma, &domain->vma_list, list) {
+             mmap_read_lock(mmap_vma->vma->vm_mm);
+             uaddr = iova + mmap_vma->vma->vm_start;
+             zap_page_range(mmap_vma->vma, uaddr, size);
+             mmap_read_unlock(mmap_vma->vma->vm_mm);
+     }
+     mutex_unlock(&domain->vma_lock);
+}
+
+int vduse_domain_direct_map(struct vduse_iova_domain *domain,
+                     struct vm_area_struct *vma, unsigned long iova)
+{
+     unsigned long uaddr = iova + vma->vm_start;
+     unsigned long start = iova & PAGE_MASK;
+     unsigned long last = start + PAGE_SIZE - 1;
+     unsigned long offset;
+     struct vduse_iova_map *map;
+     struct page *page = NULL;
+
+     map = vduse_domain_get_iova_map(domain, iova);
+     if (map) {
+             offset = last - map->iova;
+             page = virt_to_page(map->orig + offset);
+     }
+
+     return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
+}

So as we discussed before, we need to find way to make vhost work. And
it's better to make vhost transparent to VDUSE. One idea is to implement
shadow virtqueue here, that is, instead of trying to insert the pages to
VDUSE userspace, we use the shadow virtqueue to relay the descriptors to
userspace. With this, we don't need stuffs like shmfd etc.

Good idea! The disadvantage is performance will go down (one more
thread switch overhead and vhost-liked kworker will become bottleneck
without multi-thread support).


Yes, the disadvantage is the performance. But it should be simpler (I guess) and we know it can succeed.



I think I can try this in v3. And the
MMU-based IOMMU implementation can be a future optimization in the
virtio-vdpa case. What's your opinion?


Maybe I was wrong, but I think we can try as what has been proposed here first and use shadow virtqueue as backup plan if we fail.



+
+void vduse_domain_bounce(struct vduse_iova_domain *domain,
+                     unsigned long iova, unsigned long orig,
+                     size_t size, enum dma_data_direction dir)
+{
+     unsigned int offset = offset_in_page(iova);
+
+     while (size) {
+             struct page *p = vduse_domain_get_bounce_page(domain, iova);
+             size_t copy_len = min_t(size_t, PAGE_SIZE - offset, size);
+             void *addr;
+
+             if (p) {
+                     addr = page_address(p) + offset;
+                     if (dir == DMA_TO_DEVICE)
+                             memcpy(addr, (void *)orig, copy_len);
+                     else if (dir == DMA_FROM_DEVICE)
+                             memcpy((void *)orig, addr, copy_len);
+             }

I think I miss something, for DMA_FROM_DEVICE, if p doesn't exist how is
it expected to work? Or do we need to warn here in this case?

Yes, I think we need a WARN_ON here.


Ok.




+             size -= copy_len;
+             orig += copy_len;
+             iova += copy_len;
+             offset = 0;
+     }
+}
+
+int vduse_domain_bounce_map(struct vduse_iova_domain *domain,
+                     struct vm_area_struct *vma, unsigned long iova)
+{
+     unsigned long uaddr = iova + vma->vm_start;
+     unsigned long start = iova & PAGE_MASK;
+     unsigned long offset = 0;
+     bool found = false;
+     struct vduse_iova_map *map;
+     struct page *page;
+
+     mutex_lock(&domain->map_lock);
+
+     page = vduse_domain_get_bounce_page(domain, iova);
+     if (page)
+             goto unlock;
+
+     page = alloc_page(GFP_KERNEL);
+     if (!page)
+             goto unlock;
+
+     while (offset < PAGE_SIZE) {
+             unsigned int src_offset = 0, dst_offset = 0;
+             void *src, *dst;
+             size_t copy_len;
+
+             map = vduse_domain_get_iova_map(domain, start + offset);
+             if (!map) {
+                     offset += IOVA_ALLOC_SIZE;
+                     continue;
+             }
+
+             found = true;
+             offset += map->size;
+             if (map->dir == DMA_FROM_DEVICE)
+                     continue;
+
+             if (start > map->iova)
+                     src_offset = start - map->iova;
+             else
+                     dst_offset = map->iova - start;
+
+             src = (void *)(map->orig + src_offset);
+             dst = page_address(page) + dst_offset;
+             copy_len = min_t(size_t, map->size - src_offset,
+                             PAGE_SIZE - dst_offset);
+             memcpy(dst, src, copy_len);
+     }
+     if (!found) {
+             put_page(page);
+             page = NULL;
+     }
+     vduse_domain_set_bounce_page(domain, iova, page);
+unlock:
+     mutex_unlock(&domain->map_lock);
+
+     return page ? vm_insert_page(vma, uaddr, page) : -EFAULT;
+}
+
+bool vduse_domain_is_direct_map(struct vduse_iova_domain *domain,
+                             unsigned long iova)
+{
+     unsigned long index = iova >> IOVA_CHUNK_SHIFT;
+     struct vduse_iova_chunk *chunk = &domain->chunks[index];
+
+     return atomic_read(&chunk->map_type) == TYPE_DIRECT_MAP;
+}
+
+unsigned long vduse_domain_alloc_iova(struct vduse_iova_domain *domain,
+                                     size_t size, enum iova_map_type type)
+{
+     struct vduse_iova_chunk *chunk;
+     unsigned long iova = 0;
+     int align = (type == TYPE_DIRECT_MAP) ? PAGE_SIZE : IOVA_ALLOC_SIZE;
+     struct genpool_data_align data = { .align = align };
+     int i;
+
+     for (i = 0; i < domain->chunk_num; i++) {
+             chunk = &domain->chunks[i];
+             if (unlikely(atomic_read(&chunk->map_type) == TYPE_NONE))
+                     atomic_cmpxchg(&chunk->map_type, TYPE_NONE, type);
+
+             if (atomic_read(&chunk->map_type) != type)
+                     continue;
+
+             iova = gen_pool_alloc_algo(chunk->pool, size,
+                                     gen_pool_first_fit_align, &data);
+             if (iova)
+                     break;
+     }
+
+     return iova;

I wonder why not just reuse the iova domain implements in
driver/iommu/iova.c

The iova domain in driver/iommu/iova.c is only an iova allocator which
is implemented by the genpool memory allocator in our case. The other
part in our iova domain is chunk management and iova_map management.
We need different chunks to distinguish different dma mapping types:
consistent mapping or streaming mapping. We can only use
bouncing-mechanism in the streaming mapping case.


To differ dma mappings, you can use two iova domains with different ranges. It looks simpler than the gen_pool. (AFAIK most IOMMU driver is using iova domain).

Thanks






[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Bugtraq]     [Linux OMAP]     [Linux MIPS]     [eCos]     [Asterisk Internet PBX]     [Linux API]

  Powered by Linux