Today's PCIe devices are capable of using peer-to-peer transactions to communicate directly with other devices. This has been shown to improve latency when communicating between a GPU and an HCA [1], as well as when transferring data between storage and an HCA [2]. However, the kernel is missing some functionality to support this mode of operation: the RDMA subsystem uses get_user_pages() to pin user memory for the memory regions used by its devices.

There have been several attempts to add peer-to-peer transaction support to the kernel. First, a set of linux-rdma patches attempted to extend the RDMA subsystem with a plugin mechanism that would detect user pointers belonging to a peer device [3,4] and request the bus address mappings from the plugin. These were rejected on the grounds that the RDMA subsystem is not the place to keep such a plugin mechanism. Second, in newer kernels the ZONE_DEVICE feature allows persistent memory devices to describe their pages with a struct page, allowing, for example, a get_user_pages() call from the RDMA stack to succeed. However, the current ZONE_DEVICE code only supports devices whose memory can be cached by the CPU, and patches to change that [5] have been rejected so far. A third, longer-term alternative is HMM [6]. HMM allows migrating anonymous memory to a device, and it could possibly be extended to allow requesting a mapping for peer-to-peer access from another device. However, HMM is complex and it could take a long time until it is accepted.

This patch series attempts to use the existing DMA-BUF mechanism in the kernel to provide peer-to-peer transaction support for RDMA devices. DMA-BUF allows one device to attach to a buffer exported by another device. The series allows an RDMA application to use a new kernel API to create an RDMA memory region from a DMA-BUF file descriptor handle. The DMA-BUF object is pinned via the dma_buf_attach() call, and the code then uses dma_buf_map_attachment() to get the mapping to the buffer and registers it with the RDMA device (a rough sketch of this flow is shown below).

As an example use of the series, I've added DMA-BUF support to the NVMe driver. The code adds an ioctl to allocate CMB regions as DMA-BUF objects. These objects can then be used to register a memory region with the new API.

The series is structured as follows. The first two patches are preliminary patches needed for the mlx5_ib driver; they make it easier for the following patches to post a memory registration request using existing infrastructure. Patches 3-4 implement helper functions for DMA-BUF registration in the RDMA subsystem core and expose a user-space command to do the registration. Patch 5 implements the new registration command in the mlx5_ib driver. Finally, patches 6-7 add DMA-BUF support to the NVMe driver.
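To make the importer side concrete, here is a minimal sketch (not taken from the patches) of how the helpers of patches 3-4 might pin and map a DMA-BUF fd for an RDMA device using the existing DMA-BUF API. The function name and exact structure are illustrative only:

    #include <linux/dma-buf.h>
    #include <linux/dma-mapping.h>
    #include <linux/err.h>

    /* Pin the buffer behind a DMA-BUF fd for 'dev' and return its DMA mapping. */
    static struct sg_table *rdma_dmabuf_map(struct device *dev, int fd,
                                            struct dma_buf **dmabuf_out,
                                            struct dma_buf_attachment **attach_out)
    {
            struct dma_buf *dmabuf;
            struct dma_buf_attachment *attach;
            struct sg_table *sgt;

            dmabuf = dma_buf_get(fd);               /* take a reference on the fd's buffer */
            if (IS_ERR(dmabuf))
                    return ERR_CAST(dmabuf);

            attach = dma_buf_attach(dmabuf, dev);   /* pins the exporter's buffer */
            if (IS_ERR(attach)) {
                    dma_buf_put(dmabuf);
                    return ERR_CAST(attach);
            }

            /* The exporter maps the buffer for the attached device.  The returned
             * scatterlist carries DMA addresses but no struct page (see the
             * caveats below); that is enough to build an MR from. */
            sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
            if (IS_ERR(sgt)) {
                    dma_buf_detach(dmabuf, attach);
                    dma_buf_put(dmabuf);
                    return ERR_CAST(sgt);
            }

            *dmabuf_out = dmabuf;
            *attach_out = attach;
            return sgt;
    }

Teardown on MR deregistration would mirror this with dma_buf_unmap_attachment(), dma_buf_detach() and dma_buf_put().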
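On the exporter side, patch 6 moves CMB allocation to genalloc so that patch 7 can hand out CMB sub-regions as DMA-BUF objects. The fragment below is only a rough sketch of such a carve-out under that assumption; the nvme_cmb_pool structure and function names are made up for illustration and are not the ones used in the patches:

    #include <linux/genalloc.h>
    #include <linux/io.h>
    #include <linux/mm.h>

    /* Illustrative CMB bookkeeping; the real driver keeps this in its own
     * per-device state. */
    struct nvme_cmb_pool {
            void __iomem    *vaddr;  /* ioremapped CMB region */
            phys_addr_t      phys;   /* address a peer device would target */
            size_t           size;
            struct gen_pool *pool;
    };

    static int nvme_cmb_pool_init(struct nvme_cmb_pool *cmb)
    {
            /* page-order granularity keeps carved-out regions page aligned */
            cmb->pool = gen_pool_create(PAGE_SHIFT, -1);
            if (!cmb->pool)
                    return -ENOMEM;

            return gen_pool_add_virt(cmb->pool, (unsigned long)cmb->vaddr,
                                     cmb->phys, cmb->size, -1);
    }

    /* Carve a region out of the CMB; the DMA-BUF exporter would describe it
     * to importers using the returned address. */
    static void __iomem *nvme_cmb_alloc(struct nvme_cmb_pool *cmb, size_t size,
                                        phys_addr_t *phys)
    {
            unsigned long vaddr = gen_pool_alloc(cmb->pool, size);

            if (!vaddr)
                    return NULL;
            *phys = gen_pool_virt_to_phys(cmb->pool, vaddr);
            return (void __iomem *)vaddr;
    }

Freed regions would go back via gen_pool_free(), e.g. from the DMA-BUF release callback.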
Caveats:

* Without a real NVMe device with CMB support, I smoke-tested the patches
  with qemu-nvme [7] while assigning a Connect-IB card to the VM. Naturally
  this doesn't work; it remains to be tested with real devices on a system
  that supports peer-to-peer transactions.

* GPU devices sometimes require either changing the addresses exposed on
  their BARs, or invalidating them altogether. DMA-BUF doesn't have such a
  mechanism right now, since it assumes cross-device accesses will be short.
  DMA-BUF would need to be extended with a way to invalidate an existing
  attachment, and the RDMA stack would need a way to invalidate existing
  MRs, similarly to what was done in [3,4].

* It is currently the responsibility of the DMA-BUF exporting code to make
  sure the buffer can be mapped for access by the attaching device. It is
  not always possible to do that (e.g. when the PCIe topology does not allow
  it), or the IOMMU may need to be configured accordingly.

* The scatterlists returned from DMA-BUF in this implementation do not have
  a struct page set. This is fine for RDMA, which only needs the DMA
  addresses, but it may cause issues if someone tries to import these
  DMA-BUF objects into a driver that relies on the page structs.

Comments are welcome.

Haggai

[1] Benchmarking GPUDirect RDMA on Modern Server Platforms, Davide Rossetti,
    https://devblogs.nvidia.com/parallelforall/benchmarking-gpudirect-rdma-on-modern-server-platforms/
[2] Project Donard: Peer-to-peer Communication with NVM Express Devices, Stephen Bates,
    http://blog.pmcs.com/project-donard-peer-to-peer-communication-with-nvm-express-devices-part-1/
[3] [PATCH V2 for-next 0/9] Peer-Direct support, Yishai Hadas,
    https://www.spinics.net/lists/linux-rdma/msg21770.html
[4] [RFC 0/7] Peer-direct memory, Artemy Kovalyov,
    https://www.spinics.net/lists/linux-rdma/msg33294.html
[5] [PATCH RFC 1/1] Add support for ZONE_DEVICE IO memory with struct pages, Stephen Bates,
    https://www.spinics.net/lists/linux-rdma/msg34408.html
[6] HMM (Heterogeneous Memory Management), Jérôme Glisse,
    https://lkml.org/lkml/2016/3/8/721
[7] https://github.com/OpenChannelSSD/qemu-nvme

Haggai Eran (7):
  IB/mlx5: Helper for posting work-requests on the UMR QP
  IB/mlx5: Support registration and invalidate operations on the UMR QP
  IB/core: Helpers for mapping DMA-BUF in MRs
  IB/uverbs: Add command to register a DMA-BUF fd
  IB/mlx5: Implement reg_user_dma_buf_mr
  NVMe: Use genalloc to allocate CMB regions
  NVMe: CMB on DMA-BUF

 drivers/infiniband/core/uverbs.h      |   1 +
 drivers/infiniband/core/uverbs_cmd.c  | 111 ++++++++++++
 drivers/infiniband/core/uverbs_main.c |   1 +
 drivers/infiniband/core/verbs.c       |  60 +++++++
 drivers/infiniband/hw/mlx5/main.c     |  10 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h  |   4 +
 drivers/infiniband/hw/mlx5/mr.c       | 149 ++++++++--------
 drivers/infiniband/hw/mlx5/qp.c       |  71 +++++---
 drivers/nvme/host/Makefile            |   2 +-
 drivers/nvme/host/core.c              |  29 ++++
 drivers/nvme/host/dmabuf.c            | 308 ++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme-pci.h          |  26 +++
 drivers/nvme/host/nvme.h              |   1 +
 drivers/nvme/host/pci.c               |  60 ++++++-
 include/rdma/ib_verbs.h               |  15 ++
 include/uapi/linux/nvme_ioctl.h       |  11 ++
 include/uapi/rdma/ib_user_verbs.h     |  18 ++
 17 files changed, 778 insertions(+), 99 deletions(-)
 create mode 100644 drivers/nvme/host/dmabuf.c
 create mode 100644 drivers/nvme/host/nvme-pci.h

-- 
1.7.11.2