On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2018-02-28 at 16:39 -0700, Logan Gunthorpe wrote:
> > Hi Everyone,
>
>
> So Oliver (CC) was having issues getting any of that to work for us.
>
> The problem is that according to him (I didn't double check the latest
> patches) you effectively hotplug the PCIe memory into the system when
> creating struct pages.
>
> This cannot possibly work for us. First we cannot map PCIe memory as
> cacheable. (Note that doing so is a bad idea if you are behind a PLX
> switch anyway since you'd have to manage cache coherency in SW).

Note: I think the above means it won't work behind a switch on x86
either, will it?

> Then our MMIO space is so far away from our memory space that there is
> not enough vmemmap virtual space to be able to do that.
>
> So this can only work across architectures by using something like HMM
> to create special device struct pages.
>
> Ben.
>
>
> > Here's v2 of our series to introduce P2P based copy offload to NVMe
> > fabrics. This version has been rebased onto v4.16-rc3 which already
> > includes Christoph's devpagemap work the previous version was based
> > off as well as a couple of the cleanup patches that were in v1.
> >
> > Additionally, we've made the following changes based on feedback:
> >
> > * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
> >   as a bunch of cleanup and spelling fixes he pointed out in the last
> >   series.
> >
> > * To address Alex's ACS concerns, we change to a simpler method of
> >   just disabling ACS behind switches for any kernel that has
> >   CONFIG_PCI_P2PDMA.
> >
> > * We also reject using devices that employ 'dma_virt_ops' which should
> >   fairly simply handle Jason's concerns that this work might break with
> >   the HFI, QIB and rxe drivers that use the virtual ops to implement
> >   their own special DMA operations.
> >
> > Thanks,
> >
> > Logan
> >
> > --
> >
> > This is a continuation of our work to enable using Peer-to-Peer PCI
> > memory in NVMe fabrics targets. Many thanks go to Christoph Hellwig who
> > provided valuable feedback to get these patches to where they are today.
> >
> > The concept here is to use memory that's exposed on a PCI BAR as
> > data buffers in the NVMe target code such that data can be transferred
> > from an RDMA NIC to the special memory and then directly to an NVMe
> > device avoiding system memory entirely. The upside of this is better
> > QoS for applications running on the CPU utilizing memory and lower
> > PCI bandwidth required to the CPU (such that systems could be designed
> > with fewer lanes connected to the CPU). However, the trade-off is
> > currently a reduction in overall throughput (largely due to hardware
> > issues that would certainly improve in the future).
> >
> > Due to these trade-offs we've designed the system to only enable using
> > the PCI memory in cases where the NIC, NVMe devices and memory are all
> > behind the same PCI switch. This will mean many setups that could likely
> > work well will not be supported so that we can be more confident it
> > will work and not place any responsibility on the user to understand
> > their topology. (We chose to go this route based on feedback we
> > received at the last LSF). Future work may enable these transfers behind
> > a fabric of PCI switches or perhaps using a white list of known good
> > root complexes.
> >
> > In order to enable this functionality, we introduce a few new PCI
> > functions such that a driver can register P2P memory with the system.
> > Struct pages are created for this memory using devm_memremap_pages()
> > and the PCI bus offset is stored in the corresponding pagemap structure.
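
To make the stored bus offset concrete, here is a minimal user-space
sketch, not code from the series. It assumes the offset is defined as
the CPU physical start of the BAR minus its PCI bus address; the sign
convention, the struct and both function names are hypothetical and
only illustrate why the offset has to be recorded at registration time.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-BAR pagemap state (not struct dev_pagemap). */
struct p2p_pagemap_sketch {
	uint64_t phys_start;	/* CPU physical address where the BAR is mapped */
	uint64_t size;		/* length of the registered window in bytes */
	uint64_t bus_offset;	/* assumed: phys_start minus the PCI bus address */
};

/* Record the offset once, when the P2P memory is registered. */
static struct p2p_pagemap_sketch
p2p_register_sketch(uint64_t phys_start, uint64_t bus_start, uint64_t size)
{
	struct p2p_pagemap_sketch pm;

	assert(size > 0);
	pm.phys_start = phys_start;
	pm.size = size;
	/* Unsigned wrap-around keeps this correct even if bus_start > phys_start. */
	pm.bus_offset = phys_start - bus_start;
	return pm;
}

/*
 * Translate a CPU physical address inside the window into the address a
 * peer device would have to put on the bus. Returns 0 on success and -1
 * if the address lies outside the registered window.
 */
static int p2p_phys_to_bus_sketch(const struct p2p_pagemap_sketch *pm,
				  uint64_t phys, uint64_t *bus)
{
	if (phys < pm->phys_start || phys - pm->phys_start >= pm->size)
		return -1;
	*bus = phys - pm->bus_offset;
	return 0;
}
```

With a BAR at CPU physical 0x200000000000 and bus address 0x80000000,
an address 0x10 bytes into the window translates to bus address
0x80000010 under these assumptions.
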
> >
> > Another set of functions allows a client driver to create a list of
> > client devices that will be used in a given P2P transaction and then
> > use that list to find any P2P memory that is supported by all the
> > client devices. This list is then also used to selectively disable the
> > ACS bits for the downstream ports behind these devices.
> >
> > In the block layer, we also introduce a P2P request flag to indicate a
> > given request targets P2P memory as well as a flag for a request queue
> > to indicate a given queue supports targeting P2P memory. P2P requests
> > will only be accepted by queues that support it. Also, P2P requests
> > are marked so they are not merged, since a non-homogeneous request
> > would complicate the DMA mapping requirements.
> >
> > In the PCI NVMe driver, we modify the existing CMB support to utilize
> > the new PCI P2P memory infrastructure and also add support for P2P
> > memory in its request queue. When a P2P request is received it uses the
> > pci_p2pmem_map_sg() function which applies the necessary transformation
> > to get the correct pci_bus_addr_t for the DMA transactions.
> >
> > In the RDMA core, we also adjust rdma_rw_ctx_init() and
> > rdma_rw_ctx_destroy() to take a flags argument which indicates whether
> > to use the PCI P2P mapping functions or not.
> >
> > Finally, in the NVMe fabrics target port we introduce a new
> > configuration boolean: 'allow_p2pmem'. When set, the port will attempt
> > to find P2P memory supported by the RDMA NIC and all namespaces. If
> > supported memory is found, it will be used in all IO transfers. And if
> > a port is using P2P memory, adding new namespaces that are not supported
> > by that memory will fail.
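
The client-list behaviour described above can be modelled roughly as
follows. This is a simplified user-space sketch, not the nvmet or
p2pdma code: it assumes "supported" means "behind the same PCI switch
as the memory provider", and every type and function name in it is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CLIENTS_SKETCH 8

/* Hypothetical model: a device is described only by the switch it sits behind. */
struct dev_sketch {
	int switch_id;
};

/* Hypothetical port state: the chosen P2P memory provider plus its clients. */
struct port_sketch {
	const struct dev_sketch *p2p_provider;	/* NULL when P2P memory is not in use */
	const struct dev_sketch *clients[MAX_CLIENTS_SKETCH];	/* NIC and namespaces */
	size_t nr_clients;
};

/* Assumed support rule: provider and client share the same switch. */
static int p2p_supported_sketch(const struct dev_sketch *provider,
				const struct dev_sketch *client)
{
	return provider->switch_id == client->switch_id;
}

/*
 * Add a client (NIC or namespace) to the port. Returns 0 on success and
 * -1 if the list is full or if the port already uses P2P memory that the
 * new client cannot reach.
 */
static int port_add_client_sketch(struct port_sketch *port,
				  const struct dev_sketch *client)
{
	assert(port != NULL && client != NULL);

	if (port->nr_clients >= MAX_CLIENTS_SKETCH)
		return -1;
	if (port->p2p_provider &&
	    !p2p_supported_sketch(port->p2p_provider, client))
		return -1;

	port->clients[port->nr_clients++] = client;
	return 0;
}
```

In this model, once a port has picked a provider behind switch 1, a
namespace behind switch 2 is rejected while one behind switch 1 is
accepted, which mirrors the "adding new namespaces ... will fail" rule
quoted above.
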
> >
> > Logan Gunthorpe (10):
> >   PCI/P2PDMA: Support peer to peer memory
> >   PCI/P2PDMA: Add sysfs group to display p2pmem stats
> >   PCI/P2PDMA: Add PCI p2pmem dma mappings to adjust the bus offset
> >   PCI/P2PDMA: Clear ACS P2P flags for all devices behind switches
> >   block: Introduce PCI P2P flags for request and request queue
> >   IB/core: Add optional PCI P2P flag to rdma_rw_ctx_[init|destroy]()
> >   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
> >   nvme-pci: Add support for P2P memory in requests
> >   nvme-pci: Add a quirk for a pseudo CMB
> >   nvmet: Optionally use PCI P2P memory
> >
> >  Documentation/ABI/testing/sysfs-bus-pci |  25 ++
> >  block/blk-core.c                        |   3 +
> >  drivers/infiniband/core/rw.c            |  21 +-
> >  drivers/infiniband/ulp/isert/ib_isert.c |   5 +-
> >  drivers/infiniband/ulp/srpt/ib_srpt.c   |   7 +-
> >  drivers/nvme/host/core.c                |   4 +
> >  drivers/nvme/host/nvme.h                |   8 +
> >  drivers/nvme/host/pci.c                 | 118 ++++--
> >  drivers/nvme/target/configfs.c          |  29 ++
> >  drivers/nvme/target/core.c              |  95 ++++-
> >  drivers/nvme/target/io-cmd.c            |   3 +
> >  drivers/nvme/target/nvmet.h             |  10 +
> >  drivers/nvme/target/rdma.c              |  43 +-
> >  drivers/pci/Kconfig                     |  20 +
> >  drivers/pci/Makefile                    |   1 +
> >  drivers/pci/p2pdma.c                    | 713 ++++++++++++++++++++++++++++++++
> >  drivers/pci/pci.c                       |   4 +
> >  include/linux/blk_types.h               |  18 +-
> >  include/linux/blkdev.h                  |   3 +
> >  include/linux/memremap.h                |  19 +
> >  include/linux/pci-p2pdma.h              | 105 +++++
> >  include/linux/pci.h                     |   4 +
> >  include/rdma/rw.h                       |   7 +-
> >  net/sunrpc/xprtrdma/svc_rdma_rw.c       |   6 +-
> >  24 files changed, 1204 insertions(+), 67 deletions(-)
> >  create mode 100644 drivers/pci/p2pdma.c
> >  create mode 100644 include/linux/pci-p2pdma.h
> >
> > --
> > 2.11.0