On Thu, Oct 04, 2018 at 03:27:34PM -0600, Logan Gunthorpe wrote: > This is v9 for the p2pdma patch set. It has some substantial changes > from the previous version. Essentially, the last patch has changed > based on information from Sagi that the P2P memory can be assigned > per namespace instead of globally. To that end, I've added a > radix tree that stores the p2p devices to use for each namespace and > a lookup will be done on that before allocating the memory. > > This has some knock on effects of simplifying the requirements > from the p2pdma code and so we drop all the client list code seeing > we don't need to worry about selecting a P2P device that works > with every namespace at once; only each namespace independantly. > Thus, Patch 1, 6 and 13 have significant changes. However, > I think the removal of the client list code will be generally > appreciated based on the feedback I've gotten from Christian. > > As usual, a git repo of the branch is available here: > > https://github.com/sbates130272/linux-p2pmem pci-p2p-v9 > > Thanks, > > Logan > > -- > > Changes in v9: > * Rebased on v4.19-rc6 > * Droped pci_p2pdma_add_client(), pci_p2pdma_remove_client() and > pci_p2pdma_client_list_free(), and reworked pci_p2pdma_distance() and > pci_p2pmem_find() to take simple arrays instead of list_heads. > * Updated the documentation to match > * Reworked the nvmet code to have a configfs attribute per namespace > instead of per port. Then each namespace has it's own P2P device > stored in a radix-tree in the controller (it can't be stored in > the nvme_ns structure seeing it needs to be unique per controller). > * Dropped the 'sq' argument in nvmet_req_alloc_sgl() seeing I've noticed > it is now available in the nvmet_req structure. > * Collected Keith's reviewed-by tags (however, I have not applied his > tag to the last patch seeing I think it has changed too much since > his review). > > Changes in v8: > > * Added a couple of comments to address Bart's feedback and > removed the bogus call to percpu_ref_switch_to_atomic_sync() > > Changes in v7: > > * Rebased on v4.19-rc5 > > * Fixed up commit message of patch 7 that was no longer accurate. (as > pointed out by Jens) > * Change the configfs to not use "auto" or "none" and instead just > use a 0/1/<pci_dev> (or boolean). This matches the existing > nvme-target configfs booleans. (Per Bjorn) > * A handful of other minor changes and edits that were noticed by Bjorn > * Collected Acks from Bjorn > > Changes in v6: > > * Rebased on v4.19-rc3 > > * Remove the changes to the common submit_bio() path and instead > set REQ_NOMERGE in the NVME target driver, when appropriate. > Per discussions with Jens and Christoph. > > * Some minor grammar changes in the documentation as spotted by Randy. > > Changes in v5: > > * Rebased on v4.19-rc1 > > * Drop changing ACS settings in this patchset. Now, the code > will only allow P2P transactions between devices whos > downstream ports do not restrict P2P TLPs. > > * Drop the REQ_PCI_P2PDMA block flag and instead use > is_pci_p2pdma_page() to tell if a request is P2P or not. In that > case we check for queue support and enforce using REQ_NOMERGE. > Per feedback from Christoph. > > * Drop the pci_p2pdma_unmap_sg() function as it was empty and only > there for symmetry and compatibility with dma_unmap_sg. Per feedback > from Christoph. > > * Split off the logic to handle enabling P2P in NVMe fabrics' configfs > into specific helpers in the p2pdma code. Per feedback from Christoph. > > * A number of other minor cleanups and fixes as pointed out by > Christoph and others. > > Changes in v4: > > * Change the original upstream_bridges_match() function to > upstream_bridge_distance() which calculates the distance between two > devices as long as they are behind the same root port. This should > address Bjorn's concerns that the code was to focused on > being behind a single switch. > > * The disable ACS function now disables ACS for all bridge ports instead > of switch ports (ie. those that had two upstream_bridge ports). > > * Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl() > API to be more like sgl_alloc() in that the alloc function returns > the allocated scatterlist and nents is not required bythe free > function. > > * Moved the new documentation into the driver-api tree as requested > by Jonathan > > * Add SGL alloc and free helpers in the nvmet code so that the > individual drivers can share the code that allocates P2P memory. > As requested by Christoph. > > * Cleanup the nvmet_p2pmem_store() function as Christoph > thought my first attempt was ugly. > > * Numerous commit message and comment fix-ups > > Changes in v3: > > * Many more fixes and minor cleanups that were spotted by Bjorn > > * Additional explanation of the ACS change in both the commit message > and Kconfig doc. Also, the code that disables the ACS bits is surrounded > explicitly by an #ifdef > > * Removed the flag we added to rdma_rw_ctx() in favour of using > is_pci_p2pdma_page(), as suggested by Sagi. > > * Adjust pci_p2pmem_find() so that it prefers P2P providers that > are closest to (or the same as) the clients using them. In cases > of ties, the provider is randomly chosen. > > * Modify the NVMe Target code so that the PCI device name of the provider > may be explicitly specified, bypassing the logic in pci_p2pmem_find(). > (Note: it's still enforced that the provider must be behind the > same switch as the clients). > > * As requested by Bjorn, added documentation for driver writers. > > > Changes in v2: > > * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well > as a bunch of cleanup and spelling fixes he pointed out in the last > series. > > * To address Alex's ACS concerns, we change to a simpler method of > just disabling ACS behind switches for any kernel that has > CONFIG_PCI_P2PDMA. > > * We also reject using devices that employ 'dma_virt_ops' which should > fairly simply handle Jason's concerns that this work might break with > the HFI, QIB and rxe drivers that use the virtual ops to implement > their own special DMA operations. > > -- > > This is a continuation of our work to enable using Peer-to-Peer PCI > memory in the kernel with initial support for the NVMe fabrics target > subsystem. Many thanks go to Christoph Hellwig who provided valuable > feedback to get these patches to where they are today. > > The concept here is to use memory that's exposed on a PCI BAR as > data buffers in the NVMe target code such that data can be transferred > from an RDMA NIC to the special memory and then directly to an NVMe > device avoiding system memory entirely. The upside of this is better > QoS for applications running on the CPU utilizing memory and lower > PCI bandwidth required to the CPU (such that systems could be designed > with fewer lanes connected to the CPU). > > Due to these trade-offs we've designed the system to only enable using > the PCI memory in cases where the NIC, NVMe devices and memory are all > behind the same PCI switch hierarchy. This will mean many setups that > could likely work well will not be supported so that we can be more > confident it will work and not place any responsibility on the user to > understand their topology. (We chose to go this route based on feedback > we received at the last LSF). Future work may enable these transfers > using a white list of known good root complexes. However, at this time, > there is no reliable way to ensure that Peer-to-Peer transactions are > permitted between PCI Root Ports. > > For PCI P2P DMA transfers to work in this situation the ACS bits > must be disabled on the downstream ports (DSPs) for all devices > involved in the transfer. This can be done using the "disable_acs_redir" > PCI kernel command line option which was introduced in v4.19. > > In order to enable PCI P2P functionality, we introduce a few new PCI > functions such that a driver can register P2P memory with the system. > Struct pages are created for this memory using devm_memremap_pages() > and the PCI bus offset is stored in the corresponding pagemap structure. > > Another set of functions allow a client driver to create a list of > client devices that will be used in a given P2P transactions and then > use that list to find any P2P memory that is supported by all the > client devices. > > In the block layer, we also introduce a flag for a request queue > to indicate a given queue supports targeting P2P memory. The driver > submitting bios must ensure that the queue supports P2P before > attempting to submit BIOs backed by P2P memory. Also, P2P requests > are marked to not be merged seeing a non-homogenous request would > complicate the DMA mapping requirements. > > In the PCI NVMe driver, we modify the existing CMB support to utilize > the new PCI P2P memory infrastructure and also add support for P2P > memory in its request queue. When a P2P request is received it uses the > pci_p2pmem_map_sg() function which applies the necessary transformation > to get the corrent pci_bus_addr_t for the DMA transactions. > > In the RDMA core, we also adjust rdma_rw_ctx_init() and > rdma_rw_ctx_destroy() to take a flags argument which indicates whether > to use the PCI P2P mapping functions or not. To avoid odd RDMA devices > that don't use the proper DMA infrastructure this code rejects using > any device that employs the virt_dma_ops implementation. > > Finally, in the NVMe fabrics target port we introduce a new > configuration attribute: 'p2pmem'. When set to a true boolean, the port > will attempt to find P2P memory supported by the RDMA NIC and all namespaces. > It may also be set to a PCI device name to select a specific P2P > memory to use. If supported memory is found, it will be used in all IO > transfers. And if a port is using P2P memory, adding new namespaces that > are not supported by that memory will fail. > > These patches have been tested on a number of Intel based systems and > for a variety of RDMA NICs (Mellanox, Broadcomm, Chelsio) and NVMe > SSDs (Intel, Seagate, Samsung) and p2pdma devices (Eideticom, > Microsemi, Chelsio and Everspin) using switches from both Microsemi > and Broadcomm. > > -- > > Logan Gunthorpe (13): > PCI/P2PDMA: Support peer-to-peer memory > PCI/P2PDMA: Add sysfs group to display p2pmem stats > PCI/P2PDMA: Add PCI p2pmem DMA mappings to adjust the bus offset > PCI/P2PDMA: Introduce configfs/sysfs enable attribute helpers > docs-rst: Add a new directory for PCI documentation > PCI/P2PDMA: Add P2P DMA driver writer's documentation > block: Add PCI P2P flag for request queue and check support for > requests > IB/core: Ensure we map P2P memory correctly in > rdma_rw_ctx_[init|destroy]() > nvme-pci: Use PCI p2pmem subsystem to manage the CMB > nvme-pci: Add support for P2P memory in requests > nvme-pci: Add a quirk for a pseudo CMB > nvmet: Introduce helper functions to allocate and free request SGLs > nvmet: Optionally use PCI P2P memory > > Documentation/ABI/testing/sysfs-bus-pci | 24 + > Documentation/driver-api/index.rst | 2 +- > Documentation/driver-api/pci/index.rst | 21 + > Documentation/driver-api/pci/p2pdma.rst | 145 ++++ > Documentation/driver-api/{ => pci}/pci.rst | 0 > drivers/infiniband/core/rw.c | 11 +- > drivers/nvme/host/core.c | 4 + > drivers/nvme/host/nvme.h | 8 + > drivers/nvme/host/pci.c | 121 +-- > drivers/nvme/target/configfs.c | 45 ++ > drivers/nvme/target/core.c | 180 +++++ > drivers/nvme/target/io-cmd-bdev.c | 3 + > drivers/nvme/target/nvmet.h | 17 + > drivers/nvme/target/rdma.c | 22 +- > drivers/pci/Kconfig | 17 + > drivers/pci/Makefile | 1 + > drivers/pci/p2pdma.c | 809 +++++++++++++++++++++ > include/linux/blkdev.h | 3 + > include/linux/memremap.h | 6 + > include/linux/mm.h | 18 + > include/linux/pci-p2pdma.h | 114 +++ > include/linux/pci.h | 4 + > 22 files changed, 1521 insertions(+), 54 deletions(-) > create mode 100644 Documentation/driver-api/pci/index.rst > create mode 100644 Documentation/driver-api/pci/p2pdma.rst > rename Documentation/driver-api/{ => pci}/pci.rst (100%) > create mode 100644 drivers/pci/p2pdma.c > create mode 100644 include/linux/pci-p2pdma.h I added the reviewed-by tags from Christoph, Jens' ack on the blkdev.h change, and applied these to pci/peer-to-peer with the intent of merging these for v4.20. I gave up on waiting for an ack for the memremap.h and mm.h changes. I dropped the "nvme-pci: Add a quirk for a pseudo CMB" quirk because of Christoph's objection. After this is all merged, I won't need to be involved, and you and the NVMe folks can hash that out. If there are updates to "nvmet: Optionally use PCI P2P memory" based on Sagi's comments, send an incremental patch and I'll fold them in. Bjorn