Re: [PATCH v9 00/13] Copy Offload in NVMe Fabrics with P2P PCI Memory

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Oct 04, 2018 at 03:27:34PM -0600, Logan Gunthorpe wrote:
> This is v9 for the p2pdma patch set. It has some substantial changes
> from the previous version. Essentially, the last patch has changed
> based on information from Sagi that the P2P memory can be assigned
> per namespace instead of globally. To that end, I've added a
> radix tree that stores the p2p devices to use for each namespace and
> a lookup will be done on that before allocating the memory.
> 
> This has some knock on effects of simplifying the requirements
> from the p2pdma code and so we drop all the client list code seeing
> we don't need to worry about selecting a P2P device that works
> with every namespace at once; only each namespace independantly.
> Thus, Patch 1, 6 and 13 have significant changes. However,
> I think the removal of the client list code will be generally
> appreciated based on the feedback I've gotten from Christian.
> 
> As usual, a git repo of the branch is available here:
> 
> https://github.com/sbates130272/linux-p2pmem pci-p2p-v9
> 
> Thanks,
> 
> Logan
> 
> --
> 
> Changes in v9:
> * Rebased on v4.19-rc6
> * Droped pci_p2pdma_add_client(), pci_p2pdma_remove_client() and
>   pci_p2pdma_client_list_free(), and reworked pci_p2pdma_distance() and
>   pci_p2pmem_find() to take simple arrays instead of list_heads.
> * Updated the documentation to match
> * Reworked the nvmet code to have a configfs attribute per namespace
>   instead of per port. Then each namespace has it's own P2P device
>   stored in a radix-tree in the controller (it can't be stored in
>   the nvme_ns structure seeing it needs to be unique per controller).
> * Dropped the 'sq' argument in nvmet_req_alloc_sgl() seeing I've noticed
>   it is now available in the nvmet_req structure.
> * Collected Keith's reviewed-by tags (however, I have not applied his
>   tag to the last patch seeing I think it has changed too much since
>   his review).
> 
> Changes in v8:
> 
> * Added a couple of comments to address Bart's feedback and
>   removed the bogus call to percpu_ref_switch_to_atomic_sync()
> 
> Changes in v7:
> 
> * Rebased on v4.19-rc5
> 
> * Fixed up commit message of patch 7 that was no longer accurate. (as
>   pointed out by Jens)
> * Change the configfs to not use "auto" or "none" and instead just
>   use a 0/1/<pci_dev> (or boolean). This matches the existing
>   nvme-target configfs booleans. (Per Bjorn)
> * A handful of other minor changes and edits that were noticed by Bjorn
> * Collected Acks from Bjorn
> 
> Changes in v6:
> 
> * Rebased on v4.19-rc3
> 
> * Remove the changes to the common submit_bio() path and instead
>   set REQ_NOMERGE in the NVME target driver, when appropriate.
>   Per discussions with Jens and Christoph.
> 
> * Some minor grammar changes in the documentation as spotted by Randy.
> 
> Changes in v5:
> 
> * Rebased on v4.19-rc1
> 
> * Drop changing ACS settings in this patchset. Now, the code
>   will only allow P2P transactions between devices whos
>   downstream ports do not restrict P2P TLPs.
> 
> * Drop the REQ_PCI_P2PDMA block flag and instead use
>   is_pci_p2pdma_page() to tell if a request is P2P or not. In that
>   case we check for queue support and enforce using REQ_NOMERGE.
>   Per feedback from Christoph.
> 
> * Drop the pci_p2pdma_unmap_sg() function as it was empty and only
>   there for symmetry and compatibility with dma_unmap_sg. Per feedback
>   from Christoph.
> 
> * Split off the logic to handle enabling P2P in NVMe fabrics' configfs
>   into specific helpers in the p2pdma code. Per feedback from Christoph.
> 
> * A number of other minor cleanups and fixes as pointed out by
>   Christoph and others.
> 
> Changes in v4:
> 
> * Change the original upstream_bridges_match() function to
>   upstream_bridge_distance() which calculates the distance between two
>   devices as long as they are behind the same root port. This should
>   address Bjorn's concerns that the code was to focused on
>   being behind a single switch.
> 
> * The disable ACS function now disables ACS for all bridge ports instead
>   of switch ports (ie. those that had two upstream_bridge ports).
> 
> * Change the pci_p2pmem_alloc_sgl() and pci_p2pmem_free_sgl()
>   API to be more like sgl_alloc() in that the alloc function returns
>   the allocated scatterlist and nents is not required bythe free
>   function.
> 
> * Moved the new documentation into the driver-api tree as requested
>   by Jonathan
> 
> * Add SGL alloc and free helpers in the nvmet code so that the
>   individual drivers can share the code that allocates P2P memory.
>   As requested by Christoph.
> 
> * Cleanup the nvmet_p2pmem_store() function as Christoph
>   thought my first attempt was ugly.
> 
> * Numerous commit message and comment fix-ups
> 
> Changes in v3:
> 
> * Many more fixes and minor cleanups that were spotted by Bjorn
> 
> * Additional explanation of the ACS change in both the commit message
>   and Kconfig doc. Also, the code that disables the ACS bits is surrounded
>   explicitly by an #ifdef
> 
> * Removed the flag we added to rdma_rw_ctx() in favour of using
>   is_pci_p2pdma_page(), as suggested by Sagi.
> 
> * Adjust pci_p2pmem_find() so that it prefers P2P providers that
>   are closest to (or the same as) the clients using them. In cases
>   of ties, the provider is randomly chosen.
> 
> * Modify the NVMe Target code so that the PCI device name of the provider
>   may be explicitly specified, bypassing the logic in pci_p2pmem_find().
>   (Note: it's still enforced that the provider must be behind the
>    same switch as the clients).
> 
> * As requested by Bjorn, added documentation for driver writers.
> 
> 
> Changes in v2:
> 
> * Renamed everything to 'p2pdma' per the suggestion from Bjorn as well
>   as a bunch of cleanup and spelling fixes he pointed out in the last
>   series.
> 
> * To address Alex's ACS concerns, we change to a simpler method of
>   just disabling ACS behind switches for any kernel that has
>   CONFIG_PCI_P2PDMA.
> 
> * We also reject using devices that employ 'dma_virt_ops' which should
>   fairly simply handle Jason's concerns that this work might break with
>   the HFI, QIB and rxe drivers that use the virtual ops to implement
>   their own special DMA operations.
> 
> --
> 
> This is a continuation of our work to enable using Peer-to-Peer PCI
> memory in the kernel with initial support for the NVMe fabrics target
> subsystem. Many thanks go to Christoph Hellwig who provided valuable
> feedback to get these patches to where they are today.
> 
> The concept here is to use memory that's exposed on a PCI BAR as
> data buffers in the NVMe target code such that data can be transferred
> from an RDMA NIC to the special memory and then directly to an NVMe
> device avoiding system memory entirely. The upside of this is better
> QoS for applications running on the CPU utilizing memory and lower
> PCI bandwidth required to the CPU (such that systems could be designed
> with fewer lanes connected to the CPU).
> 
> Due to these trade-offs we've designed the system to only enable using
> the PCI memory in cases where the NIC, NVMe devices and memory are all
> behind the same PCI switch hierarchy. This will mean many setups that
> could likely work well will not be supported so that we can be more
> confident it will work and not place any responsibility on the user to
> understand their topology. (We chose to go this route based on feedback
> we received at the last LSF). Future work may enable these transfers
> using a white list of known good root complexes. However, at this time,
> there is no reliable way to ensure that Peer-to-Peer transactions are
> permitted between PCI Root Ports.
> 
> For PCI P2P DMA transfers to work in this situation the ACS bits
> must be disabled on the downstream ports (DSPs) for all devices
> involved in the transfer. This can be done using the "disable_acs_redir"
> PCI kernel command line option which was introduced in v4.19.
> 
> In order to enable PCI P2P functionality, we introduce a few new PCI
> functions such that a driver can register P2P memory with the system.
> Struct pages are created for this memory using devm_memremap_pages()
> and the PCI bus offset is stored in the corresponding pagemap structure.
> 
> Another set of functions allow a client driver to create a list of
> client devices that will be used in a given P2P transactions and then
> use that list to find any P2P memory that is supported by all the
> client devices.
> 
> In the block layer, we also introduce a flag for a request queue
> to indicate a given queue supports targeting P2P memory. The driver
> submitting bios must ensure that the queue supports P2P before
> attempting to submit BIOs backed by P2P memory. Also, P2P requests
> are marked to not be merged seeing a non-homogenous request would
> complicate the DMA mapping requirements.
> 
> In the PCI NVMe driver, we modify the existing CMB support to utilize
> the new PCI P2P memory infrastructure and also add support for P2P
> memory in its request queue. When a P2P request is received it uses the
> pci_p2pmem_map_sg() function which applies the necessary transformation
> to get the corrent pci_bus_addr_t for the DMA transactions.
> 
> In the RDMA core, we also adjust rdma_rw_ctx_init() and
> rdma_rw_ctx_destroy() to take a flags argument which indicates whether
> to use the PCI P2P mapping functions or not. To avoid odd RDMA devices
> that don't use the proper DMA infrastructure this code rejects using
> any device that employs the virt_dma_ops implementation.
> 
> Finally, in the NVMe fabrics target port we introduce a new
> configuration attribute: 'p2pmem'. When set to a true boolean, the port
> will attempt to find P2P memory supported by the RDMA NIC and all namespaces.
> It may also be set to a PCI device name to select a specific P2P
> memory to use. If supported memory is found, it will be used in all IO
> transfers. And if a port is using P2P memory, adding new namespaces that
> are not supported by that memory will fail.
> 
> These patches have been tested on a number of Intel based systems and
> for a variety of RDMA NICs (Mellanox, Broadcomm, Chelsio) and NVMe
> SSDs (Intel, Seagate, Samsung) and p2pdma devices (Eideticom,
> Microsemi, Chelsio and Everspin) using switches from both Microsemi
> and Broadcomm.
> 
> --
> 
> Logan Gunthorpe (13):
>   PCI/P2PDMA: Support peer-to-peer memory
>   PCI/P2PDMA: Add sysfs group to display p2pmem stats
>   PCI/P2PDMA: Add PCI p2pmem DMA mappings to adjust the bus offset
>   PCI/P2PDMA: Introduce configfs/sysfs enable attribute helpers
>   docs-rst: Add a new directory for PCI documentation
>   PCI/P2PDMA: Add P2P DMA driver writer's documentation
>   block: Add PCI P2P flag for request queue and check support for
>     requests
>   IB/core: Ensure we map P2P memory correctly in
>     rdma_rw_ctx_[init|destroy]()
>   nvme-pci: Use PCI p2pmem subsystem to manage the CMB
>   nvme-pci: Add support for P2P memory in requests
>   nvme-pci: Add a quirk for a pseudo CMB
>   nvmet: Introduce helper functions to allocate and free request SGLs
>   nvmet: Optionally use PCI P2P memory
> 
>  Documentation/ABI/testing/sysfs-bus-pci    |  24 +
>  Documentation/driver-api/index.rst         |   2 +-
>  Documentation/driver-api/pci/index.rst     |  21 +
>  Documentation/driver-api/pci/p2pdma.rst    | 145 ++++
>  Documentation/driver-api/{ => pci}/pci.rst |   0
>  drivers/infiniband/core/rw.c               |  11 +-
>  drivers/nvme/host/core.c                   |   4 +
>  drivers/nvme/host/nvme.h                   |   8 +
>  drivers/nvme/host/pci.c                    | 121 +--
>  drivers/nvme/target/configfs.c             |  45 ++
>  drivers/nvme/target/core.c                 | 180 +++++
>  drivers/nvme/target/io-cmd-bdev.c          |   3 +
>  drivers/nvme/target/nvmet.h                |  17 +
>  drivers/nvme/target/rdma.c                 |  22 +-
>  drivers/pci/Kconfig                        |  17 +
>  drivers/pci/Makefile                       |   1 +
>  drivers/pci/p2pdma.c                       | 809 +++++++++++++++++++++
>  include/linux/blkdev.h                     |   3 +
>  include/linux/memremap.h                   |   6 +
>  include/linux/mm.h                         |  18 +
>  include/linux/pci-p2pdma.h                 | 114 +++
>  include/linux/pci.h                        |   4 +
>  22 files changed, 1521 insertions(+), 54 deletions(-)
>  create mode 100644 Documentation/driver-api/pci/index.rst
>  create mode 100644 Documentation/driver-api/pci/p2pdma.rst
>  rename Documentation/driver-api/{ => pci}/pci.rst (100%)
>  create mode 100644 drivers/pci/p2pdma.c
>  create mode 100644 include/linux/pci-p2pdma.h

I added the reviewed-by tags from Christoph, Jens' ack on the blkdev.h
change, and applied these to pci/peer-to-peer with the intent of
merging these for v4.20.

I gave up on waiting for an ack for the memremap.h and mm.h changes.

I dropped the "nvme-pci: Add a quirk for a pseudo CMB" quirk because
of Christoph's objection.  After this is all merged, I won't need to
be involved, and you and the NVMe folks can hash that out.

If there are updates to "nvmet: Optionally use PCI P2P memory" based
on Sagi's comments, send an incremental patch and I'll fold them in.

Bjorn



[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux