[PATCH 0/2] pci/switch_discovery: Add new module to discover inter-switch P2P links

Shivasharan S <shivasharan.srikanteshwara@xxxxxxxxxxxx> · Wed, 12 Jun 2024 04:27:34 -0700

A. Introductory definitions:

Virtual Switch: Broadcom(PLX) switches have a capability where a
single physical switch can be divided up into N number of virtual
switches at SOD. For example, a single physical switch with 64 ports
can be configured to appear to the host as 2 switches with 32 ports
each. This is a static configuration that needs to be done before the
switch boots, and cannot generally be changed on the fly. Now consider
a GPU in Virtual switch 1 and a NIC on Virtual switch 2. The key here
is that it's actually the same switch, and IF P2P is enabled between
the two virtual switches, then that would be almost infinite bandwidth
between the GPU and the NIC. However, today there is no way for the
host to know that, and host applications believe that any data
exchange between the GPU and NIC must go through host root port and
thus would be slow. Note: Any such P2P must follow ACS/IOMMU rules,
and has to be enabled in the Broadcom switches.

Inter Switch Link: While the current use-case is about the virtual
switch config above, this could also extend to physical switch, where
the two physical switches have, say, a x16 PCIe connection between
them.

B: Goal/Problem statement:

Goal 1: Summary: Provide user applications a means by which they can
discover two virtual switches to be part of the same physical switch
or when physical switches are physically connected to each other, so
that they can discover optimized data path for HPC/AI applications.

With the rapid progression of High Performance Computing (HPC) and
Artificial Intelligence (AI), it is becoming more and more common to
have complex topologies with multiple GPU, NIC, NVMe devices etc
interconnected using multiple switches.   HPC and AI libraries like
MPI, UCC, NCCL,RCCL, HCCL etc analyze this topology to build a
topology tree to optimize data path for collective operations like
all-reduce etc .

Example:

                             Host root bridge
                ---------------------------------------
                |                  |                  |
  NIC1 --- PCI Switch1        PCI Switch2        PCI Switch3 --- NIC2
                |                  |                  |
               GPU1 ------------- GPU2 ------------- GPU3

                               SERVER 1

In the simple picture above in Server1, Switch1, Switch2, Switch3
are all connected to the host bridge and each switch has a GPU
connected, and Switch1/3 each has a NIC connected.
In a typical AI setup, there are many such servers, each connected by
upper level network switch, and "rail optimized", ie, NIC1 of all
servers are connected to Ethernet Switch1, NIC2 connected to Ethernet
Switch2 etc (Ethernet switches are not shown in picture above)
The GPUs are connected among themselves by some backend fabric, like
NVLINK (NVIDIA).
Assume that in the above diagram, PCI Switch1  and PCI Switch3 are
virtual switches belonging to the same physical switch and thus a very
high speed data link exists between them, but today host applications
have no knowledge about that.
(This is a very simple example, and modern AI infrastructure can be
way more complex than that.)

Now for collective operations like all-reduce, the HPC/AI libraries
analyze the topology above and typically decide on a data path like
this: NIC1->GPU1->GPU2->GPU3-> NIC2 which is suboptimal, because
ideally data should come go in and out through the same NIC because of
"rail optimized" topology.
Some libraries do this:NIC1->GPU1->GPU2->GPU3-> GPU1->NIC1.
The applications do the above because they think data from GPU3 to
NIC1  needs to go through the host root port, which is very
inefficient. What they do not know is that Switch1 and Switch3 are the
same physical entity with virtually infinite bandwidth between them,
and with that, they would have chosen a path like:
NIC1->GPU1->GPU2->GPU3->NIC1, which is the most optimized in the above
example.

Goal 2: Extend Linux P2PDMA distance function pci_p2pdma_distance to
account for Virtual Switch and physical switches connected by inter
switch link. The current implementation of the function has no
knowledge of Virtual switch and inter switch link.
Consider the example below:

     -+  Root Port
      \+ Switch1 Upstream Port
         +-+ Switch1 Downstream Port 0
          \- Device A
      \+ Switch2 Upstream Port
         +-+ Switch2 Downstream Port 0
           \- Device B

Suppose Switch1 and Switch2 are virtual switches belonging  to the
same physical switch. Today P2PDMA distance between Device A and
Device B  will return PCI_P2PDMA_MAP_THRU_HOST_BRIDGE, as kernel has
no idea that switch1 and switch2 are actually physically connected to
each other.
We intend to fix that, so that pci_p2pdma_distance now takes into
account switch connectivity information.

C. FAQs

FAQ 1:  How does this feature work with ACS/IOMMU?
This feature does NOT add any new connectivity.  The inter-switch
/virtual switch connections already follow all ACS/IOMMU rules, and
only if allowed by ACS settings, they allow for data to follow a
shortcut connection between switches and bypass the root port. The
only thing this module does is provide the switch connection
information to application software and pci_p2pdma_distance clients,
so that they can make intelligent decisions for the data path.

FAQ 2:  Is this feature Broadcom specific and will it work for other
vendors?
The current implementation of the kernel module looks at Broadcom
Vendor specific extensions to determine if switch p2p is enabled.
Thus, the current implementation works only on Broadcom switches. That
being said, other vendors are free to extend/modify the code to
support their switch. The function names, code structure and sysfs path
that exposes the pci switch p2p is on purpose made generic, to allow for
extension of support to other vendors.

FAQ 3: Why can't applications read the Broadcom vendor specific
information directly from the config space? Why do we need the sysfs
path?
The vendor specific section of PCIe config space is not readable by
applications running in non-root mode, as such applications can only
read the first few bytes of the config space. Besides, reading the
vendor specific config space will not make the solution generic.

FAQ 4: Will applications still use the standard P2P model of
registering the provider, client etc?
Absolutely. All existing p2p api will work as is. All that this module
provides is information that a fast connection exists between switches
 and/or pci endpoints. To make the actual p2p DMA, application need
use existing p2p API and follow existing ACS/IOMMU rules

FAQ 5: Why can't we only modify the existing pci_p2pdma_distance
function, and expose a p2pdistance to userspace? Why do we need the
new sysfs entries for pci switch connectivity?
The existing HPC/AI libraries like MPI, UCC, NCCL,RCCL, HCCL etc work
not only with PCIe switches, but also with other kind of connectivity,
like TCP, network switches, infiniband and backend inter GPU
connectivity like NVLINK and AFL. Because of that, the libraries have
matured code that analyzes all the connections and entire topology to
determine the most optimal data path among nodes. Just using
pci_p2pdma_distance does not work for them, because there might be a
shorter path between two nodes using NVLINK or a network switch.  In
theory those libraries could be modified to use pci_p2pdma_distance
for PCIe connection and other method for other connection, but in
practice that is near impossible, as those changes are very intrusive
and those libraries have matured for a long time,. Their respective
maintainers are highly reluctant to make such a big change and rather
get only the missing information, that is whether two switches are
connected together. Broadcom has received such first hand feedback.
Forcing everyone to use p2pdistance only will defeat the whole purpose
of this module. However, we do want to support those libraries that
want to use pci_p2pdma_distance, and that is why we are extending
pci_p2pdma_distance function too. Thus, our goal here is to enable
existing libraries to get only the information they need, while having
means for new code or more flexible code to use pci_p2pdma_distance as
needed.

Shivasharan S (2):
  switch_discovery: Add new module to discover inter switch links
    between PCI-to-PCI bridges
  pci/p2pdma: Modify p2p_dma_distance to detect virtual P2P links

 .../driver-api/pci/switch_discovery.rst       |  52 +++
 MAINTAINERS                                   |  13 +
 drivers/pci/Kconfig                           |   1 +
 drivers/pci/p2pdma.c                          |  18 +-
 drivers/pci/switch/Kconfig                    |   9 +
 drivers/pci/switch/Makefile                   |   1 +
 drivers/pci/switch/switch_discovery.c         | 375 ++++++++++++++++++
 drivers/pci/switch/switch_discovery.h         |  44 ++
 8 files changed, 512 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/driver-api/pci/switch_discovery.rst
 create mode 100644 drivers/pci/switch/switch_discovery.c
 create mode 100644 drivers/pci/switch/switch_discovery.h

-- 
2.43.0

Attachment:
smime.p7s

Description: S/MIME Cryptographic Signature