[PATCH 1/2] switch_discovery: Add new module to discover inter switch links between PCI-to-PCI bridges

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



This kernel module discovers the virtual inter-switch P2P links present
between two PCI-to-PCI bridges that allows an optimal data path for data
movement. The module creates sysfs entries for upstream PCI-to-PCI
bridges which supports the inter switch P2P links as below:

                             Host root bridge
                ---------------------------------------
                |                                     |
  NIC1 --- PCI Switch1 --- Inter-switch link --- PCI Switch2 --- NIC2
(af:00.0)   (ad:00.0)                             (8b:00.0)   (8d:00.0)
                |                                     |
               GPU1                                  GPU2
            (b0:00.0)                             (8e:00.0)
                               SERVER 1

/sys/kernel/pci_switch_link/virtual_switch_links
├── 0000:8b:00.0
│   └── 0000:ad:00.0 -> ../0000:ad:00.0
└── 0000:ad:00.0
    └── 0000:8b:00.0 -> ../0000:8b:00.0

Signed-off-by: Shivasharan S <shivasharan.srikanteshwara@xxxxxxxxxxxx>
Signed-off-by: Sumanesh Samanta <sumanesh.samanta@xxxxxxxxxxxx>
---
 .../driver-api/pci/switch_discovery.rst       |  52 +++
 MAINTAINERS                                   |  13 +
 drivers/pci/switch/Kconfig                    |   9 +
 drivers/pci/switch/Makefile                   |   1 +
 drivers/pci/switch/switch_discovery.c         | 375 ++++++++++++++++++
 drivers/pci/switch/switch_discovery.h         |  44 ++
 6 files changed, 494 insertions(+)
 create mode 100644 Documentation/driver-api/pci/switch_discovery.rst
 create mode 100644 drivers/pci/switch/switch_discovery.c
 create mode 100644 drivers/pci/switch/switch_discovery.h

diff --git a/Documentation/driver-api/pci/switch_discovery.rst b/Documentation/driver-api/pci/switch_discovery.rst
new file mode 100644
index 000000000000..7c1476260e5e
--- /dev/null
+++ b/Documentation/driver-api/pci/switch_discovery.rst
@@ -0,0 +1,52 @@
+=================================
+Linux PCI Switch discovery module
+=================================
+
+Modern PCI switches support inter switch Peer-to-Peer(P2P) data transfer
+without using host resources. For example, Broadcom(PLX) PCIe Switches have a
+capability where a single physical switch can be divided up into multiple
+virtual switches at SOD. PCIe switch discovery module detects the virtual links
+between the switches that belong to the same physical switch.
+This allows user space applications to discover these virtual links that belong
+to the same physical switch and configure optimized data paths.
+
+Userspace Interface
+===================
+
+The module exposes sysfs entries for user space applications like MPI, NCCL,
+UCC, RCCL, HCCL, etc to discover the virtual switch links.
+
+Consider the below topology
+
+                             Host root bridge
+                ---------------------------------------
+                |                                     |
+  NIC1 --- PCI Switch1 --- Inter-switch link --- PCI Switch2 --- NIC2
+(af:00.0)   (ad:00.0)                             (8b:00.0)   (8d:00.0)
+                |                                     |
+               GPU1                                  GPU2
+            (b0:00.0)                             (8e:00.0)
+                               SERVER 1
+
+The simple topology above shows SERVER1, has Switch1 and Switch2 which are
+virtual switches that belong to the same physical switch that support
+Inter switch P2P.
+Switch1 and Switch2 have a GPU and NIC each connected.
+The module will detect the virtual P2P link existing between the two switches
+and create the sysfs entries as below.
+
+/sys/kernel/pci_switch_link/virtual_switch_links
+├── 0000:8b:00.0
+│   └── 0000:ad:00.0 -> ../0000:ad:00.0
+└── 0000:ad:00.0
+    └── 0000:8b:00.0 -> ../0000:8b:00.0
+
+The HPC/AI libraries that analyze the topology can decide the optimal data
+path like: NIC1->GPU1->GPU2->NIC1 which would have otherwise take a
+non-optimal path like NIC1->GPU1->GPU2->GPU1->NIC1.
+
+Enable P2P DMA to discover virtual links
+----------------------------------------
+The module also enhances :c:func:`pci_p2pdma_distance()` to determine a virtual
+link between the upstream PCI-to-PCI bridges of the devices and detect optimal
+path for applications using P2P DMA API.
diff --git a/MAINTAINERS b/MAINTAINERS
index 823387766a0c..b1bf3533ea6f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17359,6 +17359,19 @@ F:	Documentation/driver-api/pci/p2pdma.rst
 F:	drivers/pci/p2pdma.c
 F:	include/linux/pci-p2pdma.h
 
+PCI SWITCH DISCOVERY
+M:	Shivasharan S <shivasharan.srikanteshwara@xxxxxxxxxxxx>
+M:	Sumanesh Samanta <sumanesh.samanta@xxxxxxxxxxxx>
+L:	linux-pci@xxxxxxxxxxxxxxx
+S:	Maintained
+Q:	https://patchwork.kernel.org/project/linux-pci/list/
+B:	https://bugzilla.kernel.org
+C:	irc://irc.oftc.net/linux-pci
+T:	git git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git
+F:	Documentation/driver-api/pci/switch_discovery.rst
+F:	drivers/pci/switch/switch_discovery.c
+F:	drivers/pci/switch/switch_discovery.h
+
 PCI SUBSYSTEM
 M:	Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
 L:	linux-pci@xxxxxxxxxxxxxxx
diff --git a/drivers/pci/switch/Kconfig b/drivers/pci/switch/Kconfig
index d370f4ce0492..fb4410153950 100644
--- a/drivers/pci/switch/Kconfig
+++ b/drivers/pci/switch/Kconfig
@@ -12,4 +12,13 @@ config PCI_SW_SWITCHTEC
 	 devices. See <file:Documentation/driver-api/switchtec.rst> for more
 	 information.
 
+config PCI_SW_DISCOVERY
+	depends on PCI
+	tristate "PCI Switch discovery module"
+	help
+	 This kernel module discovers the PCI-to-PCI bridges of PCIe switches
+	 and forms the virtual switch links if the bridges belong to the same
+	 Physical switch. The switch links help to identify shorter distances
+	 for P2P configurations.
+
 endmenu
diff --git a/drivers/pci/switch/Makefile b/drivers/pci/switch/Makefile
index acd56d3b4a35..a3584b5146af 100644
--- a/drivers/pci/switch/Makefile
+++ b/drivers/pci/switch/Makefile
@@ -1,2 +1,3 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_PCI_SW_SWITCHTEC) += switchtec.o
+obj-$(CONFIG_PCI_SW_DISCOVERY) += switch_discovery.o
diff --git a/drivers/pci/switch/switch_discovery.c b/drivers/pci/switch/switch_discovery.c
new file mode 100644
index 000000000000..a427d3885b1f
--- /dev/null
+++ b/drivers/pci/switch/switch_discovery.c
@@ -0,0 +1,375 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ *  PCI Switch Discovery module
+ *
+ *  Copyright (c) 2024  Broadcom Inc.
+ *
+ *  Authors: Broadcom Inc.
+ *           Sumanesh Samanta <sumanesh.samanta@xxxxxxxxxxxx>
+ *           Shivasharan S <shivasharan.srikanteshwara@xxxxxxxxxxxx>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/sysfs.h>
+#include <linux/slab.h>
+#include <linux/rwsem.h>
+#include <linux/pci.h>
+#include <linux/vmalloc.h>
+#include "switch_discovery.h"
+
+static DECLARE_RWSEM(sw_disc_rwlock);
+static struct kobject *sw_disc_kobj, *sw_link_kobj;
+static struct kobject *sw_kobj[SWD_MAX_VIRT_SWITCH];
+static DECLARE_BITMAP(swdata_valid, SWD_MAX_VIRT_SWITCH);
+
+static struct switch_data *swdata;
+
+static int sw_disc_probe(void);
+static int sw_disc_create_sysfs_files(void);
+static bool brcm_sw_is_p2p_supported(struct pci_dev *pdev, char *serial_num);
+
+static inline bool sw_disc_is_supported_pdev(struct pci_dev *pdev)
+{
+	if ((pdev->vendor == PCI_VENDOR_ID_LSI) &&
+	   ((pdev->device == PCI_DEVICE_ID_BRCM_PEX_89000_HLC) ||
+	    (pdev->device == PCI_DEVICE_ID_BRCM_PEX_89000_LLC)))
+		return true;
+
+	return false;
+}
+
+static ssize_t sw_disc_show(struct kobject *kobj,
+			struct kobj_attribute *attr,
+			char *buf)
+{
+	int retval;
+
+	down_write(&sw_disc_rwlock);
+	retval = sw_disc_probe();
+	if (!retval) {
+		pr_debug("No new switch found\n");
+		goto exit_success;
+	}
+
+	retval = sw_disc_create_sysfs_files();
+	if (retval < 0) {
+		pr_err("Failed to create the sysfs entries, retval %d\n",
+		       retval);
+	}
+
+exit_success:
+	up_write(&sw_disc_rwlock);
+	return sysfs_emit(buf, SWD_SCAN_DONE);
+}
+
+/* This function probes the PCIe devices for virtual links */
+static int sw_disc_probe(void)
+{
+	int i, bit;
+	struct pci_dev *pdev = NULL;
+	int topology_changed = 0;
+	DECLARE_BITMAP(sw_found_map, SWD_MAX_VIRT_SWITCH);
+
+	while ((pdev = pci_get_device(PCI_ANY_ID, PCI_ANY_ID, pdev)) != NULL) {
+		int sw_found;
+
+		/* Currently this function only traverses Broadcom
+		 * PEX switches and determines the virtual SW links.
+		 * Other Switch vendors can add their specific logic
+		 * determine the virtual links.
+		 */
+		if (!sw_disc_is_supported_pdev(pdev))
+			continue;
+
+		sw_found = -1;
+
+		for_each_set_bit(bit, swdata_valid, SWD_MAX_VIRT_SWITCH) {
+			if (swdata[bit].devfn == pdev->devfn &&
+			    swdata[bit].bus == pdev->bus) {
+				sw_found = bit;
+				set_bit(sw_found, sw_found_map);
+				break;
+			}
+		}
+
+		if (sw_found != -1)
+			continue;
+
+		for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++)
+			if (!swdata[i].bus)
+				break;
+
+		if (i >= SWD_MAX_VIRT_SWITCH) {
+			pr_err("Max switch exceeded\n");
+			break;
+		}
+
+		sw_found = i;
+
+		if (!brcm_sw_is_p2p_supported(pdev, (char *)&swdata[sw_found].serial_num))
+			continue;
+
+		/* Found a new switch which supports P2P */
+		swdata[sw_found].devfn = pdev->devfn;
+		swdata[sw_found].bus = pdev->bus;
+
+		topology_changed = 1;
+		set_bit(sw_found, sw_found_map);
+		set_bit(sw_found, swdata_valid);
+	}
+
+	/* handle device removal */
+	for_each_clear_bit(bit, sw_found_map, SWD_MAX_VIRT_SWITCH) {
+		if (test_bit(bit, swdata_valid)) {
+			memset(&swdata[bit], 0, sizeof(swdata[i]));
+			clear_bit(bit, swdata_valid);
+			topology_changed = 1;
+		}
+	}
+
+	return topology_changed;
+}
+
+/* Check the various config space registers of the Broadcom PCI device and
+ * return true if the device supports inter switch P2P.
+ * If P2P is supported, return the device serial number back to
+ * caller.
+ */
+bool brcm_sw_is_p2p_supported(struct pci_dev *pdev, char *serial_num)
+{
+	int base;
+	u32 cap_data1, cap_data2;
+	u16 vsec;
+	u32 vsec_data;
+
+	base = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_DSN);
+	if (!base) {
+		pr_debug("Failed to get extended capability bus %x devfn %x\n",
+			 pdev->bus->number, pdev->devfn);
+		return false;
+	}
+
+	vsec = pci_find_vsec_capability(pdev, PCI_VENDOR_ID_LSI, 1);
+	if (!vsec) {
+		pr_debug("Failed to get VSEC bus %x devfn %x\n",
+			 pdev->bus->number, pdev->devfn);
+		return false;
+	}
+
+	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_UPSTREAM)
+		return false;
+
+	pci_read_config_dword(pdev, base + 8, &cap_data1);
+	pci_read_config_dword(pdev, base + 4, &cap_data2);
+
+	pci_read_config_dword(pdev, vsec + 12, &vsec_data);
+
+	pr_debug("Found Broadcom device bus 0x%x devfn 0x%x "
+		 "Serial Number: 0x%x 0x%x, VSEC 0x%x\n",
+		 pdev->bus->number, pdev->devfn,
+		 cap_data1, cap_data2, vsec_data);
+
+	if (!SECURE_PART(cap_data1))
+		return false;
+
+	if (!(P2PMASK(vsec_data) & INTER_SWITCH_LINK))
+		return false;
+
+	if (serial_num)
+		snprintf(serial_num, SWD_MAX_CHAR, "%x%x", cap_data1, cap_data2);
+
+	return true;
+}
+
+static int sw_disc_create_sysfs_files(void)
+{
+	int i, j, retval;
+
+	for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) {
+		if (sw_kobj[i]) {
+			kobject_put(sw_kobj[i]);
+			sw_kobj[i] = NULL;
+		}
+	}
+
+	if (sw_link_kobj) {
+		kobject_put(sw_link_kobj);
+		sw_link_kobj = NULL;
+	}
+
+	sw_link_kobj = kobject_create_and_add(SWD_LINK_DIR_STRING, sw_disc_kobj);
+	if (!sw_link_kobj) {
+		pr_err("Failed to create pci link object\n");
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) {
+		int segment, bus, device, function;
+		char bdf_i[SWD_MAX_CHAR];
+
+		if (!test_bit(i, swdata_valid))
+			continue;
+
+		segment = pci_domain_nr(swdata[i].bus);
+		bus = swdata[i].bus->number;
+		device = pci_ari_enabled(swdata[i].bus) ?
+				0 : PCI_SLOT(swdata[i].devfn);
+		function = pci_ari_enabled(swdata[i].bus) ?
+				swdata[i].devfn : PCI_FUNC(swdata[i].devfn);
+		sprintf(bdf_i, "%04x:%02x:%02x.%x",
+			segment, bus, device, function);
+
+		for (j = i + 1; j < SWD_MAX_VIRT_SWITCH; j++) {
+			char bdf_j[SWD_MAX_CHAR];
+
+			if (!test_bit(j, swdata_valid))
+				continue;
+			segment = pci_domain_nr(swdata[j].bus);
+			bus = swdata[j].bus->number;
+			device = pci_ari_enabled(swdata[j].bus) ?
+					0 : PCI_SLOT(swdata[j].devfn);
+			function = pci_ari_enabled(swdata[j].bus) ?
+					swdata[j].devfn : PCI_FUNC(swdata[j].devfn);
+			sprintf(bdf_j, "%04x:%02x:%02x.%x",
+				segment, bus, device, function);
+
+			if (strcmp(swdata[i].serial_num, swdata[j].serial_num) == 0) {
+				if (!sw_kobj[i]) {
+					sw_kobj[i] = kobject_create_and_add(bdf_i,
+									    sw_link_kobj);
+					if (!sw_kobj[i]) {
+						pr_err("Failed to create sysfs entry for switch %s\n",
+						       bdf_i);
+					}
+				}
+
+				if (!sw_kobj[j]) {
+					sw_kobj[j] = kobject_create_and_add(bdf_j,
+									    sw_link_kobj);
+					if (!sw_kobj[j]) {
+						pr_err("Failed to create sysfs entry for switch %s\n",
+						       bdf_j);
+					}
+				}
+
+				retval = sysfs_create_link(sw_kobj[i], sw_kobj[j], bdf_j);
+				if (retval)
+					pr_err("Error creating symlink %s and %s\n",
+					       bdf_i, bdf_j);
+
+				retval = sysfs_create_link(sw_kobj[j], sw_kobj[i], bdf_i);
+				if (retval)
+					pr_err("Error creating symlink %s and %s\n",
+					       bdf_j, bdf_i);
+			}
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Check if the two pci devices have virtual P2P link available.
+ * This function is used by the p2pdma to determine virtual
+ * links between the PCI-to-PCI bridges
+ */
+bool sw_disc_check_virtual_link(struct pci_dev *a,
+				 struct pci_dev *b)
+{
+	char serial_num_a[SWD_MAX_CHAR], serial_num_b[SWD_MAX_CHAR];
+
+	/*
+	 * Check if the PCIe devices support Virtual P2P links
+	 */
+	if (!sw_disc_is_supported_pdev(a))
+		return false;
+
+	if (!sw_disc_is_supported_pdev(b))
+		return false;
+
+	if (brcm_sw_is_p2p_supported(a, serial_num_a) &&
+	    brcm_sw_is_p2p_supported(b, serial_num_b))
+		if (!strcmp(serial_num_a, serial_num_b))
+			return true;
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(sw_disc_check_virtual_link);
+
+static struct kobj_attribute sw_disc_attribute =
+	__ATTR(SWD_FILE_NAME_STRING, 0444, sw_disc_show, NULL);
+
+// Create attribute group
+static struct attribute *attrs[] = {
+	&sw_disc_attribute.attr,
+	NULL,
+};
+
+static struct attribute_group attr_group = {
+	.attrs = attrs,
+};
+
+static int __init sw_discovery_init(void)
+{
+	int i, retval;
+
+	for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++)
+		sw_kobj[i] = NULL;
+
+	// Create "sw_disc" kobject
+	sw_disc_kobj = kobject_create_and_add(SWD_DIR_STRING, kernel_kobj);
+	if (!sw_disc_kobj) {
+		pr_err("Failed to create sw_disc_kobj\n");
+		return -ENOMEM;
+	}
+
+	retval = sysfs_create_group(sw_disc_kobj, &attr_group);
+	if (retval) {
+		pr_err("Cannot register sysfs attribute group\n");
+		kobject_put(sw_disc_kobj);
+	}
+
+	swdata = kzalloc(sizeof(swdata) * SWD_MAX_VIRT_SWITCH, GFP_KERNEL);
+	if (!swdata) {
+		sysfs_remove_group(sw_disc_kobj, &attr_group);
+		kobject_put(sw_disc_kobj);
+		return 0;
+	}
+
+	pr_info("Loading PCIe switch discovery module, version %s\n",
+		SWITCH_DISC_VERSION);
+
+	return 0;
+}
+
+static void __exit sw_discovery_exit(void)
+{
+	int i;
+
+	if (!swdata)
+		kfree(swdata);
+
+	for (i = 0; i < SWD_MAX_VIRT_SWITCH; i++) {
+		if (sw_kobj[i])
+			kobject_put(sw_kobj[i]);
+	}
+
+	// Remove kobject
+	if (sw_link_kobj)
+		kobject_put(sw_link_kobj);
+
+	sysfs_remove_group(sw_disc_kobj, &attr_group);
+	kobject_put(sw_disc_kobj);
+}
+
+module_init(sw_discovery_init);
+module_exit(sw_discovery_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Broadcom Inc.");
+MODULE_VERSION(SWITCH_DISC_VERSION);
+MODULE_DESCRIPTION("PCIe Switch Discovery Module");
diff --git a/drivers/pci/switch/switch_discovery.h b/drivers/pci/switch/switch_discovery.h
new file mode 100644
index 000000000000..b84f5d2e29ac
--- /dev/null
+++ b/drivers/pci/switch/switch_discovery.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ *  PCI Switch Discovery module
+ *
+ *  Copyright (c) 2024  Broadcom Inc.
+ *
+ *  Authors: Broadcom Inc.
+ *           Sumanesh Samanta <sumanesh.samanta@xxxxxxxxxxxx>
+ *           Shivasharan S <shivasharan.srikanteshwara@xxxxxxxxxxxx>
+ */
+
+#ifndef PCI_SWITCH_DISC_H
+#define PCI_SWITCH_DISC_H
+
+#define SWD_MAX_SWITCH		32
+#define SWD_MAX_VER_PER_SWITCH	8
+
+#define SWD_MAX_VIRT_SWITCH	(SWD_MAX_SWITCH * SWD_MAX_VER_PER_SWITCH)
+#define SWD_MAX_CHAR		16
+#define SWITCH_DISC_VERSION	"0.1.1"
+#define SWD_DIR_STRING		"pci_switch_link"
+#define SWD_LINK_DIR_STRING	"virtual_switch_links"
+#define SWD_SCAN_DONE		"done\n"
+
+#define SWD_FILE_NAME_STRING	refresh_switch_toplogy
+
+/* Broadcom Vendor Specific definitions */
+#define PCI_VENDOR_ID_LSI			0x1000
+#define PCI_DEVICE_ID_BRCM_PEX_89000_HLC	0xC030
+#define PCI_DEVICE_ID_BRCM_PEX_89000_LLC	0xC034
+
+#define P2PMASK(x)		(((x) & 0x300) >> 8)
+#define SECURE_PART(x)		((x) & 0x8)
+#define INTER_SWITCH_LINK	0x2
+
+struct switch_data {
+	int  devfn;
+	struct pci_bus *bus;
+	char serial_num[SWD_MAX_CHAR];
+};
+
+bool sw_disc_check_virtual_link(struct pci_dev *a, struct pci_dev *b);
+
+#endif /* PCI_SWITCH_DISC_H */
-- 
2.43.0

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature


[Index of Archives]     [DMA Engine]     [Linux Coverity]     [Linux USB]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [Greybus]

  Powered by Linux