[+cc Keith]

On Fri, Dec 13, 2019 at 09:35:19AM +0100, Stefan Roese wrote:
> Hi!
>
> I am facing an issue with PCIe-Hotplug on an AMD Epyc based system.
> Our system is equipped with an HBA for NVMe SSDs incl. PCIe switch
> (Supermicro AOC-SLG3-4E2P) [1] and we would like to be able to
> hotplug NVMe disks.

Your system has several host bridges. The address space routed to
each host bridge is determined by firmware, and Linux has no support
for changing it. Here's the space routed to the hierarchy containing
the NVMe devices:

  ACPI: PCI Root Bridge [S0D2] (domain 0000 [bus 40-5f])
  pci_bus 0000:40: root bus resource [mem 0xeb000000-0xeb5fffff window]     6MB
  pci_bus 0000:40: root bus resource [mem 0x7fc8000000-0xfcffffffff window] 501GB+
  pci_bus 0000:40: root bus resource [bus 40-5f]

Since you have several host bridges, using "pci=nocrs" is pretty much
guaranteed to fail if Linux changes any PCI address assignments. It
makes Linux *ignore* the routing information from firmware, but it
doesn't *change* any of the routing. That's why experiment (d)
fails: we assigned this space:

  pci 0000:44:00.0: BAR 0: assigned [mem 0xec000000-0xec003fff 64bit]

but according to the BIOS, the [mem 0xec000000-0xefffffff window]
area is routed to bus 00, not bus 40, so when we try to access that
BAR, the access goes to bus 00, where nothing responds.

There are three devices on bus 40 that consume memory address space:

  40:03.1 Root Port to [bus 41-47]  [mem 0xeb400000-0xeb5fffff]  2MB
  40:07.1 Root Port to [bus 48]     [mem 0xeb200000-0xeb3fffff]  2MB
  40:08.1 Root Port to [bus 49]     [mem 0xeb000000-0xeb1fffff]  2MB

Bridges (including Root Ports and Switch Ports) consume memory
address space in 1MB chunks. The devices on buses 48 and 49 need a
little over 1MB, so 40:07.1 and 40:08.1 need at least 2MB each.
There's only 6MB available, so that leaves 2MB for 40:03.1, which
leads to the PLX switch.

That 2MB of memory space is routed to the PLX Switch Upstream Port,
which has a BAR of its own that requires 256K. Because of the 1MB
granularity, that BAR effectively consumes a whole 1MB chunk, which
leaves 1MB for the switch to route to its Downstream Ports. The
Intel NVMe device only needs 16KB of memory space, but since Switch
Port windows are a minimum of 1MB, only one Downstream Port gets
memory space.

So with this configuration, I think you're stuck. The only things I
can think of are:

  - Put the PLX switch in a different slot to see if the BIOS will
    assign more space to it (the other host bridges have more space
    available).

  - Boot with all four PLX slots occupied by NVMe devices. The BIOS
    may assign space to accommodate them all. If it does, you should
    be able to hot-remove and re-add devices after boot.

  - Change Linux to use prefetchable space. The Intel NVMe wants
    *non-prefetchable* space, but there's an implementation note in
    the spec (PCIe r5.0, sec 7.5.1.2.1) that says it should be safe
    to put it in prefetchable space in certain cases (the entire path
    is PCIe, no PCI/PCI-X devices do peer-to-peer reads, the host
    bridge does no byte merging, etc.). The main problem is that we
    don't have a good way to identify these cases.
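FWIW, the "entire path is PCIe" part of that check is the easy one.
Here's a rough, untested sketch using the existing pci_is_pcie() and
pci_upstream_bridge() helpers; the byte-merging and peer-to-peer
conditions are the ones I don't see a good way to detect from
software:

  #include <linux/pci.h>

  /*
   * Untested sketch: return true if every bridge between @dev and
   * the host bridge is PCIe, i.e., there is no conventional PCI or
   * PCI-X on the path.  This covers only one of the conditions in
   * the implementation note (PCIe r5.0, sec 7.5.1.2.1).
   */
  static bool pci_path_is_pcie(struct pci_dev *dev)
  {
          struct pci_dev *bridge;

          if (!pci_is_pcie(dev))
                  return false;

          for (bridge = pci_upstream_bridge(dev); bridge;
               bridge = pci_upstream_bridge(bridge))
                  if (!pci_is_pcie(bridge))
                          return false;

          return true;
  }
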
> Currently, I'm testing with v5.5.0-rc1 and series [2] applied. Here
> are a few tests and results that I did so far. All tests were done
> with one Intel NVMe SSD connected to one of the 4 NVMe ports of the
> HBA and the other 3 ports (currently) left unconnected:
>
> a) Kernel Parameter "pci=pcie_bus_safe"
>
>    The resources of the 3 unused PCIe slots of the PEX switch are
>    not assigned in this test.
>
> b) Kernel Parameter "pci=pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
>
>    With this test I restricted the resources of the HP slots to the
>    minimum. Still this results in unassigned resources for the
>    unused PCIe slots of the PEX switch.
>
> c) Kernel Parameter "pci=realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
>
>    Again, not all resources are assigned.
>
> d) Kernel Parameter "pci=nocrs,realloc,pcie_bus_safe,hpmemsize=0,hpiosize=0,hpmmiosize=1M,hpmmioprefsize=0"
>
>    Now all requested resources are available for the HP PCIe slots
>    of the PEX switch. But the NVMe driver fails while probing.
>    Debugging has shown that reading from the BAR of the NVMe disk
>    returns 0xffffffff. Also reading from the PLX PEX switch
>    registers returns 0xffffffff in this case (this works of course
>    without nocrs, when the BARs are mapped at a different address).
>
> Does anybody have a clue why the access to the PEX switch and / or
> the NVMe BAR does not work in the "nocrs" case? The BARs are
> located in the same window that is provided by the BIOS in the ACPI
> list (but is "ignored" in this case) [3].
>
> Or is it possible to get the HP resource mapping done correctly
> without setting "nocrs" for our setup with the PCIe/NVMe switch?
>
> [1] https://www.supermicro.com/en/products/accessories/addon/AOC-SLG3-4E2P.php
> [2] https://lkml.org/lkml/2019/12/9/388
> [3]
> [    0.701932] acpi PNP0A08:00: host bridge window [io 0x0cf8-0x0cff] (ignored)
> [    0.701934] acpi PNP0A08:00: host bridge window [io 0x0000-0x02ff window] (ignored)
> [    0.701935] acpi PNP0A08:00: host bridge window [io 0x0300-0x03af window] (ignored)
> [    0.701936] acpi PNP0A08:00: host bridge window [io 0x03e0-0x0cf7 window] (ignored)
> [    0.701937] acpi PNP0A08:00: host bridge window [io 0x03b0-0x03df window] (ignored)
> [    0.701938] acpi PNP0A08:00: host bridge window [io 0x0d00-0x3fff window] (ignored)
> [    0.701939] acpi PNP0A08:00: host bridge window [mem 0x000a0000-0x000bffff window] (ignored)
> [    0.701939] acpi PNP0A08:00: host bridge window [mem 0x000c0000-0x000dffff window] (ignored)
> [    0.701940] acpi PNP0A08:00: host bridge window [mem 0xec000000-0xefffffff window] (ignored)
> [    0.701941] acpi PNP0A08:00: host bridge window [mem 0x182c8000000-0x1ffffffffff window] (ignored)
> ...
> 41:00.0 PCI bridge: PLX Technology, Inc. PEX 9733 33-lane, 9-port PCI Express Gen 3 (8.0 GT/s) Switch (rev b0) (prog-if 00 [Normal decode])
>         Flags: bus master, fast devsel, latency 0, IRQ 47, NUMA node 2
>         Memory at ec400000 (32-bit, non-prefetchable) [size=256K]
>         Bus: primary=41, secondary=42, subordinate=47, sec-latency=0
>         I/O behind bridge: None
>         Memory behind bridge: ec000000-ec3fffff [size=4M]
>         Prefetchable memory behind bridge: None
>         Capabilities: <access denied>
>         Kernel driver in use: pcieport
>
> epyc@epyc-Super-Server:~/stefan$ sudo ./memtool md 0xec400000+0x10
> ec400000: ffffffff ffffffff ffffffff ffffffff  ................
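
By the way, /proc/iomem shows how Linux thinks the address space is
carved up, which is a quick way to cross-check where an address like
0xec400000 lands. Something like this trivial helper would do it
(untested sketch; "iomem-find" is a made-up name, not an existing
tool):

  /*
   * Print every /proc/iomem line whose range contains the given
   * address, e.g. "./iomem-find 0xec400000".  Run it as root, since
   * /proc/iomem shows zeroed addresses to unprivileged users.
   */
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
          unsigned long long addr, start, end;
          char line[256];
          FILE *f;

          if (argc != 2) {
                  fprintf(stderr, "usage: %s <address>\n", argv[0]);
                  return 1;
          }
          addr = strtoull(argv[1], NULL, 0);

          f = fopen("/proc/iomem", "r");
          if (!f) {
                  perror("/proc/iomem");
                  return 1;
          }

          while (fgets(line, sizeof(line), f))
                  if (sscanf(line, " %llx-%llx", &start, &end) == 2 &&
                      start <= addr && addr <= end)
                          fputs(line, stdout);

          fclose(f);
          return 0;
  }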