On Tue, Sep 18, 2018 at 6:55 AM, Sergey Miroshnichenko <s.miroshnichenko@xxxxxxxxx> wrote: > Hello Sam, > > On 9/17/18 8:28 AM, Sam Bobroff wrote: >> Hi Sergey, >> >> On Fri, Sep 14, 2018 at 07:14:01PM +0300, Sergey Miroshnichenko wrote: >>> Introduce a new command line option "pci=pcie_movable_bars" that indicates >>> support of PCIe hotplug without prior reservation of memory regions by >>> BIOS/bootloader. >>> >>> If a new PCIe device has been hot-plugged between two active ones, which >>> have no (or not big enough) gap between their BARs, allocating new BARs >>> requires to move BARs of the following working devices: >>> >>> 1) dev 4 >>> | >>> v >>> .. | dev 3 | dev 3 | dev 5 | dev 7 | >>> .. | BAR 0 | BAR 1 | BAR 0 | BAR 0 | >>> >>> 2) dev 4 >>> | >>> v >>> .. | dev 3 | dev 3 | --> --> | dev 5 | dev 7 | >>> .. | BAR 0 | BAR 1 | --> --> | BAR 0 | BAR 0 | >>> >>> 3) >>> >>> .. | dev 3 | dev 3 | dev 4 | dev 4 | dev 5 | dev 7 | >>> .. | BAR 0 | BAR 1 | BAR 0 | BAR 1 | BAR 0 | BAR 0 | >>> >>> Not only BARs, but also bridge windows can be updated during a PCIe rescan, >>> threatening all memory transactions during this procedure, so the PCI >>> subsystem will instruct the drivers to pause by calling the reset_prepare() >>> and reset_done() callbacks. >>> >>> If a device may be affected by BAR movement, the BAR changes tracking must >>> be implemented in its driver. >>> >>> Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@xxxxxxxxx> >>> --- >>> .../admin-guide/kernel-parameters.txt | 6 +++ >>> drivers/pci/pci.c | 2 + >>> drivers/pci/probe.c | 43 +++++++++++++++++++ >>> include/linux/pci.h | 1 + >>> 4 files changed, 52 insertions(+) >>> >>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt >>> index 64a3bf54b974..f8132a709061 100644 >>> --- a/Documentation/admin-guide/kernel-parameters.txt >>> +++ b/Documentation/admin-guide/kernel-parameters.txt >>> @@ -3311,6 +3311,12 @@ >>> bridges without forcing it upstream. Note: >>> this removes isolation between devices and >>> may put more devices in an IOMMU group. >>> + pcie_movable_bars Arrange a space at runtime for BARs of >>> + hotplugged PCIe devices - usable if bootloader >>> + doesn't reserve memory regions for them. Freeing >>> + a space may require moving BARs of active devices >>> + to higher addresses, so device drivers will be >>> + paused during rescan. >>> >>> pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power >>> Management. >>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c >>> index 1835f3a7aa8d..5f07a59b5924 100644 >>> --- a/drivers/pci/pci.c >>> +++ b/drivers/pci/pci.c >>> @@ -6105,6 +6105,8 @@ static int __init pci_setup(char *str) >>> pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS); >>> } else if (!strncmp(str, "disable_acs_redir=", 18)) { >>> disable_acs_redir_param = str + 18; >>> + } else if (!strncmp(str, "pcie_movable_bars", 17)) { >>> + pci_add_flags(PCI_MOVABLE_BARS); >>> } else { >>> printk(KERN_ERR "PCI: Unknown option `%s'\n", >>> str); >>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c >>> index 201f9e5ff55c..bdaafc48dc4c 100644 >>> --- a/drivers/pci/probe.c >>> +++ b/drivers/pci/probe.c >>> @@ -3138,6 +3138,45 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge) >>> return max; >>> } >>> >>> +/* >>> + * Put all devices of the bus and its children to reset >>> + */ >>> +static void pci_bus_reset_prepare(struct pci_bus *bus) >>> +{ >>> + struct pci_dev *dev; >>> + >>> + list_for_each_entry(dev, &bus->devices, bus_list) { >>> + struct pci_bus *child = dev->subordinate; >>> + >>> + if (child) { >>> + pci_bus_reset_prepare(child); >>> + } else if (dev->driver && >>> + dev->driver->err_handler && >>> + dev->driver->err_handler->reset_prepare) { >>> + dev->driver->err_handler->reset_prepare(dev); >>> + } >> >> What about devices with drivers that don't have reset_prepare()? It >> looks like it will just reconfigure them anyway. Is that right? > > It is possible that unprepared driver without these hooks will get BARs > moved, I should put a warning message there. There three ways we can see > to make this safe:> It might make more sense to allow drivers to opt-in to this rather than assuming any driver with the error handlers supports re-assigning BARs on the fly. As far as I know we restore the previous BAR values a part of the EEH recovery process when there's a driver bound, but I could be wrong. > - add the reset_prepare()/reset_done() hooks to *every* PCIe driver; > - refuse BAR movement if at least one unprepared driver has been > encountered during rescan; > - reduce the number of drivers which can be affected to some > controllable value and prepare them on demand. Is there any reason we couldn't just unbind the unaware drivers and re-bind them afterwards? That might be useful when hotplugging NVMe racks since you wouldn't want a switch management driver (or similar) to prevent re-assignments if they're required. > Applying the second proposal as a major restriction seems fairly > reasonable, but for our particular setups and use-cases it is probably > too stiff: > - we've noticed that devices connected directly to the root bridge > don't get moved BARs, and this covers our x86_64 servers: we only > insert/remove devices into "second-level" and "lower" bridges there, but > not root; > > - on PowerNV we have system devices (network interfaces, USB hub, etc.) > grouped into dedicated domain, with all other domains ready for hotplug, > and only these domains can be rescanned. Are you doing anything to enforce this or just relying on people not re-scanning the system device bus? > With our scenarios currently reduced to these two, we can live with just > a few drivers "prepared" for now: NVME and few Ethernet adapters, this > gives us a possibility to use this feature before "converting" *all* the > drivers, and even have the NVidia cards running on a closed proprietary > driver. > > Should we make this behavior adjustable with something like > "pcie_movable_bars=safe" and "pcie_movable_bars=always" ? > > Thanks! > > Best regards, > Serge > >>> + } >>> +} >>> + >>> +/* >>> + * Complete the reset of all devices for the bus and its children >>> + */ >>> +static void pci_bus_reset_done(struct pci_bus *bus) >>> +{ >>> + struct pci_dev *dev; >>> + >>> + list_for_each_entry(dev, &bus->devices, bus_list) { >>> + struct pci_bus *child = dev->subordinate; >>> + >>> + if (child) { >>> + pci_bus_reset_done(child); >>> + } else if (dev->driver && dev->driver->err_handler && >>> + dev->driver->err_handler->reset_done) { >>> + dev->driver->err_handler->reset_done(dev); >>> + } >>> + } >>> +} >>> + >>> /** >>> * pci_rescan_bus - Scan a PCI bus for devices >>> * @bus: PCI bus to scan >>> @@ -3151,8 +3190,12 @@ unsigned int pci_rescan_bus(struct pci_bus *bus) >>> { >>> unsigned int max; >>> >>> + if (pci_has_flag(PCI_MOVABLE_BARS)) >>> + pci_bus_reset_prepare(bus); >>> max = pci_scan_child_bus(bus); >>> pci_assign_unassigned_bus_resources(bus); >>> + if (pci_has_flag(PCI_MOVABLE_BARS)) >>> + pci_bus_reset_done(bus); >>> pci_bus_add_devices(bus); >>> >>> return max; >>> diff --git a/include/linux/pci.h b/include/linux/pci.h >>> index 6925828f9f25..a8cb1a367c34 100644 >>> --- a/include/linux/pci.h >>> +++ b/include/linux/pci.h >>> @@ -847,6 +847,7 @@ enum { >>> PCI_ENABLE_PROC_DOMAINS = 0x00000010, /* Enable domains in /proc */ >>> PCI_COMPAT_DOMAIN_0 = 0x00000020, /* ... except domain 0 */ >>> PCI_SCAN_ALL_PCIE_DEVS = 0x00000040, /* Scan all, not just dev 0 */ >>> + PCI_MOVABLE_BARS = 0x00000080, /* Runtime BAR reassign after hotplug */ >>> }; >>> >>> /* These external functions are only available when PCI support is enabled */ >>> -- >>> 2.17.1 >>>