[+cc Russell, Ben, Oliver, linuxppc-dev] On Mon, Sep 17, 2018 at 11:55:43PM +0300, Sergey Miroshnichenko wrote: > Hello Sam, > > On 9/17/18 8:28 AM, Sam Bobroff wrote: > > Hi Sergey, > > > > On Fri, Sep 14, 2018 at 07:14:01PM +0300, Sergey Miroshnichenko wrote: > >> Introduce a new command line option "pci=pcie_movable_bars" that indicates > >> support of PCIe hotplug without prior reservation of memory regions by > >> BIOS/bootloader. > >> > >> If a new PCIe device has been hot-plugged between two active ones, which > >> have no (or not big enough) gap between their BARs, allocating new BARs > >> requires to move BARs of the following working devices: > >> > >> 1) dev 4 > >> | > >> v > >> .. | dev 3 | dev 3 | dev 5 | dev 7 | > >> .. | BAR 0 | BAR 1 | BAR 0 | BAR 0 | > >> > >> 2) dev 4 > >> | > >> v > >> .. | dev 3 | dev 3 | --> --> | dev 5 | dev 7 | > >> .. | BAR 0 | BAR 1 | --> --> | BAR 0 | BAR 0 | > >> > >> 3) > >> > >> .. | dev 3 | dev 3 | dev 4 | dev 4 | dev 5 | dev 7 | > >> .. | BAR 0 | BAR 1 | BAR 0 | BAR 1 | BAR 0 | BAR 0 | > >> > >> Not only BARs, but also bridge windows can be updated during a PCIe rescan, > >> threatening all memory transactions during this procedure, so the PCI > >> subsystem will instruct the drivers to pause by calling the reset_prepare() > >> and reset_done() callbacks. > >> > >> If a device may be affected by BAR movement, the BAR changes tracking must > >> be implemented in its driver. > >> > >> Signed-off-by: Sergey Miroshnichenko <s.miroshnichenko@xxxxxxxxx> > >> --- > >> .../admin-guide/kernel-parameters.txt | 6 +++ > >> drivers/pci/pci.c | 2 + > >> drivers/pci/probe.c | 43 +++++++++++++++++++ > >> include/linux/pci.h | 1 + > >> 4 files changed, 52 insertions(+) > >> > >> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > >> index 64a3bf54b974..f8132a709061 100644 > >> --- a/Documentation/admin-guide/kernel-parameters.txt > >> +++ b/Documentation/admin-guide/kernel-parameters.txt > >> @@ -3311,6 +3311,12 @@ > >> bridges without forcing it upstream. Note: > >> this removes isolation between devices and > >> may put more devices in an IOMMU group. > >> + pcie_movable_bars Arrange a space at runtime for BARs of > >> + hotplugged PCIe devices - usable if bootloader > >> + doesn't reserve memory regions for them. Freeing > >> + a space may require moving BARs of active devices > >> + to higher addresses, so device drivers will be > >> + paused during rescan. > >> > >> pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power > >> Management. > >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c > >> index 1835f3a7aa8d..5f07a59b5924 100644 > >> --- a/drivers/pci/pci.c > >> +++ b/drivers/pci/pci.c > >> @@ -6105,6 +6105,8 @@ static int __init pci_setup(char *str) > >> pci_add_flags(PCI_SCAN_ALL_PCIE_DEVS); > >> } else if (!strncmp(str, "disable_acs_redir=", 18)) { > >> disable_acs_redir_param = str + 18; > >> + } else if (!strncmp(str, "pcie_movable_bars", 17)) { > >> + pci_add_flags(PCI_MOVABLE_BARS); > >> } else { > >> printk(KERN_ERR "PCI: Unknown option `%s'\n", > >> str); > >> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c > >> index 201f9e5ff55c..bdaafc48dc4c 100644 > >> --- a/drivers/pci/probe.c > >> +++ b/drivers/pci/probe.c > >> @@ -3138,6 +3138,45 @@ unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge) > >> return max; > >> } > >> > >> +/* > >> + * Put all devices of the bus and its children to reset > >> + */ > >> +static void pci_bus_reset_prepare(struct pci_bus *bus) > >> +{ > >> + struct pci_dev *dev; > >> + > >> + list_for_each_entry(dev, &bus->devices, bus_list) { > >> + struct pci_bus *child = dev->subordinate; > >> + > >> + if (child) { > >> + pci_bus_reset_prepare(child); > >> + } else if (dev->driver && > >> + dev->driver->err_handler && > >> + dev->driver->err_handler->reset_prepare) { > >> + dev->driver->err_handler->reset_prepare(dev); > >> + } > > > > What about devices with drivers that don't have reset_prepare()? It > > looks like it will just reconfigure them anyway. Is that right? > > > > It is possible that unprepared driver without these hooks will get BARs > moved, I should put a warning message there. There three ways we can see > to make this safe: > - add the reset_prepare()/reset_done() hooks to *every* PCIe driver; > - refuse BAR movement if at least one unprepared driver has been > encountered during rescan; > - reduce the number of drivers which can be affected to some > controllable value and prepare them on demand. > > Applying the second proposal as a major restriction seems fairly > reasonable, but for our particular setups and use-cases it is probably > too stiff: > - we've noticed that devices connected directly to the root bridge > don't get moved BARs, and this covers our x86_64 servers: we only > insert/remove devices into "second-level" and "lower" bridges there, but > not root; > - on PowerNV we have system devices (network interfaces, USB hub, etc.) > grouped into dedicated domain, with all other domains ready for hotplug, > and only these domains can be rescanned. > > With our scenarios currently reduced to these two, we can live with just > a few drivers "prepared" for now: NVME and few Ethernet adapters, this > gives us a possibility to use this feature before "converting" *all* the > drivers, and even have the NVidia cards running on a closed proprietary > driver. > > Should we make this behavior adjustable with something like > "pcie_movable_bars=safe" and "pcie_movable_bars=always" ? I like the overall idea of this a lot. - Why do we need a command line parameter to enable this? Can't we do it unconditionally and automatically when it's possible? We could have a chicken switch to *disable* it in case this breaks something horribly, but I would like this functionality to be always available without a special option. - I'm not sure the existence of .reset_done() is evidence that a driver is prepared for its BARs to move. I don't see any documentation that refers to BAR movement, and I doubt it's been tested. But I only see 5 implementations in the tree, so it'd be easy to verify. - I think your second proposal above sounds right: we should regard any device whose driver lacks .reset_done() as immovable. We will likely be able to move some devices but not others. Implementing .reset_done() will increase flexibility but it shouldn't be a requirement for all drivers. > >> + } > >> +} > >> + > >> +/* > >> + * Complete the reset of all devices for the bus and its children > >> + */ > >> +static void pci_bus_reset_done(struct pci_bus *bus) > >> +{ > >> + struct pci_dev *dev; > >> + > >> + list_for_each_entry(dev, &bus->devices, bus_list) { > >> + struct pci_bus *child = dev->subordinate; > >> + > >> + if (child) { > >> + pci_bus_reset_done(child); > >> + } else if (dev->driver && dev->driver->err_handler && > >> + dev->driver->err_handler->reset_done) { > >> + dev->driver->err_handler->reset_done(dev); > >> + } > >> + } > >> +} > >> + > >> /** > >> * pci_rescan_bus - Scan a PCI bus for devices > >> * @bus: PCI bus to scan > >> @@ -3151,8 +3190,12 @@ unsigned int pci_rescan_bus(struct pci_bus *bus) > >> { > >> unsigned int max; > >> > >> + if (pci_has_flag(PCI_MOVABLE_BARS)) > >> + pci_bus_reset_prepare(bus); > >> max = pci_scan_child_bus(bus); > >> pci_assign_unassigned_bus_resources(bus); > >> + if (pci_has_flag(PCI_MOVABLE_BARS)) > >> + pci_bus_reset_done(bus); > >> pci_bus_add_devices(bus); > >> > >> return max; > >> diff --git a/include/linux/pci.h b/include/linux/pci.h > >> index 6925828f9f25..a8cb1a367c34 100644 > >> --- a/include/linux/pci.h > >> +++ b/include/linux/pci.h > >> @@ -847,6 +847,7 @@ enum { > >> PCI_ENABLE_PROC_DOMAINS = 0x00000010, /* Enable domains in /proc */ > >> PCI_COMPAT_DOMAIN_0 = 0x00000020, /* ... except domain 0 */ > >> PCI_SCAN_ALL_PCIE_DEVS = 0x00000040, /* Scan all, not just dev 0 */ > >> + PCI_MOVABLE_BARS = 0x00000080, /* Runtime BAR reassign after hotplug */ > >> }; > >> > >> /* These external functions are only available when PCI support is enabled */ > >> -- > >> 2.17.1 > >>