On Tue, Sep 04, 2018 at 12:33:09PM -0600, Jon Derrick wrote:
> During probe, the port driver will disable error reporting and assumes
> it will be enabled later by the AER driver's pci_walk_bus() sequence.
> This may not be the case for host-bridge enabled root ports, who will
> enable first error reporting on the bus during the root port probe, and
> then disable error reporting on downstream devices during subsequent
> probing of the bus.

I understand the hotplug case (see below), but help me understand this
"host-bridge enabled root ports" thing.  I'm not sure what that means.

We run pcie_portdrv_probe() for every root port, switch upstream port,
and switch downstream port, and it always disables error reporting for
the port:

  pcie_portdrv_probe                      # pci_driver .probe
    pcie_port_device_register
      get_port_device_capability
        services |= PCIE_PORT_SERVICE_AER
        pci_disable_pcie_error_reporting  # clear DEVCTL Error Reporting Enables

For root ports, we call aer_probe(), and it enables error reporting for
the entire tree below the root port:

  aer_probe                               # pcie_port_service .probe
    aer_enable_rootport
      set_downstream_devices_error_reporting(dev, true)
        pci_walk_bus(dev->subordinate, set_device_error_reporting)
          set_device_error_reporting
            if (Root Port || Upstream Port || Downstream Port)
              pci_enable_pcie_error_reporting  # set DEVCTL Error Reporting Enables

This is definitely broken for hot-added switches because aer_probe()
is the only place we enable error reporting, and it's only run when we
enumerate a root port, not when we hot-add things below that root
port.

> A hotplugged port device may also fail to enable error reporting as the
> AER driver has already run on the root bus.
> Check for these conditions and enable error reporting during portdrv
> probing.
>
> Example case:

pcie_portdrv_probe(10000:00:00.0):
> [ 343.790573] pcieport 10000:00:00.0: pci_disable_pcie_error_reporting

aer_probe(10000:00:00.0):
> [ 343.809812] pcieport 10000:00:00.0: pci_enable_pcie_error_reporting
> [ 343.819506] pci 10000:01:00.0: pci_enable_pcie_error_reporting
> [ 343.828814] pci 10000:02:00.0: pci_enable_pcie_error_reporting
> [ 343.838089] pci 10000:02:01.0: pci_enable_pcie_error_reporting
> [ 343.847478] pci 10000:02:02.0: pci_enable_pcie_error_reporting
> [ 343.856659] pci 10000:02:03.0: pci_enable_pcie_error_reporting
> [ 343.865794] pci 10000:02:04.0: pci_enable_pcie_error_reporting
> [ 343.874875] pci 10000:02:05.0: pci_enable_pcie_error_reporting
> [ 343.883918] pci 10000:02:06.0: pci_enable_pcie_error_reporting
> [ 343.892922] pci 10000:02:07.0: pci_enable_pcie_error_reporting

pcie_portdrv_probe(10000:01:00.0):
> [ 343.918900] pcieport 10000:01:00.0: pci_disable_pcie_error_reporting

pcie_portdrv_probe(10000:02:00.0):
> [ 343.968426] pcieport 10000:02:00.0: pci_disable_pcie_error_reporting

...
> [ 344.028179] pcieport 10000:02:01.0: pci_disable_pcie_error_reporting
> [ 344.091269] pcieport 10000:02:02.0: pci_disable_pcie_error_reporting
> [ 344.156473] pcieport 10000:02:03.0: pci_disable_pcie_error_reporting
> [ 344.238042] pcieport 10000:02:04.0: pci_disable_pcie_error_reporting
> [ 344.321864] pcieport 10000:02:05.0: pci_disable_pcie_error_reporting
> [ 344.411601] pcieport 10000:02:06.0: pci_disable_pcie_error_reporting
> [ 344.505332] pcieport 10000:02:07.0: pci_disable_pcie_error_reporting
> [ 344.621824] nvme 10000:06:00.0: pci_enable_pcie_error_reporting
>
> Signed-off-by: Jon Derrick <jonathan.derrick@xxxxxxxxx>
> ---
>  drivers/pci/pcie/portdrv_core.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/drivers/pci/pcie/portdrv_core.c b/drivers/pci/pcie/portdrv_core.c
> index 7c37d81..fdd953a 100644
> --- a/drivers/pci/pcie/portdrv_core.c
> +++ b/drivers/pci/pcie/portdrv_core.c
> @@ -343,6 +343,16 @@ int pcie_port_device_register(struct pci_dev *dev)
>  	if (!nr_service)
>  		goto error_cleanup_irqs;
>
> +#ifdef CONFIG_PCIEAER
> +	/*
> +	 * Enable error reporting for this port in case AER probing has already
> +	 * run on the root bus or this port device is hot-inserted
> +	 */
> +	if (dev->aer_cap && pci_aer_available() &&
> +	    (pcie_ports_native || pci_find_host_bridge(dev->bus)->native_aer))
> +		pci_enable_pcie_error_reporting(dev);
> +#endif

I plan to apply this after we clarify the changelog a bit, but I don't
really like this patch because it (and the corresponding code added by
2bd50dd800b5 ("PCI: PCIe: Disable PCIe port services during port
initialization")) seem a little out of place.

The way I think this *should* work is that the PCI core should arrange
to handle AER interrupts when it enumerates the devices that can
generate them (Root Ports and Root Complex Event Collectors), even
before it enumerates the devices below the Root Port.  Then the PCI
core could directly enable the AER interrupts on all devices as it
enumerates them.
I would envision both cases being handled somewhere like
pci_aer_init() in pci_init_capabilities().  This would also allow us
to get rid of the pci_enable_pcie_error_reporting() calls that are
currently sprinkled around in drivers, because that would be handled
by the core for all devices.

Bjorn
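
The probe ordering in the example log can be replayed with a small
userspace model.  This is only a sketch, not kernel code: the struct
model_port type and the model_*() helpers are hypothetical names made up
for illustration, a single bool stands in for the DEVCTL Error Reporting
Enables, and the "patched" flag assumes the conditions checked by the
patch (AER capability present, AER available, native AER ownership) all
hold.  The nvme endpoint, which enables reporting itself, is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Ports in the example log: 1 root port, 1 switch upstream port,
 * 8 switch downstream ports (10000:02:00.0 through 10000:02:07.0). */
#define MODEL_NPORTS 10

/* Hypothetical stand-in for a port; 'reporting' models the DEVCTL
 * Error Reporting Enables as a single flag. */
struct model_port {
	bool reporting;
};

/* Models pcie_portdrv_probe(): the port driver always clears the
 * enables.  With 'patched' set, it turns them back on before
 * returning, assuming the patch's AER ownership checks pass. */
static void model_portdrv_probe(struct model_port *port, bool patched)
{
	port->reporting = false;
	if (patched)
		port->reporting = true;
}

/* Models aer_probe() on the root port: enable reporting on the root
 * port and on every port below it, as the bus walk does. */
static void model_aer_probe(struct model_port *ports, size_t n)
{
	for (size_t i = 0; i < n; i++)
		ports[i].reporting = true;
}

/* Replays the order seen in the log and returns how many ports end
 * up with error reporting enabled. */
static size_t model_replay(bool patched)
{
	struct model_port ports[MODEL_NPORTS] = { { false } };
	size_t enabled = 0;

	/* ports[0] is the root port: port driver first, then AER. */
	model_portdrv_probe(&ports[0], patched);
	model_aer_probe(ports, MODEL_NPORTS);

	/* The switch ports are probed after aer_probe() already ran. */
	for (size_t i = 1; i < MODEL_NPORTS; i++)
		model_portdrv_probe(&ports[i], patched);

	for (size_t i = 0; i < MODEL_NPORTS; i++)
		if (ports[i].reporting)
			enabled++;

	return enabled;
}
```

In this model, model_replay(false) leaves only the root port enabled,
which matches the log above, while model_replay(true) leaves all ten
ports enabled.  It says nothing about where the enable should live; it
only shows why an enable that runs once from aer_probe() is undone by
later port probes.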