On Thu, Oct 07, 2021 at 06:00:20PM -0500, Bjorn Helgaas wrote: > On Sat, Jul 31, 2021 at 02:39:01PM +0200, Lukas Wunner wrote: > > The issue isn't DPC-specific, it also occurs when an error is handled by > > AER through aer_root_reset(). So while the issue was noticed only now, > > it's been around since 2006 when AER support was first introduced. > > > > Fixes: 6c2b374d7485 ("PCI-Express AER implemetation: AER core and aerdriver") [...] > > Cc: stable@xxxxxxxxxxxxxxx # v2.6.19+: ba952824e6c1: PCI/portdrv: Report reset for frozen channel > > Have you tried backporting this to v2.6.19? I bet it's tough. Not > sure we should suggest that stable pick this up unless there's a > reasonable path to do that. The oldest kernel.org stable release that still receives updates is v4.4. There may be older distribution kernels out there which continue to be supported. We don't know for sure, so I think it's customary to tag the release that introduced an issue and leave it to the stable and distribution maintainers to choose what they backport. It's true that patches often do not apply cleanly to older releases. There are some good folks who regularly take up the thankless task of backporting (Sudip Mukherjee to name but one), but I've also frequently backported patches myself where necessary. > > +config PCI_ERROR_RECOVERY > > + def_bool PCIEAER || EEH > > I'm having a hard time connecting this to the code. [...] > But this still seems like it's kind of dangling. It's not obvious to > me why pciehp_slot_reset() should be inside that #ifdef. Do we need > the #ifdef for a functional reason, or is it there because we know it > will never be called? If the latter, I don't think the savings are > worth it. The motivation for the #ifdef was merely to reduce code size if neither of PCIEAER or EEH is enabled. Happy to remove it. I have a different question though. We've often discussed deprecating portdrv and moving port services to the core instead. I've stumbled across commit a39bd851dccf ("PCI/PM: Clear PCIe PME Status bit in core, not PCIe port driver"), wherein you moved pcie_pme_root_status_cleanup() from portdrv to the core, which allowed you to drop the ->resume_noirq callback from portdrv. I've been thinking what moving port services to the core would look like and what your vision for that might be. If that commit is any indication, it seems you'd probably prefer that pciehp_slot_reset() (as introduced in the present patch) is called directly from report_slot_reset() in drivers/pci/pcie/err.c. That would eliminate much of the plumbing contained in the present patch to allow each port service driver to define a ->slot_reset callback and iterate over those callbacks. pciehp is probably the only port service which requires special handling upon ->slot_reset and the same goes for a lot of the error handling and power management callbacks defined by port services. So all that plumbing is probably just a very roundabout way of doing things. When calling into port service code from core code, one needs to find the port service's driver data. For now we can resort to pcie_port_find_device(), but I imagine once we move all port services to the core, we'll be able to access their data directly from struct pci_dev, e.g. via pdev->hotplug or pdev->pcie->hotplug pointers. Does that make sense to you? Thanks, Lukas