On Mon, Oct 01, 2018 at 09:14:51AM -0600, Keith Busch wrote:
> On Fri, Sep 28, 2018 at 06:28:02PM -0500, Bjorn Helgaas wrote:
> > On Fri, Sep 28, 2018 at 03:35:23PM -0600, Keith Busch wrote:
> > > The assumption I'm making (which I think is a safe assumption with
> > > general consensus) is that errors detected on an end point or an
> > > upstream port happened because of something wrong with the link going
> > > upstream: end devices have no other option,
> >
> > Is this really true?  It looks like "Internal Errors" (sec 6.2.9) may
> > be unrelated to a packet or event (though they are supposed to be
> > associated with a PCIe interface).
> >
> > It says the only method of recovering is reset or hardware
> > replacement.  It doesn't specify, but it seems like a FLR or similar
> > reset might be appropriate and we may not have to reset the link.
>
> That is an interesting case we might want to handle better.  I've a
> couple of concerns to consider for implementing:
>
> We don't know an ERR_FATAL occurred for an internal reason until we read
> the config register across the link, and the AER driver historically
> avoided accessing potentially unhealthy links.  I don't *think* it's
> harmful to attempt reading the register, but we'd just need to check for
> an "all 1's" completion before trusting the result.
>
> The other issue with trying to use FLR is a device may not implement it,
> so pci reset has fallback methods depending on the device's capabilities.
> We can end up calling pci_parent_bus_reset(), which does the same
> secondary bus reset that already happens as part of error recovery.
> We'd just need to make sure affected devices and drivers have a chance
> to be notified (which is this patch's intention).
>
> > Getting back to the changelog, "error handling can only run on
> > bridges" clearly doesn't refer to the driver callbacks (since those
> > only apply to endpoints).  Maybe "error handling" refers to the
> > reset_link(), which can only be done on a bridge?
>
> Yep, referring to how reset_link() is only sent from bridges.
>
> > That would make sense to me, although the current code may be
> > resetting more devices than necessary if Internal Errors can be
> > handled without a link reset.
>
> That sounds good, I'll test some scenarios out here.

The main point here is that we call the driver callbacks for every
device that might be reset.  If that set of devices is larger than
strictly necessary, that's an opportunity for future optimization,
which we can defer for now.
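Just so we're talking about the same thing, I assume the "all 1's" check
you mention would look roughly like the sketch below (untested, names
made up for illustration, not taken from the patch):

#include <linux/pci.h>

/*
 * Sketch only: read the AER Uncorrectable Error Status register across
 * a link that may be down, and refuse to trust an "all 1's" completion.
 */
static bool sketch_read_uncor_status(struct pci_dev *dev, u32 *status)
{
        int aer = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);

        if (!aer)
                return false;

        pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);

        /*
         * An all 1's completion usually means the read never reached
         * the device (link down or device gone), so the value read
         * back is meaningless.
         */
        if (*status == (u32)~0)
                return false;

        return true;
}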
Here's my proposal for the changelog.  Let me know what I screwed up.

commit 1f7d2967334433d885c0712b8ac3f073f20211ee
Author: Keith Busch <keith.busch@xxxxxxxxx>
Date:   Thu Sep 20 10:27:13 2018 -0600

    PCI/ERR: Run error recovery callbacks for all affected devices

    If an Endpoint reported an error with ERR_FATAL, we previously ran
    driver error recovery callbacks only for the Endpoint's driver.  But
    if we reset a Link to recover from the error, all downstream
    components are affected, including the Endpoint, any multi-function
    peers, and children of those peers.

    Initiate the Link reset from the deepest Downstream Port that is
    reliable, and call the error recovery callbacks for all its children.

    If a Downstream Port (including a Root Port) reports an error, we
    assume the Port itself is reliable and we need to reset its
    downstream Link.

    In all other cases (Switch Upstream Ports, Endpoints, Bridges, etc),
    we assume the Link leading to the component needs to be reset, so we
    initiate the reset at the parent Downstream Port.

    This allows two other clean-ups.  First, we currently only use a Link
    reset, which can only be initiated using a Downstream Port, so we can
    remove checks for Endpoints.  Second, the Downstream Port where we
    initiate the Link reset is reliable (unlike the device that reported
    the error), so the special cases for error detect and resume are no
    longer necessary.

    Signed-off-by: Keith Busch <keith.busch@xxxxxxxxx>
    [bhelgaas: changelog]
    Signed-off-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
    Reviewed-by: Sinan Kaya <okaya@xxxxxxxxxx>
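And just to make the "call the error recovery callbacks for all its
children" part of the changelog concrete, I picture the notification
side as roughly the following, walking the subordinate bus of the
Downstream Port whose Link gets reset (again an untested sketch with
made-up names, not the actual patch):

#include <linux/pci.h>

/* Sketch: invoked for every device below the port being reset. */
static int sketch_notify_error_detected(struct pci_dev *dev, void *data)
{
        const struct pci_error_handlers *err_handler;

        device_lock(&dev->dev);
        err_handler = dev->driver ? dev->driver->err_handler : NULL;

        /*
         * Real code would also merge the pci_ers_result_t return values;
         * this sketch only shows that every child gets the callback.
         */
        if (err_handler && err_handler->error_detected)
                err_handler->error_detected(dev, pci_channel_io_frozen);

        device_unlock(&dev->dev);
        return 0;
}

/* Sketch: "bridge" is the Downstream Port whose Link will be reset. */
static void sketch_notify_subtree(struct pci_dev *bridge)
{
        if (bridge->subordinate)
                pci_walk_bus(bridge->subordinate,
                             sketch_notify_error_detected, NULL);
}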