On Fri, Sep 28, 2018 at 06:28:02PM -0500, Bjorn Helgaas wrote: > On Fri, Sep 28, 2018 at 03:35:23PM -0600, Keith Busch wrote: > > The assumption I'm making (which I think is a safe assumption with > > general consensus) is that errors detected on an end point or an upstream > > port happened because of something wrong with the link going upstream: > > end devices have no other option, > > Is this really true? It looks like "Internal Errors" (sec 6.2.9) may > be unrelated to a packet or event (though they are supposed to be > associated with a PCIe interface). > > It says the only method of recovering is reset or hardware > replacement. It doesn't specify, but it seems like a FLR or similar > reset might be appropriate and we may not have to reset the link. That is an interesting case we might want to handle better. I've a couple concerns to consider for implementing: We don't know an ERR_FATAL occured for an internal reason until we read the config register across the link, and the AER driver historically avoided accessing potentially unhealthy links. I don't *think* it's harmful to attempt reading the register, but we'd just need to check for an "all 1's" completion before trusting the result. The other issue with trying to use FLR is a device may not implement it, so pci reset has fallback methods depending on the device's capabilities. We can end up calling pci_parent_bus_reset(), which does the same secondary bus reset that already happens as part of error recovery. We'd just need to make sure affected devices and drivers have a chance to be notified (which is the this patch's intention). > Getting back to the changelog, "error handling can only run on > bridges" clearly doesn't refer to the driver callbacks (since those > only apply to endpoints). Maybe "error handling" refers to the > reset_link(), which can only be done on a bridge? Yep, referring to how link reset_link is only sent from bridges. > That would make sense to me, although the current code may be > resetting more devices than necessary if Internal Errors can be > handled without a link reset. That sounds good, I'll test some scenarios out here.