On Mon, Oct 01, 2018 at 09:14:51AM -0600, Keith Busch wrote:
> On Fri, Sep 28, 2018 at 06:28:02PM -0500, Bjorn Helgaas wrote:
> > On Fri, Sep 28, 2018 at 03:35:23PM -0600, Keith Busch wrote:
> > > The assumption I'm making (which I think is a safe assumption with
> > > general consensus) is that errors detected on an end point or an
> > > upstream port happened because of something wrong with the link going
> > > upstream: end devices have no other option,
> >
> > Is this really true?  It looks like "Internal Errors" (sec 6.2.9) may
> > be unrelated to a packet or event (though they are supposed to be
> > associated with a PCIe interface).
> >
> > It says the only method of recovering is reset or hardware
> > replacement.  It doesn't specify, but it seems like a FLR or similar
> > reset might be appropriate and we may not have to reset the link.
>
> That is an interesting case we might want to handle better.  I've a
> couple of concerns to consider for implementing:
>
> We don't know an ERR_FATAL occurred for an internal reason until we read
> the config register across the link, and the AER driver historically
> avoided accessing potentially unhealthy links.  I don't *think* it's
> harmful to attempt reading the register, but we'd just need to check for
> an "all 1's" completion before trusting the result.
>
> The other issue with trying to use FLR is a device may not implement it,
> so pci reset has fallback methods depending on the device's capabilities.
> We can end up calling pci_parent_bus_reset(), which does the same
> secondary bus reset that already happens as part of error recovery.
> We'd just need to make sure affected devices and drivers have a chance
> to be notified (which is this patch's intention).
>
> > Getting back to the changelog, "error handling can only run on
> > bridges" clearly doesn't refer to the driver callbacks (since those
> > only apply to endpoints).  Maybe "error handling" refers to the
> > reset_link(), which can only be done on a bridge?
>
> Yep, referring to how reset_link() is only sent from bridges.
>
> > That would make sense to me, although the current code may be
> > resetting more devices than necessary if Internal Errors can be
> > handled without a link reset.
>
> That sounds good, I'll test some scenarios out here.

The main point here is that we call the driver callbacks for every
device that might be reset.  If that set of devices is larger than
strictly necessary, that's an opportunity for future optimization,
which we can defer for now.
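Just so we're talking about the same thing, I assume the "all 1's" check
you mention would look roughly like the sketch below (untested, names
made up for illustration, not taken from the patch):

#include <linux/pci.h>

/*
 * Sketch only: read the AER Uncorrectable Error Status register across
 * a link that may be down, and refuse to trust an "all 1's" completion.
 */
static bool sketch_read_uncor_status(struct pci_dev *dev, u32 *status)
{
        int aer = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);

        if (!aer)
                return false;

        pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);

        /*
         * An all 1's completion usually means the read never reached
         * the device (link down or device gone), so the value read
         * back is meaningless.
         */
        if (*status == (u32)~0)
                return false;

        return true;
}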
Here's my proposal for the changelog.  Let me know what I screwed up.

commit 1f7d2967334433d885c0712b8ac3f073f20211ee
Author: Keith Busch <keith.busch@xxxxxxxxx>
Date:   Thu Sep 20 10:27:13 2018 -0600

    PCI/ERR: Run error recovery callbacks for all affected devices

    If an Endpoint reported an error with ERR_FATAL, we previously ran
    driver error recovery callbacks only for the Endpoint's driver.  But
    if we reset a Link to recover from the error, all downstream
    components are affected, including the Endpoint, any multi-function
    peers, and children of those peers.

    Initiate the Link reset from the deepest Downstream Port that is
    reliable, and call the error recovery callbacks for all its children.

    If a Downstream Port (including a Root Port) reports an error, we
    assume the Port itself is reliable and we need to reset its
    downstream Link.

    In all other cases (Switch Upstream Ports, Endpoints, Bridges, etc),
    we assume the Link leading to the component needs to be reset, so we
    initiate the reset at the parent Downstream Port.

    This allows two other clean-ups.  First, we currently only use a Link
    reset, which can only be initiated using a Downstream Port, so we can
    remove checks for Endpoints.  Second, the Downstream Port where we
    initiate the Link reset is reliable (unlike the device that reported
    the error), so the special cases for error detect and resume are no
    longer necessary.

    Signed-off-by: Keith Busch <keith.busch@xxxxxxxxx>
    [bhelgaas: changelog]
    Signed-off-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
    Reviewed-by: Sinan Kaya <okaya@xxxxxxxxxx>
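And just to make the "call the error recovery callbacks for all its
children" part of the changelog concrete, I picture the notification
side as roughly the following, walking the subordinate bus of the
Downstream Port whose Link gets reset (again an untested sketch with
made-up names, not the actual patch):

#include <linux/pci.h>

/* Sketch: invoked for every device below the port being reset. */
static int sketch_notify_error_detected(struct pci_dev *dev, void *data)
{
        const struct pci_error_handlers *err_handler;

        device_lock(&dev->dev);
        err_handler = dev->driver ? dev->driver->err_handler : NULL;

        /*
         * Real code would also merge the pci_ers_result_t return values;
         * this sketch only shows that every child gets the callback.
         */
        if (err_handler && err_handler->error_detected)
                err_handler->error_detected(dev, pci_channel_io_frozen);

        device_unlock(&dev->dev);
        return 0;
}

/* Sketch: "bridge" is the Downstream Port whose Link will be reset. */
static void sketch_notify_subtree(struct pci_dev *bridge)
{
        if (bridge->subordinate)
                pci_walk_bus(bridge->subordinate,
                             sketch_notify_error_detected, NULL);
}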