Re: [PATCHv4 08/12] PCI: ERR: Always use the first downstream port

Keith Busch <keith.busch@xxxxxxxxx> · Mon, 1 Oct 2018 09:14:51 -0600

On Fri, Sep 28, 2018 at 06:28:02PM -0500, Bjorn Helgaas wrote:
> On Fri, Sep 28, 2018 at 03:35:23PM -0600, Keith Busch wrote:
> > The assumption I'm making (which I think is a safe assumption with
> > general consensus) is that errors detected on an end point or an upstream
> > port happened because of something wrong with the link going upstream:
> > end devices have no other option, 
> 
> Is this really true?  It looks like "Internal Errors" (sec 6.2.9) may
> be unrelated to a packet or event (though they are supposed to be
> associated with a PCIe interface).
> 
> It says the only method of recovering is reset or hardware
> replacement.  It doesn't specify, but it seems like a FLR or similar
> reset might be appropriate and we may not have to reset the link.

That is an interesting case we might want to handle better. I've a couple
concerns to consider for implementing:

We don't know an ERR_FATAL occured for an internal reason until we read the
config register across the link, and the AER driver historically avoided
accessing potentially unhealthy links. I don't *think* it's harmful to
attempt reading the register, but we'd just need to check for an "all 1's"
completion before trusting the result.

The other issue with trying to use FLR is a device may not implement
it, so pci reset has fallback methods depending on the device's
capabilities. We can end up calling pci_parent_bus_reset(), which does the
same secondary bus reset that already happens as part of error recovery.
We'd just need to make sure affected devices and drivers have a chance
to be notified (which is the this patch's intention).

> Getting back to the changelog, "error handling can only run on
> bridges" clearly doesn't refer to the driver callbacks (since those
> only apply to endpoints).  Maybe "error handling" refers to the
> reset_link(), which can only be done on a bridge?

Yep, referring to how link reset_link is only sent from bridges.

> That would make sense to me, although the current code may be
> resetting more devices than necessary if Internal Errors can be
> handled without a link reset.

That sounds good, I'll test some scenarios out here.