Re: pci_error_handlers for pci_host_bridge ?

Subrahmanya Lingappa <l.subrahmanya@xxxxxxxxxxxxxx> · Fri, 8 Jun 2018 17:27:12 +0530

Bjorn,

On Thu, Jun 7, 2018 at 7:29 PM, Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> On Thu, Jun 07, 2018 at 04:20:10PM +0530, poza@xxxxxxxxxxxxxx wrote:
>> On 2018-06-07 15:45, Subrahmanya Lingappa wrote:
>> > Bjorn,
>> >
>> > On Wed, Jun 6, 2018 at 6:20 PM, Bjorn Helgaas <helgaas@xxxxxxxxxx>
>> > wrote:
>> > > Hi Subrahmanya,
>> > >
>> > > On Wed, Jun 06, 2018 at 05:57:17PM +0530, Subrahmanya Lingappa wrote:
>> > > > Hi,
>> > > > according to https://github.com/torvalds/linux/blob/master/Documentation/PCI/pci-error-recovery.txt
>> > > >
>> > > > as part of AER handling, struct pci_error_handlers{} is
>> > > > implemented by
>> > > > endpoint drivers to handle device specific recovery steps for "struct
>> > > > pci_driver".
>> > > >
>> > > > But we have a platform_driver which implements "struct
>> > > > pci_host_bridge" which also supports AER capability how can we
>> > > > support
>> > > > pci_error_handlers() for the host bridge drivers ?
>> > >
>> > > I assume you're referring to Mobiveil.  Can you explain more of the
>> > > topology here?  Can you also include "sudo lspci -vv" output?
>> > >
>> > Yes, it is for Mobiveil's Host bridge driver :
>> > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/pci/host/pcie-mobiveil.c
>> > lspci output is now is not available, I'll try to get it sooner.
>> >
>> > we have an endpoint connected directly to Rootport as follows
>> > RC Rootport ----->BUS------> EP
>> >
>> > > The AER capability is an optional capability of PCIe device functions.
>> > > A host bridge is not itself a PCIe function; it's a bridge between a
>> > > platform-specific host bus and the PCIe bus.
>> > >
>> > > Sometimes there is a PCI function that corresponds to the host bridge,
>> > > but that's not required by the PCI specs and there is no generic
>> > > programming model for it.
>> > >
>> > > If you have an PCIe function corresponding to the Mobiveil host
>> > > bridge, and it has an AER capability, what would you want the error
>> > > handlers to do?  This function would not normally be a Root Port or
>> > > other type 1 PCI-to-PCI bridge device, so it's not clear how its AER
>> > > would integrate with the PCIe hierarchy.
>> > >
>> > Yes we do have a PCI function with AER capability, after an AER
>> > reported by EP, AER driver initiates an hot_reset on subordinate
>> > bus, which happens to be downstream port for RC. So we get a
>> > downstream port link down happens in this case RC driver needs to
>> > follow a specific register restore sequence, which is most of the
>> > HW specific initialization done in probe function of the driver
>> > to recover properly.
>
> I'm still not quite getting your point.  Let's try a concrete example.
> Assume this topology (this is what you described above):
>
>   00:00.0 Root Port to [bus 01]
>   01:00.0 Endpoint
>
> Then we have this situation:
>
>   - AER driver binds to Root Port 00:00.0
>   - Endpoint 01:00.0 detects an error and sends ERR_* message upstream
>   - Root Port 00:00.0 receives ERR_* message
>   - Root Port 00:00.0 generates AER interrupt
>   - AER driver handles the interrupt and does a secondary bus reset on
>     Root Port 00:00.0.  This resets the Endpoint (01:00.0) but not the
>     Root Port itself.
>
> It sounds like you need to do an RC register restore sequence at this
> point, i.e., something like mobiveil_host_init()?  That sounds like a
> hardware defect.  Doing a secondary bus reset via the Root Port's
> bridge control register should have no effect on the RC itself.
>
> What am I missing?
Yes, in this situation PCI config registers are fine, but due to a bug
RC, config space registers will get reset
in the SW we need to call the mobiveil_host_init() again to recover in
case of directly connected EP to the Root-port.

We do get an local interrupt when this problem occurs, so this can be
handled in Driver's ISR;
or is there a better way to handle it ?

Thanks.

>
>> Are you looking at something similar to pci_error_handlers to be called for
>> your RC driver ?
>> where probably you are expecting during ERR_NONFATAL recovery you would want
>> to restore some of your platform
>> specific registers. I dont think that support exists now. since
>> pci_error_handlers is of struct pci_driver
>> while yours is platform driver.
>>
>> Although please also note that ERR_FATAL is no more handled with error and
>> recovery callbacks.
>> that are just going to be handled with removal, re-enumeration of the
>> devices.
>>
>> but I suppose in any case you want to restore the registers in any type of
>> uncorrectable error.
>>
>> although this is really platform specific, some sort of quirk I cna think
>> of, but again err.c has to check
>> that quirk's existence and calls platform specific callback
>> (that again I am not sure because such things do not exist with respect to
>> error/recovery callbacks)
>>
>> Yeah just re-thinking, this is too specific, not to be addressed by generic
>> framework I think.
>>
>> > I was wondering if this can be handled by using AER error handlers, or
>> > would suggest a better way to handle this ?
>> >
>> > As of now plan is to handle this situation is by calling a minimal
>> > probe recovery sequence after link down interrupt within the
>> > driver interrupt service routine.
>>
>> Well I think that is a better place,
>> I was wondering why are you loosing registers at the first point ?
>> Is because of link down even you are loosing them ? some issue with hw !
>>
>>
>> Lets hear from Bjorn anyway, I am curious.
>>
>> >
>> > > Bjorn
>> >
>> > Thanks,