Re: [PATCH v3 2/4] PCI: brcmstb: Add ACPI config space quirk

Pali Rohár <pali@xxxxxxxxxx> · Fri, 22 Oct 2021 19:57:06 +0200

On Friday 22 October 2021 10:29:48 Florian Fainelli wrote:
> On 10/22/21 10:17 AM, Pali Rohár wrote:
> > On Friday 22 October 2021 10:04:36 Florian Fainelli wrote:
> >> On 10/5/21 7:07 PM, Florian Fainelli wrote:
> >>>
> >>>
> >>> On 10/5/2021 3:25 PM, Jeremy Linton wrote:
> >>>> Hi,
> >>>>
> >>>> On 10/5/21 2:43 PM, Pali Rohár wrote:
> >>>>> Hello!
> >>>>>
> >>>>> On Tuesday 05 October 2021 10:57:18 Jeremy Linton wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> On 10/5/21 10:32 AM, Bjorn Helgaas wrote:
> >>>>>>> On Thu, Aug 26, 2021 at 02:15:55AM -0500, Jeremy Linton wrote:
> >>>>>>>> Additionally, some basic bus/device filtering exist to avoid sending
> >>>>>>>> config transactions to invalid devices on the RP's primary or
> >>>>>>>> secondary bus. A basic link check is also made to assure that
> >>>>>>>> something is operational on the secondary side before probing the
> >>>>>>>> remainder of the config space. If either of these constraints are
> >>>>>>>> violated and a config operation is lost in the ether because an EP
> >>>>>>>> doesn't respond an unrecoverable SERROR is raised.
> >>>>>>>
> >>>>>>> It's not "lost"; I assume the root port raises an error because it
> >>>>>>> can't send a transaction over a link that is down.
> >>>>>>
> >>>>>> The problem is AFAIK because the root port doesn't do that.
> >>>>>
> >>>>> Interesting! Does it mean that PCIe Root Complex / Host Bridge (which I
> >>>>> guess contains also logic for Root Port) does not signal transaction
> >>>>> failure for config requests? Or it is just your opinion? Because I'm
> >>>>> dealing with similar issues and I'm trying to find a way how to detect
> >>>>> if some PCIe IP signal transaction error via AXI SLVERR response OR it
> >>>>> just does not send any response back. So if you know some way how to
> >>>>> check which one it is, I would like to know it too.
> >>>>
> >>>> This is my _opinion_ based on what I've heard of some other IP
> >>>> integration issues, and what i've seen poking at this one from the
> >>>> perspective of a SW guy rather than a HW guy. So, basically worthless.
> >>>> But, you should consider that most of these cores/interconnects aren't
> >>>> aware of PCIe completion semantics so its the root ports
> >>>> responsibility to say, gracefully translate a non-posted write that
> >>>> doesn't have a completion for the interconnects its attached to,
> >>>> rather than tripping something generic like a SLVERR.
> >>>>
> >>>> Anyway, for this I would poke around the pile of exception registers,
> >>>> with your specific processors manual handy because a lot of them are
> >>>> implementation defined.
> >>>
> >>> I should be able to get you an answer in the new few days whether
> >>> configuration space requests also generate an error towards the ARM CPU,
> >>> since memory space requests most definitively do.
> >>
> >> Did not get an answer from the design team, but going through our bug
> >> tracker, there were evidences of configuration space accesses also
> >> generating external aborts:
> >>
> >> [    8.988237] Unhandled fault: synchronous external abort (0x96000210) at 0xffffff8009539004
> >> [    9.026698] PC is at pci_generic_config_read32+0x30/0xb0
> > 
> > So this is error caused by reading from config space.
> > 
> > Can you check if also writing to config space can trigger some crash? If
> > yes, I would like to know if write would be also synchronous or rather
> > asynchronous abort.
> 
> Yes it does and AFAICT it always shows up as a system error interrupt,
> here is an example:
> 
> # setpci -d *:* latency_timer=40
> [   25.909644] SError Interrupt on CPU2, code 0xbf000002 -- SError
> [   25.909652] pc : pci_user_write_config_byte+0x6c/0x78
> [   25.909706] Kernel panic - not syncing: Asynchronous SError Interrupt

OK! So writing to config space causes an asynchronous abort.

Looking at the codes: 0x96000210 should be a Data Abort on all ARMv8
cores. 0xbf...... on ARMv8 is an SError interrupt, and the other bits
are CPU core specific. What CPU core do you have on this machine? I only
have a decoder for the A53 core, and on that core the value 0xbf000002
means "SLVERR on external access". But I guess that it also means SLVERR
on your CPU core.
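
For reference, the architectural part of that decoding can be done by
hand from the ESR_ELx layout (EC in bits [31:26], IL in bit 25, ISS in
bits [24:0]). Below is a minimal sketch; the macro and function names
are my own and are not taken from the kernel headers. It only covers
the two exception classes seen in this thread, and the meaning of the
low SError bits stays core specific.

```c
#include <stdint.h>

/* ESR_ELx layout on ARMv8-A: EC = bits [31:26], IL = bit 25, ISS = bits [24:0] */
#define ESR_EC(esr)          (((esr) >> 26) & 0x3fu)
#define ESR_IL(esr)          (((esr) >> 25) & 0x1u)
#define ESR_ISS(esr)         ((esr) & 0x01ffffffu)

#define ESR_EC_DABT_CUR      0x25u  /* Data Abort taken without a change in EL */
#define ESR_EC_SERROR        0x2fu  /* SError interrupt */

/* Data Abort: DFSC = ISS[5:0]; 0x10 = synchronous external abort (not on a table walk) */
#define ESR_DFSC(esr)        ((esr) & 0x3fu)
/* SError: IDS = ISS[24]; when set, ISS[23:0] is implementation defined */
#define ESR_SERROR_IDS(esr)  (((esr) >> 24) & 0x1u)

/* Hypothetical helper: name only the two exception classes seen in this thread. */
static const char *esr_class_name(uint32_t esr)
{
	switch (ESR_EC(esr)) {
	case ESR_EC_DABT_CUR:
		return "Data Abort (current EL)";
	case ESR_EC_SERROR:
		return "SError interrupt";
	default:
		return "other";
	}
}
```

With these, 0x96000210 gives EC 0x25 and DFSC 0x10, and 0xbf000002 gives
EC 0x2f with IDS set, so its low bits (0x000002) need the core's own
documentation.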

I am seeing exactly the same behavior with the PCIe controller on the
A3720 SoC, which has an A53 core. It looks like the PCIe controller
translates PCIe CA and UR responses to AXI SLVERR responses, which are
delivered to the CPU, and the kernel just sees these fatal error
interrupts. The same issue affects not only config requests but also
memory read / write commands.

In my case the PCIe controller really receives a response (no timeout
occurs) from the PCIe core (which probably times out because it cannot
send a message when the link is down), but instead of translating it to
SLVOK with a fabricated 0xffffffff response, it sends that fatal SLVERR
to the CPU.

I was told that the fix for this kind of issue is to "reconfigure" the
PCIe controller to never send SLVERR to the CPU and instead fabricate a
0xffffffff SLVOK response. It should be configurable in the PCIe wrapper
or PCIe glue IP which connects the CPU / AXI to the PCIe core.
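
For comparison, the software-side filtering described in the quoted
patch (a link check before touching config space) amounts to fabricating
the same all-ones value in the accessor, before the access ever reaches
the hardware. A minimal sketch of that idea follows. Everything in it
(struct demo_rc, demo_cfg_read32, demo_device_present) is hypothetical
and only models the behavior; it is not code from any real driver.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CFG_ALL_ONES  0xffffffffu
#define DEMO_CFG_DWORDS    16u

/* Hypothetical controller model, not a real driver structure. */
struct demo_rc {
	bool link_up;                       /* link state below the root port */
	uint32_t ep_cfg[DEMO_CFG_DWORDS];   /* pretend config space of the one endpoint */
};

/*
 * Read one 32-bit config register (dword index) of the device on the
 * secondary bus. If the link is down, skip the access entirely and
 * return the all-ones value instead of risking an SLVERR.
 */
static uint32_t demo_cfg_read32(const struct demo_rc *rc, unsigned int dword)
{
	if (!rc->link_up || dword >= DEMO_CFG_DWORDS)
		return DEMO_CFG_ALL_ONES;
	return rc->ep_cfg[dword];
}

/* Callers treat Vendor ID 0xffff (low 16 bits of dword 0) as "no device". */
static bool demo_device_present(const struct demo_rc *rc)
{
	return (demo_cfg_read32(rc, 0) & 0xffffu) != 0xffffu;
}
```

Note that such a check is inherently racy: the link can go down between
the check and the real access, which is why doing the translation in the
controller itself is the more robust fix.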

I do not know if there is any way to "ignore" these SLVERR responses
sent from the PCIe controller to the CPU.