Re: sa8540p-ride crash when all PCI buses are disabled

Radu Rendec <rrendec@xxxxxxxxxx> · Wed, 16 Aug 2023 12:25:50 -0400

On Tue, 2023-08-15 at 11:54 +0100, Bryan O'Donoghue wrote:
> On 14/08/2023 23:36, Radu Rendec wrote:
> > I'm consistently getting a system crash followed by a ramdump on
> > sa8540p-ride (sc8280xp) when icc_sync_state() goes all the way through
> > (count == providers_count).
> > 
> > Context: all PCIe buses are disabled due to [1]. Previously, due to
> > some local kernel misconfiguration, icc_sync_state() never really did
> > anything (because count was always less than providers_count).
> > 
> > I was able to isolate the problem to the qns_pcie_gem_noc icc node.
> > What happens is that both avg_bw and peak_bw for this node end up as 0
> > after aggregate_requests() gets called. The request list associated
> > with the node is empty.
> 
> If all PCIe buses are disabled, then of course the bandwidth requests
> should say zero, the clocks should be disabled and any associated 
> regulators should be off.
> 
> > For testing purposes, I modified icc_sync_state() to skip calling
> > aggregate_requests() and subsequently p->set(n, n) for that particular
> > node only. With that change in place, the system no longer crashes.
> 
> So what's happening is that a bus master in the system, perhaps not the
> application processor, is issuing a transaction to a register that is
> most likely not clocked/powered.

Yes, that was my assumption as well. But I didn't think it could be
something other than the AP. That is an interesting perspective.

My first thought was to analyze the ramdump and hopefully find some
clues there. But unfortunately that doesn't seem to be an option with
the tools that I have.

> Have you considered that one of the downstream devices might be causing 
> a PCIe bus transaction ?

No, I haven't considered that. If that's the case, it will probably be
even harder to debug.

> If you physically remove - can you physically remove - devices from the 
> PCIe bus does this error still occur ?

This is a standard QDrive 3 reference board, so I think this is not an
option. Taking those things apart is very difficult, and I think all
peripherals are soldered onto the board anyway.

> > Surprisingly, none of the icc nodes that link to qns_pcie_gem_noc (e.g.
> > xm_pcie3_0, xm_pcie3_1, etc.) has any associated request and so they
> > all have 0 bandwidth after aggregate_requests() gets called, but that
> > doesn't seem to be a problem and the system is stable. This makes me
> > think there is a missing link somewhere, and something doesn't claim
> > any bandwidth on qns_pcie_gem_noc when it should. And it's probably
> > none of the xm_pcie3_* nodes, since setting their bandwidth to 0 seems
> > to be fine.
> 
> Yes, so if you assume that the AP/kernel side has the right references,
> counts and votes, then consider that another bus master (a thing that
> can initiate a read or a write) might be misbehaving.

There is one thing I wasn't aware of when I wrote the previous email.
As it turns out, bandwidth/clock control is done at the bcm level, not
at the icc node level. It looks like there is a single bcm called PCI0,
and it's linked to the qns_pcie_gem_noc node. The xm_pcie3_* icc nodes
are not linked to any bcm.

This means that *all* PCIe buses are shut down when qns_pcie_gem_noc is
disabled due to zero bandwidth. I was under the (wrong) impression
that, since all xm_pcie3_* nodes had no requests, each corresponding
PCIe bus would be shut down separately, leaving only qns_pcie_gem_noc
active (with my test change in place).
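
To make that concrete, here is a rough user-space model of how I now
understand the voting. It is only a sketch: the struct and function
names are made up for illustration, and taking the maximum over the
linked nodes is my simplification, not the actual icc-rpmh code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical, simplified model. This is NOT the kernel's
 * struct icc_node or struct qcom_icc_bcm.
 */
struct model_node {
	const char *name;
	const char *bcm;	/* bcm this node votes through, NULL if none */
	uint32_t avg_bw;	/* aggregated bandwidth of the node */
};

/*
 * Compute the vote for one bcm. Only nodes linked to that bcm
 * contribute; nodes with no bcm (like xm_pcie3_* in this model)
 * are ignored, whatever bandwidth they carry.
 */
uint32_t model_bcm_vote(const struct model_node *nodes, size_t count,
			const char *bcm)
{
	uint32_t vote = 0;
	size_t i;

	assert(nodes != NULL && bcm != NULL);

	for (i = 0; i < count; i++) {
		if (!nodes[i].bcm || strcmp(nodes[i].bcm, bcm) != 0)
			continue;
		if (nodes[i].avg_bw > vote)
			vote = nodes[i].avg_bw;
	}

	return vote;
}
```

In this model, with avg_bw == 0 on qns_pcie_gem_noc (the only node
linked to PCI0), the PCI0 vote is 0 no matter what the xm_pcie3_* nodes
carry, which is consistent with the whole PCIe domain going down at
once.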

> Assuming there is no misbehaving arm core - say a cDSP or aDSP piece of 
> code that wants to do something on the PCIe bus, might the culprit be
> whatever you have connected to the bus ?
> 
> Could something be driving the WAKE# signal and then transacting?
> 
> But also keep in mind depending on what you are doing with this system 
> if you have a bit of firmware in one of the DSP cores - does that 
> firmware have scope to talk to any devices on the PCIe bus ?

As I mentioned above, this is a standard QDrive 3 reference board.
Furthermore, I don't explicitly do anything with the DSPs. I just boot
a fairly recent upstream kernel (6.5-rc1) with a standard rootfs. The
boot firmware is whatever Qualcomm provides by default for these
systems. So, unless the boot firmware loads anything into the DSPs
behind my back (which I doubt), the DSPs should not even be running.

What is more likely though is that the boot firmware initializes a
bunch of PCIe devices and leaves them on.

> I'd guess another firmware is unlikely, but a downstream device doing a
> WAKE# when you have the PCIe nodes disabled would presumably be bad.
> 
> Try looking for an upstream transaction from a device..

Yes, that makes sense. Do you have any suggestions on how to do that
without using any specialized hardware (such as a JTAG pod or a PCIe bus
analyzer)?

Thanks for all the input and suggestions!

--
Radu