On 14/08/2023 23:36, Radu Rendec wrote:
Hello everyone,
I'm consistently getting a system crash followed by a ramdump on
sa8540p-ride (sc8280xp) when icc_sync_state() goes all the way through
(count == providers_count).
Context: all PCIe buses are disabled due to [1]. Previously, due to
some local kernel misconfiguration, icc_sync_state() never really did
anything (because count was always less than providers_count).
I was able to isolate the problem to the qns_pcie_gem_noc icc node.
What happens is that both avg_bw and peak_bw for this node end up as 0
after aggregate_requests() gets called. The request list associated
with the node is empty.
If all PCIe buses are disabled, then of course the bandwidth requests
should say zero, the clocks should be disabled and any associated
regulators should be off.
For testing purposes, I modified icc_sync_state() to skip calling
aggregate_requests() and subsequently p->set(n, n) for that particular
node only. With that change in place, the system no longer crashes.
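For anyone following along, the effect of that test change can be modelled in
userspace roughly as below. This is only a sketch and not the kernel code: the
model_* names and structures are made up for illustration, and the "sum of
averages, max of peaks" aggregation is an assumption about how the requests are
folded together. What it does show is that a node with an empty request list
ends up at 0/0 after aggregation, unless it is explicitly skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the icc structures. */
struct model_req {
    uint32_t avg_bw;
    uint32_t peak_bw;
};

struct model_node {
    const char *name;
    const struct model_req *reqs;   /* may be NULL when nreqs == 0 */
    size_t nreqs;
    uint32_t avg_bw;                /* current aggregated vote */
    uint32_t peak_bw;
};

/* Reset the node, then fold in every request (assumed: sum avg, max peak). */
static void model_aggregate(struct model_node *n)
{
    assert(n != NULL);

    n->avg_bw = 0;
    n->peak_bw = 0;
    for (size_t i = 0; i < n->nreqs; i++) {
        n->avg_bw += n->reqs[i].avg_bw;
        if (n->reqs[i].peak_bw > n->peak_bw)
            n->peak_bw = n->reqs[i].peak_bw;
    }
}

/*
 * Model of the sync-state pass: re-aggregate every node, except the one
 * whose name matches skip (pass NULL to skip nothing).
 */
static void model_sync_state(struct model_node *nodes, size_t count,
                             const char *skip)
{
    for (size_t i = 0; i < count; i++) {
        if (skip && strcmp(nodes[i].name, skip) == 0)
            continue;   /* leave the existing vote untouched */
        model_aggregate(&nodes[i]);
    }
}
```

Calling model_sync_state() with skip set to "qns_pcie_gem_noc" leaves that
node's existing vote in place, while passing NULL drops a node with no
requests to zero bandwidth.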
So what's happening is that a bus master in the system - perhaps not the
application processor - is issuing a transaction, most likely to a
register that is not clocked/powered.
Have you considered that one of the downstream devices might be causing
a PCIe bus transaction?
If you can physically remove the devices from the PCIe bus, does this
error still occur?
Surprisingly, none of the icc nodes that link to qns_pcie_gem_noc (e.g.
xm_pcie3_0, xm_pcie3_1, etc.) has any associated request and so they
all have 0 bandwidth after aggregate_requests() gets called, but that
doesn't seem to be a problem and the system is stable. This makes me
think there is a missing link somewhere, and something doesn't claim
any bandwidth on qns_pcie_gem_noc when it should. And it's probably
none of the xm_pcie3_* nodes, since setting their bandwidth to 0 seems
to be fine.
Yes, so if you assume that the AP/kernel side has the right references,
counts and votes, then consider that another bus master - a thing that
can initiate a read or a write - might be misbehaving.
Assuming there is no misbehaving arm core - say, a cDSP or aDSP piece of
code that wants to do something on the PCIe bus - might the culprit be
whatever you have connected to the bus?
Could something be driving the WAKE# signal and then transacting?
Also keep in mind, depending on what you are doing with this system:
if you have a bit of firmware in one of the DSP cores, does that
firmware have scope to talk to any devices on the PCIe bus?
I'd guess another firmware is unlikely, but a downstream device asserting
WAKE# when you have the PCIe nodes disabled would presumably be bad.
Try looking for an upstream transaction from a device.
---
bod