Quoting Stephen Boyd (2022-04-11 13:20:27) > > I think it's some IPA unclocked access because it's an SError async > abort. I looked some at the IPA clk implementation and it is a little > concerning. I see two problems. The first problem is that clk-rpmh is > only voting on the active state for the IPA clk. Maybe RPMh will move > the sleep/wake state over from the active state automatically as long as > we never make a request in either of those states? I don't know, but it > looks concerning and either needs to be fixed or a big comment > indicating that the copy happens. This patch makes it set a frequency in > each state bucket. > This patch doesn't seem to matter. I think that's because RPMh copies over active state settings to sleep and wake state automatically if those states are never changed by that processor. Someone at qcom needs to confirm this theory. I'll send the patch and see if that spurs someone at qcom to respond. > > The second problem I see is that the IPA clk resource is "IP0" but it is > still defined in the interconnect/qcom/sc7180.c file. > > $ git grep \"IP0\" -- drivers/{interconnect,clk}/ > drivers/clk/qcom/clk-rpmh.c:DEFINE_CLK_RPMH_BCM(sdm845, ipa, "IP0"); > drivers/clk/qcom/clk-rpmh.c:DEFINE_CLK_RPMH_BCM(sdx55, ipa, "IP0"); > drivers/interconnect/qcom/sc7180.c:DEFINE_QBCM(bcm_ip0, "IP0", false, &ipa_core_slave); > drivers/interconnect/qcom/sc8180x.c:DEFINE_QBCM(bcm_ip0, "IP0", false, &slv_ipa_core_slave); > drivers/interconnect/qcom/sdx55.c:DEFINE_QBCM(bcm_ip0, "IP0", false, &ipa_core_slave); > drivers/interconnect/qcom/sm8150.c:DEFINE_QBCM(bcm_ip0, "IP0", false, &ipa_core_slave); > drivers/interconnect/qcom/sm8250.c:DEFINE_QBCM(bcm_ip0, "IP0", false, &ipa_core_slave); > > That sounds pretty bad because it means both interconnect and clk frameworks > are trying to control the same RPMh resource, trampling over each other. > Combine that with unused clks and sync_state support and I don't know > what the state of the IP0 resource really is anymore. Alright, some more debugging has confirmed this. I put a print in the rpmh driver to figure out when the IP0 resource, according to cmd-db, is being modified. Luckily, the interconnect driver uses the rpmh batch API while the clk driver uses the direct write API. I put a dump_stack() when the batch API is called on the IP0 address and that sometimes gets zeroed out (i.e. IPA clk is turned off) after the rpmh clk driver turned it ON. The rpmh driver in the kernel doesn't do any aggregation between kernel clients, so the problem is this sequence: IPA driver probes runtime PM get() IPA clk enabled -> IP0 resource is ON interconnect zeroes out the IP0 resource -> IP0 resource is off IPA driver tries to access a register and blows up The interconnect framework is zeroing it out now because there's a sync_state callback once all drivers have probed. This is why I saw it more easily when the IPA driver is builtin vs. a module. The IPA driver is much more likely to have turned on the resource before sync_state is called, but the use of autosuspend made it variable. If the IPA driver autosuspend callback is delayed just enough, then IPA will turn on the IP0 resource via the clk API, sync state will turn it off from the interconnect framework, and then the pm_runtime_get_sync() in the IPA driver will skip powering the clk on again because it never turned it off. The runtime PM state of the device is correct, but the clk is off. Oops! I think changing the IPA_AUTOSUSPEND_DELAY define to be multiple seconds makes it even more likely, because then the IPA device definitely won't be suspended by the time interconnect sync state happens. Anyway, this also explains why the IPA runtime PM patch made things better. Sometimes the IPA device will be runtime suspended, and thus it will power on the IP0 resource on runtime resume even after interconnect sync state happened. This is the situation on v5.10.y, where runtime PM isn't happening at all, but sync state is once commit b95b668eaaa2 ("interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate") is included. The IP0 resource is guaranteed to be turned off after sync state drops the request. So the fix is simple, completely remove IP0 from the interconnect driver so that sync state doesn't turn it off. Then the IPA driver without runtime PM will work because the clk resource is turned on and kept on until system suspend. When the runtime PM patch in IPA is applied, it will also work because runtime PM will power things on correctly, or keep them powered on because autosuspend is delayed. I'll send this patch to the list shortly. And split it up for the other drivers too.