It seems like 'CSWR' idle may never have actually worked properly on the OMAP5... As an experiment, I took the old TI 3.8.y GLSDK kernel, commit 2c871a879dbb4234232126f7075468d5bf0a50e3 and made the following changes: - Enabling CONFIG_CPU_IDLE as this was not in omap2plus_defconfig back then - Disabling all the kernel debugging related config, as these seem to significantly reduce the frequency of lockups - OSWR idle disabled, as this is known broken - Some small patches to get it working with gcc9, none of which touched any power management or idle code. And I saw lockups with an almost identical frequency to 5.6 and 5.7 with a similar config; and the same pipeline stalled error reported by CCS when connecting over JTAG. The only difference is the reported PC was a read instruction inside sched_clock rather than omap4_prminst_read_inst_reg. Would be interested to know if there is a backstory here? Could it be related to the bugs that stopped OSWR from working? Is there a glsdk kernel version that I missed where CSWR on the OMAP5 actually works reliably. If anyone wants to try reproducing this; the most important settings are: - CONFIG_CPU_IDLE=y - All kernel debugging settings disabled - CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y This will usually result in a lockup while idle at a login prompt within a few hours with no other hardware connected. A lockup usually occurs sooner (within 30 minutes) repeatedly wget'ing a 100MB test file in a loop. Best David On Sat, 2020-08-01 at 21:57 +0100, David Shah wrote: > A tiny bit more information, if anyone has any more ideas. > > I can confirm that this happened once with the device idle, and no > networking connection. > > Based on the information I have been able to extract, the call stack > does > seem to involve omap4_enter_lowpower but I can't be certain. > > The main JTAG access I have is to be able to read out what seems to > be > kernel virtual memory via the other, non-locked-up but WFI, core. I > attempted to add some tracing via writing a value to a global > variable > inside the problem function and then flushing the D$, but the delay > this > adds (or the cache flush itself) seems to stop the lockup from > occuring > most of the time. It did lock up once with this added, but then > reading > out that area of memory failed, possibly because the locked up core > was > confusing the cache coherency magic inside the cores. > > Since that lock-up I added 20 NOPs after the cache flush, to try and > make > sure the cache flush really does work, and with those added it does > not > lock up at all. > > Is there a better way to take advantage of this ability to read out > memory for debugging? > > Best > > David > > > On Sun, 2020-07-26 at 18:59 +0100, David Shah wrote: > > Hi all, > > > > I am looking into random lockups - significantly rarer than once a > > day > > in typical usage, various patterns like lots of bursty network > > traffic > > increase frequency - that affect both the uEVM and the Pyra (also > > OMAP5432 based) on newer kernels (currently testing with 5.6 but I > > have > > seen lockups with 5.7 too). > > > > Currently I'm working with the uEVM as it is a bit easier to > > connect > > the JTAG adapter. I managed to get a lockup with the JTAG attached, > > and > > unfortunately the processor is badly locked up enough (presumably a > > stuck memory bus?) that JTAG isn't able to get a register dump or > > stacktrace. But I do get the following error which at least gives a > > PC: > > > > CortexA15_0: Trouble Halting Target CPU: (Error -1323 @ 0xC0223E0C) > > Device failed to enter debug/halt mode because pipeline is stalled. > > Power-cycle the board. If error persists, confirm configuration > > and/or > > try more reliable JTAG settings (e.g. lower TCLK). (Emulation > > package > > 9.2.0.00002) > > > > The second core is just sitting at WFI, don't think there is > > anything > > suspicious about that. > > > > Looking at the kernel disassembly this is the actual register read > > (ldr > > r0, [r1]) part of omap4_prminst_read_inst_reg. > > > > My best guess is that it is trying to read from a register that > > doesn't > > exist or isn't responding due to the current power configuration, > > but I > > wonder if anyone has seen this before or has any more clues on how > > to > > debug this? It's a shame that I can't seem to see what r1 is or get > > a > > backtrace. It looks like it might be possible to set some kind of > > timeout on the interconnect, has anyone tried something like that > > to > > debug this kind of issue? > > > > Best > > > > David Shah > > > > > > _______________________________________________ > https://projects.goldelico.com/p/gta04-kernel/ > Letux-kernel mailing list > Letux-kernel@xxxxxxxxxxxxxxx > http://lists.goldelico.com/mailman/listinfo.cgi/letux-kernel