On 04/02/2020 12:53, Valentin Schneider wrote:
> We've been getting some sporadic failures on the big CPUs of a Pixel3
> running mainline [1], here is an example of a correct run (CPU4):
>
> | frequency (kHz) | sysbench events |
> |-----------------+-----------------|
> |          825600 |             236 |
> |         1286400 |             369 |
> |         1689600 |             483 |
> |         2092800 |             600 |
> |         2476800 |             711 |
>
> and here is a failed one (still CPU4):
>
> | frequency (kHz) | sysbench events |
> |-----------------+-----------------|
> |          825600 |             234 |
> |         1286400 |             369 |
> |         1689600 |             449 |
> |         2092800 |             600 |
> |         2476800 |             355 |
>
>
> We've encountered something like this in the past with the exact same
> test on h960 [2] but it is much harder to reproduce reliably this time
> around.
>
> I haven't found much time to dig into this; I did get a run of ~100
> iterations with about ~15 failures, but nothing cpufreq related showed up in
> dmesg. I briefly suspected fast-switch, but it's only used by schedutil, so
> in this test I would expect the frequency transition to be complete before we
> even try to start executing sysbench.
>

I've been adding some more debug stuff in that test case following some of
Lukasz' recommendations, and I still don't find anything that would explain
what I'm seeing. The raw output of the test is:

CPU0:
300000: 61
576000: 114
825600: 172
1056000: 221
1324800: 278
1612800: 339

CPU4:
825600: 236
1286400: 368
1689600: 479
2092800: 420 <---}
2476800: 339 <---} Both of these are not monotonically increasing...
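For reference, here is a minimal sketch of the kind of monotonicity check that flags the two CPU4 entries above. It is not the actual test code; the `results` layout and the `non_monotonic_steps` helper are made up for illustration, and only the numbers come from the raw output.

```python
# Scores copied from the raw output above: {cpu: [(freq_khz, sysbench_events), ...]}
results = {
    "CPU0": [(300000, 61), (576000, 114), (825600, 172),
             (1056000, 221), (1324800, 278), (1612800, 339)],
    "CPU4": [(825600, 236), (1286400, 368), (1689600, 479),
             (2092800, 420), (2476800, 339)],
}


def non_monotonic_steps(samples):
    """Return (freq_khz, events, previous_events) for each frequency step
    whose score is not strictly higher than the score one step below it."""
    ordered = sorted(samples)  # sort by frequency
    return [
        (freq, events, prev_events)
        for (_, prev_events), (freq, events) in zip(ordered, ordered[1:])
        if events <= prev_events
    ]


for cpu, samples in results.items():
    for freq, events, prev_events in non_monotonic_steps(samples):
        print(f"{cpu}: {freq} kHz scored {events}, previous step scored {prev_events}")
```

On the data above this reports nothing for CPU0 and both 2092800 and 2476800 for CPU4.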

/sys/kernel/debug/clk/clk_summary doesn't seem to include CPU clocks, or
doesn't get updated, because I see no diff from one frequency to another
(even between the lowest & highest tested frequencies).

/sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state does get updated,
and seems to hint that I am getting the frequency I'm asking for:

[2020-02-12 14:48:21,706] 2476800 39544
[2020-02-12 14:48:23,929] 2476800 39745

There's roughly 10% (200ms) missing here, but that shouldn't lead to about
half the expected performance (I get ~710 "score" out of that 2.477GHz freq on
non-failing runs).

I also made sure to read back
/sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
and I do see the value I've asked for.

Finally, I also probed the thermal state via
/sys/class/thermal/cooling_device*/cur_state
and they are *always* 0 (i.e., no throttling) right after finishing the
execution of the benchmark, which should be close to the "hottest" point.

So AFAICT there is nothing on the cpufreq side that hints at a slow or
unsuccessful frequency transition. Can FW mess about with frequencies without
notifying the kernel?

> If anyone has the time and will to look into this, that would be much
> appreciated.
>
> [1]: https://git.linaro.org/people/amit.pundir/linux.git/log/?h=blueline-mainline-tracking
> [2]: https://lore.kernel.org/lkml/d3ede0ab-b635-344c-faba-a9b1531b7f05@xxxxxxx/
>
> Cheers,
> Valentin
>
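
The "roughly 10% (200ms) missing" figure for time_in_state can be sketched as below. This assumes time_in_state counts in 10 ms units, as described in the kernel's cpufreq-stats documentation; the `parse_sample` and `missing_time` helpers are hypothetical, and the two sample lines are the ones quoted earlier in this message.

```python
from datetime import datetime

# Assumption: one time_in_state unit is 10 ms (cpufreq-stats documentation).
TICK_SECONDS = 0.01


def parse_sample(line):
    """Split '[timestamp] freq_khz ticks' into (datetime, freq_khz, ticks)."""
    stamp, rest = line.lstrip("[").split("] ")
    freq_khz, ticks = rest.split()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f"), int(freq_khz), int(ticks)


def missing_time(before, after):
    """Return (wall_seconds, accounted_seconds, missing_seconds) between two samples."""
    t0, _, ticks0 = parse_sample(before)
    t1, _, ticks1 = parse_sample(after)
    wall = (t1 - t0).total_seconds()
    accounted = (ticks1 - ticks0) * TICK_SECONDS
    return wall, accounted, wall - accounted


wall, accounted, missing = missing_time(
    "[2020-02-12 14:48:21,706] 2476800 39544",
    "[2020-02-12 14:48:23,929] 2476800 39745",
)
print(f"wall {wall:.3f}s, accounted {accounted:.2f}s, "
      f"missing {missing * 1000:.0f}ms ({missing / wall:.1%})")
```

With those two samples, 201 units (2.01 s) are accounted for at 2476800 kHz against 2.223 s of wall clock, so about 213 ms is unaccounted for. Part of that gap may simply be the delay between logging the timestamp and reading the file.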