Kevin Hilman <khilman@xxxxxxxxxxxx> writes:

> Kever Yang <kever.yang@xxxxxxxxxxxxxx> writes:
>
>> Hi Kevin, Heiko,
>>
>> On 2019/8/22 2:59 AM, Kevin Hilman wrote:
>>> Hi Heiko,
>>>
>>> Heiko Stuebner <heiko@xxxxxxxxx> writes:
>>>
>>>> On Tuesday, 13 August 2019, 19:35:31 CEST, Kevin Hilman wrote:
>>>>> [ resent with correct addr for linux-rockchip list ]
>>>>>
>>>>> Mark Brown <broonie@xxxxxxxxxx> writes:
>>>>>
>>>>>> On Thu, Jul 18, 2019 at 04:28:08AM -0700, kernelci.org bot wrote:
>>>>>>
>>>>>> Today's -next started failing to boot defconfig on rk3399-firefly:
>>>>>>
>>>>>>> arm64:
>>>>>>>     defconfig:
>>>>>>>         gcc-8:
>>>>>>>             rk3399-firefly: 1 failed lab
>>>>>> It hits a BUG() trying to set up cpufreq:
>>>>>>
>>>>>> [ 87.381606] cpufreq: cpufreq_online: CPU0: Running at unlisted freq: 200000 KHz
>>>>>> [ 87.393244] cpufreq: cpufreq_online: CPU0: Unlisted initial frequency changed to: 408000 KHz
>>>>>> [ 87.469777] cpufreq: cpufreq_online: CPU4: Running at unlisted freq: 12000 KHz
>>>>>> [ 87.488595] cpu cpu4: _generic_set_opp_clk_only: failed to set clock rate: -22
>>>>>> [ 87.491881] cpufreq: __target_index: Failed to change cpu frequency: -22
>>>>>> [ 87.495335] ------------[ cut here ]------------
>>>>>> [ 87.496821] kernel BUG at drivers/cpufreq/cpufreq.c:1438!
>>>>>> [ 87.498462] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
>>>>>>
>>>>>> I'm struggling to see anything relevant in the diff from yesterday, the
>>>>>> unlisted frequency warnings were there in the logs yesterday but no oops
>>>>>> and I'm not seeing any changes in cpufreq, clk or anything relevant
>>>>>> looking.
>>>>>>
>>>>>> Full bootlog and other info can be found here:
>>>>>>
>>>>>> https://kernelci.org/boot/id/5d302d8359b51498d049e983/
>>>>> I confirm that disabling CPUfreq in the defconfig (CONFIG_CPU_FREQ=n)
>>>>> makes the firefly board start working again.
>>>>>
>>>>> Note that the default defconfig enables the "performance" CPUfreq
>>>>> governor as the default governor, so during kernel boot, it will always
>>>>> switch to the max frequency.
>>>>>
>>>>> For fun, I set the default governor to "userspace" so the kernel
>>>>> wouldn't make any OPP changes, and that leads to a slightly more
>>>>> informative splat[1]
>>>>>
>>>>> There is still an OPP change happening because the detected OPP is not
>>>>> one that's listed in the table, so it tries to change to a listed OPP
>>>>> and fails in the bowels of clk_set_rate()
>>>> Though I think that might only be a symptom as well.
>>>> Both the PLL setting code as well as the actual cpu-clock implementation
>>>> is unchanged since 2017 (and runs just fine on all boards in my farm).
>>>>
>>>> One source for these issues is often the regulator supplying the cpu
>>>> going haywire - aka the voltage not matching the opp.
>>>>
>>>> As in this error-case it's CPU4 being set, this would mean it might
>>>> be the big cluster supplied by the external syr825 (fan5355 clone)
>>>> that might act up. In the Firefly-rk3399 case this is even stranger.
>>>>
>>>> There is a discrepancy between the "fcs,suspend-voltage-selector"
>>>> between different bootloader versions (how the selection-pin is set up),
>>>> so the kernel might actually write his requested voltage to the wrong
>>>> register (not the one for actual voltage, but the second set used for
>>>> the suspend voltage).
>>>>
>>>> Did you by chance swap bootloaders at some point in recent past?
>>> No, haven't touched bootloader since I initially setup the board.
>>
>> The CPU voltage does not affect by bootloader for kernel should have its
>> own opp-table,
>>
>> the bootloader may only affect the center/logic power supply.
>>
>>>
>>>> I'd assume [2] might actually be the same issue last year, though
>>>> the CI-logs are not available anymore it seems.
>>>>
>>>> Could you try to set the vdd_cpu_b regulator to disabled, so that
>>>> cpufreq for this cluster defers and see what happens?
>>> Yes, this change[1] definitely makes things boot reliably again, so
>>> there's definitely something a bit unstable with this regulator, at
>>> least on this firefly.
>>
>> Is it possible to target which patch introduce this bug? This board
>> should have work correctly for a long time with upstream source code.
>
> Unfortunately, it seems to be a regular, but intermittent failure, so
> bisection is not producing anything reliable.
>
> You can see that both in mainline[1] and in linux-next[2] there are
> periodic failures, but it's hard to see any patterns.

Even worse, I (re)tested mainline for versions that were previously
passing (v5.2, v5.3-rc5) and they are also failing now. They work again
if I disable that regulator as suggested by Heiko.

So this is increasingly pointing to failing hardware.

Kevin