On Fri, 01 Nov 2024 14:19:54 +0000,
Johan Hovold <johan@xxxxxxxxxx> wrote:
>
> On Fri, Nov 01, 2024 at 02:08:24PM +0000, Marc Zyngier wrote:
>
> > I'm seeing similar things indeed. Randomly grepping in cpufreq/policy*
> > results in hard resets, although I don't get much on the serial
> > console when that happens. Interestingly, I also see some errors in
> > dmesg at boot time:
> >
> > maz@semi-fraudulent:~$ dmesg| grep -i scmi
> > [ 0.966175] scmi_core: SCMI protocol bus registered
> > [ 7.929710] arm-scmi arm-scmi.2.auto: Using scmi_mailbox_transport
> > [ 7.939059] arm-scmi arm-scmi.2.auto: SCMI max-rx-timeout: 30ms
> > [ 7.945567] arm-scmi arm-scmi.2.auto: SCMI RAW Mode initialized for instance 0
> > [ 7.958348] arm-scmi arm-scmi.2.auto: SCMI RAW Mode COEX enabled !
> > [ 7.978303] arm-scmi arm-scmi.2.auto: SCMI Notifications - Core Enabled.
> > [ 7.985351] arm-scmi arm-scmi.2.auto: SCMI Protocol v2.0 'Qualcomm:' Firmware version 0x20000
> > [ 8.033774] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> > [ 8.033902] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> > [ 8.036528] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> > [ 8.036744] arm-scmi arm-scmi.2.auto: Failed to add opps_by_lvl at 3801600 for NCC - ret:-16
> > [ 8.171232] scmi-perf-domain scmi_dev.4: Initialized 3 performance domains
> >
> > All these "Failed" are a bit worrying. Happy to put any theory to the
> > test.
>
> Yes, those warnings indeed look troubling. Fortunately they appear to be
> mostly benign and only indicate that the firmware is reporting duplicate
> OPPs, which the kernel is now ignoring without any other side effects
> than the warnings.

Right. Not something that would explain the hard reset behaviour then.
>
> The side-effects and these remaining warnings are addressed by this
> series:
>
> 	https://lore.kernel.org/all/20241030125512.2884761-1-quic_sibis@xxxxxxxxxxx/
>
> but I think we should try to make the warnings a bit more informative
> (and less scary) by printing something along the lines of:
>
> 	arm-scmi arm-scmi.0.auto: [Firmware Bug]: Ignoring duplicate OPP 3417600 for NCC
>
> instead.

Indeed. Seeing [Firmware Bug] has a comforting feeling of familiarity... :)

I wonder whether the same sort of resets happen on more "commercial"
systems (such as some of the laptops). You'd expect that people look at
the cpufreq stuff closely, and don't see things exploding like we are.

	M.

-- 
Without deviation from the norm, progress is not possible.
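
For illustration only, here is a minimal userspace sketch of the behaviour suggested in the quoted text: a duplicate performance level is skipped and reported with a "[Firmware Bug]" prefix instead of a generic failure. This is not the kernel's code. The table type, the helper names (opp_table_insert, add_reported_opp) and the fixed table size are hypothetical; the prefix string matches the kernel's FW_BUG macro, and -EBUSY is errno 16, the value shown as ret:-16 in the log above.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Same text as the kernel's FW_BUG prefix, redefined here for userspace. */
#define FW_BUG "[Firmware Bug]: "

#define MAX_OPPS 16

/* Hypothetical stand-in for a per-domain table of performance levels. */
struct opp_table {
	unsigned int level[MAX_OPPS];
	size_t count;
};

/* Insert a level; return -EBUSY if it is already present (a duplicate). */
static int opp_table_insert(struct opp_table *t, unsigned int level)
{
	for (size_t i = 0; i < t->count; i++)
		if (t->level[i] == level)
			return -EBUSY;
	if (t->count == MAX_OPPS)
		return -ENOMEM;
	t->level[t->count++] = level;
	return 0;
}

/*
 * Add one firmware-reported level. A duplicate is ignored and described in
 * buf; any other failure is passed back to the caller unchanged. buf is left
 * empty when there is nothing to report.
 */
static int add_reported_opp(struct opp_table *t, const char *dev,
			    const char *domain, unsigned int level,
			    char *buf, size_t len)
{
	int ret;

	if (len > 0)
		buf[0] = '\0';

	ret = opp_table_insert(t, level);
	if (ret == -EBUSY) {
		snprintf(buf, len,
			 "%s: " FW_BUG "Ignoring duplicate OPP %u for %s",
			 dev, level, domain);
		return 0;
	}
	return ret;
}
```

Calling add_reported_opp() twice with the same level leaves one entry in the table and fills buf with a message in the suggested format on the second call, while both calls return 0 so the caller carries on with the remaining levels.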