Hi,

On Wed, Jul 3, 2024 at 1:04 PM Neil Armstrong <neil.armstrong@xxxxxxxxxx> wrote:
>
> Hi,
>
> On 28/06/2024 14:10, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> >
> > Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
> > if zone temperature is invalid") caused __thermal_zone_device_update()
> > to return early if the current thermal zone temperature was invalid.
> >
> > This was done to avoid running handle_thermal_trip() and governor
> > callbacks in that case which led to confusion.  However, it went too
> > far because monitor_thermal_zone() still needs to be called even when
> > the zone temperature is invalid to ensure that it will be updated
> > eventually in case thermal polling is enabled and the driver has no
> > other means to notify the core of zone temperature changes (for example,
> > it does not register an interrupt handler or ACPI notifier).
> >
> > Also if the .set_trips() zone callback is expected to set up monitoring
> > interrupts for a thermal zone, it has to be provided with valid
> > boundaries and that can only happen if the zone temperature is known.
> >
> > Accordingly, to ensure that __thermal_zone_device_update() will
> > run again after a failing zone temperature check, make it call
> > monitor_thermal_zone() regardless of whether or not the zone
> > temperature is valid and make the latter schedule a thermal zone
> > temperature update if the zone temperature is invalid even if
> > polling is not enabled for the thermal zone.
> >
> > Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
> > Reported-by: Daniel Lezcano <daniel.lezcano@xxxxxxxxxx>
> > Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
> > ---
> >  drivers/thermal/thermal_core.c | 5 ++++-
> >  drivers/thermal/thermal_core.h | 6 ++++++
> >  2 files changed, 10 insertions(+), 1 deletion(-)
> >
> > Index: linux-pm/drivers/thermal/thermal_core.c
> > ===================================================================
> > --- linux-pm.orig/drivers/thermal/thermal_core.c
> > +++ linux-pm/drivers/thermal/thermal_core.c
> > @@ -300,6 +300,8 @@ static void monitor_thermal_zone(struct
> >  		thermal_zone_device_set_polling(tz, tz->passive_delay_jiffies);
> >  	else if (tz->polling_delay_jiffies)
> >  		thermal_zone_device_set_polling(tz, tz->polling_delay_jiffies);
> > +	else if (tz->temperature == THERMAL_TEMP_INVALID)
> > +		thermal_zone_device_set_polling(tz, msecs_to_jiffies(THERMAL_RECHECK_DELAY_MS));
> >  }
> >
> >  static struct thermal_governor *thermal_get_tz_governor(struct thermal_zone_device *tz)
> > @@ -514,7 +516,7 @@ void __thermal_zone_device_update(struct
> >  	update_temperature(tz);
> >
> >  	if (tz->temperature == THERMAL_TEMP_INVALID)
> > -		return;
> > +		goto monitor;
> >
> >  	tz->notify_event = event;
> >
> > @@ -536,6 +538,7 @@ void __thermal_zone_device_update(struct
> >
> >  	thermal_debug_update_trip_stats(tz);
> >
> > +monitor:
> >  	monitor_thermal_zone(tz);
> >  }
> >
> > Index: linux-pm/drivers/thermal/thermal_core.h
> > ===================================================================
> > --- linux-pm.orig/drivers/thermal/thermal_core.h
> > +++ linux-pm/drivers/thermal/thermal_core.h
> > @@ -133,6 +133,12 @@ struct thermal_zone_device {
> >  	struct thermal_trip_desc trips[] __counted_by(num_trips);
> >  };
> >
> > +/*
> > + * Default delay after a failing thermal zone temperature check before
> > + * attempting to check it again.
> > + */
> > +#define THERMAL_RECHECK_DELAY_MS	100
> > +
> >  /* Default Thermal Governor */
> >  #if defined(CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE)
> >  #define DEFAULT_THERMAL_GOVERNOR       "step_wise"
> >
> >
> >
> >
>
> This patch on next-20240702 makes Qualcomm HDK8350, HDK8450, QRD8550, HDK8560, QRD8650 & HDK8650 output in loop:
>
> thermal thermal_zoneXX: failed to read out thermal zone (-19)

Is the loop endless?  If not, how many times does the message get printed?

If I'm not mistaken, it would be printed at least once without the
commit in question.  Can you please check that?

Also, can you check the previous version of the patch in question:

https://lore.kernel.org/linux-pm/2745114.mvXUDI8C0e@xxxxxxxxxxxxx/

and see if it has the same problem (just apply it instead of the
$subject one).

Thanks!
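
For reference, the rescheduling decision described in the changelog can be
modeled in user space roughly as below.  This is only a sketch and not the
kernel code: struct zone_model, monitor_delay_ms(), update_delay_ms() and
the "patched" flag are hypothetical names made up for illustration, delays
are in milliseconds rather than jiffies, and only the branches visible in
the quoted hunks are modeled.  A return value of 0 stands for "no update
scheduled".

```c
#include <stdbool.h>

#define THERMAL_TEMP_INVALID		(-274000)
#define THERMAL_RECHECK_DELAY_MS	100

/* Hypothetical, simplified stand-in for struct thermal_zone_device. */
struct zone_model {
	int passive;			/* > 0 while passive cooling is active */
	unsigned int passive_delay_ms;
	unsigned int polling_delay_ms;	/* 0 means polling is not enabled */
	int temperature;
};

/* Models the tail of monitor_thermal_zone() shown in the first hunk. */
static unsigned int monitor_delay_ms(const struct zone_model *z, bool patched)
{
	if (z->passive > 0)
		return z->passive_delay_ms;
	if (z->polling_delay_ms)
		return z->polling_delay_ms;
	/* Branch added by the patch: recheck even without polling. */
	if (patched && z->temperature == THERMAL_TEMP_INVALID)
		return THERMAL_RECHECK_DELAY_MS;
	return 0;
}

/* Models the control flow of __thermal_zone_device_update(). */
static unsigned int update_delay_ms(const struct zone_model *z, bool patched)
{
	/* Before the patch: early return, the monitor step is skipped. */
	if (z->temperature == THERMAL_TEMP_INVALID && !patched)
		return 0;
	/* After the patch: "goto monitor" always reaches the monitor step. */
	return monitor_delay_ms(z, patched);
}
```

In this model, a zone without polling whose temperature stays invalid gets
another update scheduled every THERMAL_RECHECK_DELAY_MS for as long as the
temperature remains invalid, whereas before the patch nothing was scheduled
for it at all.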