Re: [PATCH v2] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid

On Wed, Jul 3, 2024 at 4:00 PM Daniel Lezcano <daniel.lezcano@xxxxxxxxxx> wrote:
>
> On 03/07/2024 14:43, neil.armstrong@xxxxxxxxxx wrote:
> > Hi,
> >
> > On 03/07/2024 14:25, Daniel Lezcano wrote:
> >>
> >> Hi Neil,
> >>
> >> it seems there is something wrong with the driver actually.
> >>
> > >> There can be a moment where the sensor is not yet initialized for
> > >> various reasons, so reading the temperature fails. The routine will
> > >> just retry until the sensor gets ready.
> > >>
> > >> These errors suggest to me that the sensor for this specific
> > >> thermal zone is never ready, which may be the root cause of your
> > >> issue. The change is exposing this problem IMO.
> >
> > Probably, but it gets printed every second until system shutdown, but
> > only for a single thermal_zone.
> >
> > Using v1 of Rafael's patch makes the message disappear completely.
>
> Yes, because you probably have the thermal zone polling delay set to
> zero, so it fails the first time and no longer tries to set it up
> again. V1 is an incomplete fix.
>
> Very likely the problem is in the sensor platform driver, or in the
> thermal zone description in the device tree, which describes a
> non-functional thermal zone.

I agree, but polling this useless thermal zone forever is not
particularly useful.

I was kind of afraid that something like this would happen, but then I
didn't want to complicate the patch unnecessarily until I knew that it
really would happen.

So attached is a modification of the $subject patch that will double
the temperature recheck delay after every failed attempt to get the
zone temperature and it will give up eventually (in this particular
version, after the recheck delay exceeds 30 s).

I would appreciate it if you could give it a go (obviously, by
replacing the $subject one with it).
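
To make the backoff arithmetic concrete, here is a small user-space
sketch of the doubling scheme described above. It is not kernel code:
the SKETCH_* names, struct recheck_state and both helper functions are
made up for illustration, and HZ is assumed to be 250 only as an
example value (the real one depends on the kernel configuration).

```c
#include <stdbool.h>

/* Assumption: example tick rate only; the kernel's HZ is config-dependent. */
#define SKETCH_HZ			250UL
/* Mirrors the idea of THERMAL_MAX_RECHECK_DELAY (30 s, in jiffies). */
#define SKETCH_MAX_RECHECK_DELAY	(30 * SKETCH_HZ)

/* Hypothetical bookkeeping for one zone whose temperature stays invalid. */
struct recheck_state {
	unsigned long delay_jiffies;	/* delay to use for the next recheck */
	unsigned long total_jiffies;	/* sum of all delays scheduled so far */
	unsigned int attempts;		/* number of rechecks scheduled */
};

/*
 * Schedule one more recheck unless the delay has grown past the limit.
 * Returns true if a recheck was scheduled, false once we give up.
 */
static bool schedule_recheck(struct recheck_state *s, unsigned long max_delay)
{
	if (s->delay_jiffies > max_delay)
		return false;

	s->total_jiffies += s->delay_jiffies;
	s->attempts++;
	/* Double the recheck delay for the next attempt. */
	s->delay_jiffies += s->delay_jiffies;
	return true;
}

/* Simulate a zone whose temperature never becomes valid. */
static struct recheck_state simulate_failing_zone(unsigned long max_delay)
{
	struct recheck_state s = { .delay_jiffies = 1 };

	while (schedule_recheck(&s, max_delay))
		;
	return s;
}
```

Under that HZ assumption, simulate_failing_zone(SKETCH_MAX_RECHECK_DELAY)
schedules 13 rechecks covering 8191 jiffies in total (a bit under 33 s)
before the delay exceeds the limit and the sketch gives up.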

From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
Subject: [PATCH v3] thermal: core: Call monitor_thermal_zone() if zone temperature is invalid

Commit 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip()
if zone temperature is invalid") caused __thermal_zone_device_update()
to return early if the current thermal zone temperature was invalid.

This was done to avoid running handle_thermal_trip() and governor
callbacks in that case which led to confusion.  However, it went too
far because monitor_thermal_zone() still needs to be called even when
the zone temperature is invalid to ensure that it will be updated
eventually in case thermal polling is enabled and the driver has no
other means to notify the core of zone temperature changes (for example,
it does not register an interrupt handler or ACPI notifier).

Also if the .set_trips() zone callback is expected to set up monitoring
interrupts for a thermal zone, it needs to be provided with valid
boundaries and that can only be done if the zone temperature is known.

Accordingly, to ensure that __thermal_zone_device_update() will
run again after a failing zone temperature check, make it call
monitor_thermal_zone() regardless of whether or not the zone
temperature is valid and make the latter schedule a thermal zone
temperature update if the zone temperature is invalid even if
polling is not enabled for the thermal zone (however, if this
continues to fail, give up after some time).

Fixes: 202aa0d4bb53 ("thermal: core: Do not call handle_thermal_trip() if zone temperature is invalid")
Reported-by: Daniel Lezcano <daniel.lezcano@xxxxxxxxxx>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
---
 drivers/thermal/thermal_core.c |   13 ++++++++++++-
 drivers/thermal/thermal_core.h |    9 +++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

Index: linux-pm/drivers/thermal/thermal_core.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.c
+++ linux-pm/drivers/thermal/thermal_core.c
@@ -300,6 +300,14 @@ static void monitor_thermal_zone(struct
 		thermal_zone_device_set_polling(tz, tz->passive_delay_jiffies);
 	else if (tz->polling_delay_jiffies)
 		thermal_zone_device_set_polling(tz, tz->polling_delay_jiffies);
+	else if (tz->temperature == THERMAL_TEMP_INVALID &&
+		 tz->recheck_delay_jiffies <= THERMAL_MAX_RECHECK_DELAY) {
+		thermal_zone_device_set_polling(tz, tz->recheck_delay_jiffies);
+		/* Double the recheck delay for the next attempt. */
+		tz->recheck_delay_jiffies += tz->recheck_delay_jiffies;
+		if (tz->recheck_delay_jiffies > THERMAL_MAX_RECHECK_DELAY)
+			dev_info(&tz->device, "Temperature unknown, giving up\n");
+	}
 }
 
 static struct thermal_governor *thermal_get_tz_governor(struct thermal_zone_device *tz)
@@ -430,6 +438,7 @@ static void update_temperature(struct th
 
 	tz->last_temperature = tz->temperature;
 	tz->temperature = temp;
+	tz->recheck_delay_jiffies = 1;
 
 	trace_thermal_temperature(tz);
 
@@ -514,7 +523,7 @@ void __thermal_zone_device_update(struct
 	update_temperature(tz);
 
 	if (tz->temperature == THERMAL_TEMP_INVALID)
-		return;
+		goto monitor;
 
 	tz->notify_event = event;
 
@@ -536,6 +545,7 @@ void __thermal_zone_device_update(struct
 
 	thermal_debug_update_trip_stats(tz);
 
+monitor:
 	monitor_thermal_zone(tz);
 }
 
@@ -1438,6 +1448,7 @@ thermal_zone_device_register_with_trips(
 
 	thermal_set_delay_jiffies(&tz->passive_delay_jiffies, passive_delay);
 	thermal_set_delay_jiffies(&tz->polling_delay_jiffies, polling_delay);
+	tz->recheck_delay_jiffies = 1;
 
 	/* sys I/F */
 	/* Add nodes that are always present via .groups */
Index: linux-pm/drivers/thermal/thermal_core.h
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.h
+++ linux-pm/drivers/thermal/thermal_core.h
@@ -67,6 +67,8 @@ struct thermal_governor {
  * @polling_delay_jiffies: number of jiffies to wait between polls when
  *			checking whether trip points have been crossed (0 for
  *			interrupt driven systems)
+ * @recheck_delay_jiffies: delay after a failed thermal zone temperature check
+ * 			before attempting to check it again
  * @temperature:	current temperature.  This is only for core code,
  *			drivers should use thermal_zone_get_temp() to get the
  *			current temperature
@@ -108,6 +110,7 @@ struct thermal_zone_device {
 	int num_trips;
 	unsigned long passive_delay_jiffies;
 	unsigned long polling_delay_jiffies;
+	unsigned long recheck_delay_jiffies;
 	int temperature;
 	int last_temperature;
 	int emul_temperature;
@@ -133,6 +136,12 @@ struct thermal_zone_device {
 	struct thermal_trip_desc trips[] __counted_by(num_trips);
 };
 
+/*
+ * Maximum delay after a failing thermal zone temperature check before
+ * attempting to check it again (in jiffies).
+ */
+#define THERMAL_MAX_RECHECK_DELAY	(30 * HZ)
+
 /* Default Thermal Governor */
 #if defined(CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE)
 #define DEFAULT_THERMAL_GOVERNOR       "step_wise"
