Re: nouveau shuts the machine down with v3.9-rc1 (temperature (72 C) hit the 'shutdown' threshold).

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 04/03/2013 22:41, Konrad Rzeszutek Wilk wrote:
Pls CC me in case you would like me also to test them with the mdelay patch.

Hi Konrad,

Marcin proposed me another explanation for the issue you are seeing and it made me look again at the code.

I don't have enough nv4x hw to test all the conditions but with the attached patches, you may get a saner
behaviour than a computer that shut-downs whenever you turn it on (like a "most useless machine ever").
The most important patch is the 8th one.

Please try applying them on top of your 3.9-rc1 kernel and send me back your kernel logs + sensors output.

Cheers,
Martin

PS: The attached patches are parts of my current thermal-related queue. I'll post them soon to the list.
- http://gitorious.org/linux-nouveau-pm/linux-nouveau-pm/commits/thermal

>From e2a10f1e7060b0cbabb032590ec2588f952016fc Mon Sep 17 00:00:00 2001
From: Martin Peres <martin.peres@xxxxxxxx>
Date: Tue, 5 Mar 2013 10:26:30 +0100
Subject: [PATCH 1/8] drm/nv40/therm: improve selection between the old and the
 new style

The condition to select between the old and new style was a thinko
as rnndb orders chipsets based on their release date (or general
chronologie hw-wise) and not based on their chipset number.

As the nv40 family is a mess when it comes to numbers, this patch
introduces a switch-based selection between the old and new style.

Signed-off-by: Martin Peres <martin.peres@xxxxxxxx>
---
 drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c | 50 ++++++++++++++++++------
 1 file changed, 38 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c b/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
index 0f5363e..9d9ecaa 100644
--- a/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
+++ b/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
@@ -29,42 +29,68 @@ struct nv40_therm_priv {
 	struct nouveau_therm_priv base;
 };
 
+enum nv40_sensor_style { INVALID_STYLE = -1, OLD_STYLE = 0, NEW_STYLE = 1 };
+
+static enum nv40_sensor_style
+nv40_is_older_style_sensor(struct nouveau_therm *therm)
+{
+	struct nouveau_device *device = nv_device(therm);
+
+	switch (device->chipset) {
+	case 0x43:
+	case 0x44:
+	case 0x4a:
+	case 0x47:
+		return OLD_STYLE;
+
+	case 0x46:
+	case 0x49:
+	case 0x4b:
+	case 0x4e:
+	case 0x4c:
+	case 0x67:
+	case 0x68:
+	case 0x63:
+		return NEW_STYLE;
+	default:
+		return INVALID_STYLE;
+	}
+}
+
 static int
 nv40_sensor_setup(struct nouveau_therm *therm)
 {
-	struct nouveau_device *device = nv_device(therm);
+	enum nv40_sensor_style style = nv40_is_older_style_sensor(therm);
 
 	/* enable ADC readout and disable the ALARM threshold */
-	if (device->chipset >= 0x46) {
+	if (style == NEW_STYLE) {
 		nv_mask(therm, 0x15b8, 0x80000000, 0);
 		nv_wr32(therm, 0x15b0, 0x80003fff);
 		mdelay(10); /* wait for the temperature to stabilize */
 		return nv_rd32(therm, 0x15b4) & 0x3fff;
-	} else {
+	} else if (style == OLD_STYLE) {
 		nv_wr32(therm, 0x15b0, 0xff);
 		return nv_rd32(therm, 0x15b4) & 0xff;
-	}
+	} else
+		return -ENODEV;
 }
 
 static int
 nv40_temp_get(struct nouveau_therm *therm)
 {
 	struct nouveau_therm_priv *priv = (void *)therm;
-	struct nouveau_device *device = nv_device(therm);
 	struct nvbios_therm_sensor *sensor = &priv->bios_sensor;
+	enum nv40_sensor_style style = nv40_is_older_style_sensor(therm);
 	int core_temp;
 
-	if (device->chipset >= 0x46) {
+	if (style == NEW_STYLE) {
 		nv_wr32(therm, 0x15b0, 0x80003fff);
 		core_temp = nv_rd32(therm, 0x15b4) & 0x3fff;
-	} else {
+	} else if (style == OLD_STYLE) {
 		nv_wr32(therm, 0x15b0, 0xff);
 		core_temp = nv_rd32(therm, 0x15b4) & 0xff;
-	}
-
-	/* Setup the sensor if the temperature is 0 */
-	if (core_temp == 0)
-		core_temp = nv40_sensor_setup(therm);
+	} else
+		return -ENODEV;
 
 	if (sensor->slope_div == 0)
 		sensor->slope_div = 1;
-- 
1.8.1.5

>From 38893c70faa9bbedded908694c3344d755cb05bf Mon Sep 17 00:00:00 2001
From: Martin Peres <martin.peres@xxxxxxxx>
Date: Tue, 5 Mar 2013 10:35:20 +0100
Subject: [PATCH 2/8] drm/nv40/therm: increase the sensor's settling delay to
 20ms

Based on my experience, 10ms wasn't always enough. Let's bump that
to a little more.

If this turns out to be insufficient-enough again, then an approach
based on letting the sensor settle for several seconds before starting
polling on the temperature would be better suited. This way, boot time
wouldn't be impacted by those waits too much.

Signed-off-by: Martin Peres <martin.peres@xxxxxxxx>
---
 drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c b/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
index 9d9ecaa..818060e 100644
--- a/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
+++ b/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
@@ -66,10 +66,11 @@ nv40_sensor_setup(struct nouveau_therm *therm)
 	if (style == NEW_STYLE) {
 		nv_mask(therm, 0x15b8, 0x80000000, 0);
 		nv_wr32(therm, 0x15b0, 0x80003fff);
-		mdelay(10); /* wait for the temperature to stabilize */
+		mdelay(20); /* wait for the temperature to stabilize */
 		return nv_rd32(therm, 0x15b4) & 0x3fff;
 	} else if (style == OLD_STYLE) {
 		nv_wr32(therm, 0x15b0, 0xff);
+		mdelay(20); /* wait for the temperature to stabilize */
 		return nv_rd32(therm, 0x15b4) & 0xff;
 	} else
 		return -ENODEV;
-- 
1.8.1.5

>From 9b9ab3da97590816b9c61e9d272bbcb712db8d42 Mon Sep 17 00:00:00 2001
From: Martin Peres <martin.peres@xxxxxxxx>
Date: Tue, 5 Mar 2013 10:44:12 +0100
Subject: [PATCH 3/8] drm/nouveau/therm: do not make assumptions on temperature

In nouveau_therm_sensor_event, temperature is stored as an uint8_t
even though the original interface returns an int.

This change should make it more obvious when the sensor is either
very-ill-calibrated or when we selected the wrong sensor style
on the nv40 family.

Signed-off-by: Martin Peres <martin.peres@xxxxxxxx>
---
 drivers/gpu/drm/nouveau/core/subdev/therm/temp.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/core/subdev/therm/temp.c b/drivers/gpu/drm/nouveau/core/subdev/therm/temp.c
index b37624a..0a17b95 100644
--- a/drivers/gpu/drm/nouveau/core/subdev/therm/temp.c
+++ b/drivers/gpu/drm/nouveau/core/subdev/therm/temp.c
@@ -106,16 +106,16 @@ void nouveau_therm_sensor_event(struct nouveau_therm *therm,
 	const char *thresolds[] = {
 		"fanboost", "downclock", "critical", "shutdown"
 	};
-	uint8_t temperature = therm->temp_get(therm);
+	int temperature = therm->temp_get(therm);
 
 	if (thrs < 0 || thrs > 3)
 		return;
 
 	if (dir == NOUVEAU_THERM_THRS_FALLING)
-		nv_info(therm, "temperature (%u C) went below the '%s' threshold\n",
+		nv_info(therm, "temperature (%i C) went below the '%s' threshold\n",
 			temperature, thresolds[thrs]);
 	else
-		nv_info(therm, "temperature (%u C) hit the '%s' threshold\n",
+		nv_info(therm, "temperature (%i C) hit the '%s' threshold\n",
 			temperature, thresolds[thrs]);
 
 	active = (dir == NOUVEAU_THERM_THRS_RISING);
-- 
1.8.1.5

>From c277fbe506f32cf624d302eafa210746ed796b96 Mon Sep 17 00:00:00 2001
From: Martin Peres <martin.peres@xxxxxxxx>
Date: Tue, 5 Mar 2013 10:58:59 +0100
Subject: [PATCH 4/8] drm/nouveau/therm: remove some confusion introduced by
 therm_mode

The kernel message "[  PTHERM][0000:01:00.0] Thermal management: disabled"
is misleading as it actually means "fan management: disabled".

This patch fixes both the source and the message to improve readability.

Signed-off-by: Martin Peres <martin.peres@xxxxxxxx>
---
 drivers/gpu/drm/nouveau/core/include/subdev/therm.h |  2 +-
 drivers/gpu/drm/nouveau/core/subdev/therm/base.c    | 10 +++++-----
 drivers/gpu/drm/nouveau/core/subdev/therm/priv.h    |  2 +-
 drivers/gpu/drm/nouveau/core/subdev/therm/temp.c    |  2 +-
 4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/core/include/subdev/therm.h b/drivers/gpu/drm/nouveau/core/include/subdev/therm.h
index 6b17b61..0b20fc0 100644
--- a/drivers/gpu/drm/nouveau/core/include/subdev/therm.h
+++ b/drivers/gpu/drm/nouveau/core/include/subdev/therm.h
@@ -4,7 +4,7 @@
 #include <core/device.h>
 #include <core/subdev.h>
 
-enum nouveau_therm_mode {
+enum nouveau_therm_fan_mode {
 	NOUVEAU_THERM_CTRL_NONE = 0,
 	NOUVEAU_THERM_CTRL_MANUAL = 1,
 	NOUVEAU_THERM_CTRL_AUTO = 2,
diff --git a/drivers/gpu/drm/nouveau/core/subdev/therm/base.c b/drivers/gpu/drm/nouveau/core/subdev/therm/base.c
index f794dc8..321a55b 100644
--- a/drivers/gpu/drm/nouveau/core/subdev/therm/base.c
+++ b/drivers/gpu/drm/nouveau/core/subdev/therm/base.c
@@ -134,7 +134,7 @@ nouveau_therm_alarm(struct nouveau_alarm *alarm)
 }
 
 int
-nouveau_therm_mode(struct nouveau_therm *therm, int mode)
+nouveau_therm_fan_mode(struct nouveau_therm *therm, int mode)
 {
 	struct nouveau_therm_priv *priv = (void *)therm;
 	struct nouveau_device *device = nv_device(therm);
@@ -152,7 +152,7 @@ nouveau_therm_mode(struct nouveau_therm *therm, int mode)
 	if (priv->mode == mode)
 		return 0;
 
-	nv_info(therm, "Thermal management: %s\n", name[mode]);
+	nv_info(therm, "fan management: %s\n", name[mode]);
 	nouveau_therm_update(therm, mode);
 	return 0;
 }
@@ -213,7 +213,7 @@ nouveau_therm_attr_set(struct nouveau_therm *therm,
 		priv->fan->bios.max_duty = value;
 		return 0;
 	case NOUVEAU_THERM_ATTR_FAN_MODE:
-		return nouveau_therm_mode(therm, value);
+		return nouveau_therm_fan_mode(therm, value);
 	case NOUVEAU_THERM_ATTR_THRS_FAN_BOOST:
 		priv->bios_sensor.thrs_fan_boost.temp = value;
 		priv->sensor.program_alarms(therm);
@@ -263,7 +263,7 @@ _nouveau_therm_init(struct nouveau_object *object)
 		return ret;
 
 	if (priv->suspend >= 0)
-		nouveau_therm_mode(therm, priv->mode);
+		nouveau_therm_fan_mode(therm, priv->mode);
 	priv->sensor.program_alarms(therm);
 	return 0;
 }
@@ -317,7 +317,7 @@ nouveau_therm_preinit(struct nouveau_therm *therm)
 	nouveau_therm_sensor_ctor(therm);
 	nouveau_therm_fan_ctor(therm);
 
-	nouveau_therm_mode(therm, NOUVEAU_THERM_CTRL_NONE);
+	nouveau_therm_fan_mode(therm, NOUVEAU_THERM_CTRL_NONE);
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/nouveau/core/subdev/therm/priv.h b/drivers/gpu/drm/nouveau/core/subdev/therm/priv.h
index 06b9870..d8483dd 100644
--- a/drivers/gpu/drm/nouveau/core/subdev/therm/priv.h
+++ b/drivers/gpu/drm/nouveau/core/subdev/therm/priv.h
@@ -102,7 +102,7 @@ struct nouveau_therm_priv {
 	struct i2c_client *ic;
 };
 
-int nouveau_therm_mode(struct nouveau_therm *therm, int mode);
+int nouveau_therm_fan_mode(struct nouveau_therm *therm, int mode);
 int nouveau_therm_attr_get(struct nouveau_therm *therm,
 		       enum nouveau_therm_attr_type type);
 int nouveau_therm_attr_set(struct nouveau_therm *therm,
diff --git a/drivers/gpu/drm/nouveau/core/subdev/therm/temp.c b/drivers/gpu/drm/nouveau/core/subdev/therm/temp.c
index 0a17b95..441f60b 100644
--- a/drivers/gpu/drm/nouveau/core/subdev/therm/temp.c
+++ b/drivers/gpu/drm/nouveau/core/subdev/therm/temp.c
@@ -123,7 +123,7 @@ void nouveau_therm_sensor_event(struct nouveau_therm *therm,
 	case NOUVEAU_THERM_THRS_FANBOOST:
 		if (active) {
 			nouveau_therm_fan_set(therm, true, 100);
-			nouveau_therm_mode(therm, NOUVEAU_THERM_CTRL_AUTO);
+			nouveau_therm_fan_mode(therm, NOUVEAU_THERM_CTRL_AUTO);
 		}
 		break;
 	case NOUVEAU_THERM_THRS_DOWNCLOCK:
-- 
1.8.1.5

>From 60dce3447342d7bb1122e90c3f0aa63573e0a9b4 Mon Sep 17 00:00:00 2001
From: Martin Peres <martin.peres@xxxxxxxx>
Date: Tue, 5 Mar 2013 10:38:37 +0100
Subject: [PATCH 8/8] drm/nv40/therm: <DO NOT PUSH> move nv4c to the newer
 temperature-reading style

This is a guess made by joi and that may quite likely be true
---
 drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c b/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
index d546ada..2b24667 100644
--- a/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
+++ b/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c
@@ -41,13 +41,13 @@ nv40_is_older_style_sensor(struct nouveau_therm *therm)
 	case 0x44:
 	case 0x4a:
 	case 0x47:
+	case 0x4c:
 		return OLD_STYLE;
 
 	case 0x46:
 	case 0x49:
 	case 0x4b:
 	case 0x4e:
-	case 0x4c:
 	case 0x67:
 	case 0x68:
 	case 0x63:
-- 
1.8.1.5

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel

[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux