RE: [PATCH] TCO Watchdog warning interrupt driver creation

Any update?
Thanks,
François-Nicolas

-----Original Message-----
From: Muller, Francois-nicolas 
Sent: Thursday, February 12, 2015 11:14 AM
To: Rafael J. Wysocki
Cc: Guenter Roeck; Darren Hart; 'platform-driver-x86@xxxxxxxxxxxxxxx'; Linux ACPI Mailing List; linux-watchdog@xxxxxxxxxxxxxxx; Wim Van Sebroeck
Subject: RE: [PATCH] TCO Watchdog warning interrupt driver creation

Please find below a new version of the patch. It adds a documentation file and now builds with CONFIG_ACPI unset.

From a7135e6b4bc7c91d6ac72a4f38157f7f2308615b Mon Sep 17 00:00:00 2001
From: Francois-Nicolas Muller <francois-nicolas.muller@xxxxxxxxx>
Date: Tue, 20 Jan 2015 14:55:42 +0100
Subject: [PATCH] Adding TCO watchdog warning interrupt handling.

This feature is useful to root cause watchdog expiration.
It is activated by boot parameter 'warn_irq' (disabled by default).

Upon first expiration of the TCO watchdog, a warning interrupt fires and its handler dumps the registers and call stacks of all available CPUs.
The TCO watchdog then reloads with a 2.4-second timeout for the second expiration.

If CONFIG_ITCO_WARNING_PANIC is set, the warning interrupt also calls panic(), which notifies the panic handlers and then reboots the platform, depending on the CONFIG_PANIC_TIMEOUT value:

- If CONFIG_PANIC_TIMEOUT is zero or greater than 3 seconds, TCO watchdog will reset the platform if second expiration happens before TCO has been kicked again.

- If CONFIG_PANIC_TIMEOUT is < 0, platform will reboot immediately (emergency restart procedure).

- If CONFIG_PANIC_TIMEOUT is 1 or 2 seconds, platform will reboot after 1 or 2 seconds delay (emergency restart procedure).

See Documentation/watchdog/tco-wdt-warning-interrupt.txt for more details.

Change-Id: I7314a50812529423b117cf28e4a195a356da2f57
Signed-off-by: Francois-Nicolas Muller <francois-nicolas.muller@xxxxxxxxx>
---
 .../watchdog/tco-wdt-warning-interrupt.txt         |   85 ++++++++++++++++++++
 drivers/watchdog/Kconfig                           |   13 +++
 drivers/watchdog/iTCO_wdt.c                        |   80 ++++++++++++++++++
 3 files changed, 178 insertions(+)
 create mode 100644 Documentation/watchdog/tco-wdt-warning-interrupt.txt

diff --git a/Documentation/watchdog/tco-wdt-warning-interrupt.txt b/Documentation/watchdog/tco-wdt-warning-interrupt.txt
new file mode 100644
index 0000000..2e4eebf
--- /dev/null
+++ b/Documentation/watchdog/tco-wdt-warning-interrupt.txt
@@ -0,0 +1,85 @@
+Last reviewed: 02/12/2015
+
+                     TCO watchdog warning interrupt
+                 handled by drivers/watchdog/iTCO_wdt.c
+                      Documentation and code by
+       Francois-Nicolas Muller <francois-nicolas.muller@xxxxxxxxx>
+
+
+Introduction
+------------
+The Intel TCO watchdog is intended to detect and recover from lockups of
+the platform. It contains a countdown timer that software must reload
+in time, before it reaches zero.
+
+If the platform locks up and cannot reload the timer, then when the
+timer reaches zero:
+- the timer is automatically reloaded with 04h and starts decrementing again,
+- the timeout bit is set in the TCO1_STS register,
+- an SMI or SCI interrupt is generated (optional).
+
+If it reaches zero a second time while timeout bit is set,
+- second_to_sts bit is set,
+- reset of the platform is initiated.
+
+At the first timeout, the SMI (or SCI) can be used to provide debug
+information about the system state and help fix the cause of the
+hang. This is the "warning interrupt".
+
+Warning interrupt
+-----------------
+The warning interrupt handler is called when the system is hung, so it is
+useful to gather as much information as possible about the system state
+at this point for root-causing the issue.
+
+When the interrupt occurs,
+- call stacks of all running cpus are dumped,
+- panic() is called (optional)
+
+Enabling the warning interrupt
+------------------------------
+Boot parameter "warn_irq" (boolean) enables warning interrupt 
+generation at first timer expiration (disabled by default).
+
+As this is a command line option, the configuration can be changed easily
+without rebuilding the code.
+
+Enabling panic upon warning interrupt
+-------------------------------------
+Warning interrupt handler can call panic() when Kconfig option 
+CONFIG_ITCO_WARNING_PANIC is set.
+
+Calling panic() is useful when panic handlers have been registered
+and need to run at this point.
+
+When CONFIG_ITCO_WARNING_PANIC is set,
+- If CONFIG_PANIC_TIMEOUT is zero or greater than 3 seconds, TCO watchdog will
+  reset the platform if second expiration happens before TCO has been kicked
+  again.
+- If CONFIG_PANIC_TIMEOUT is < 0, platform will reboot immediately (emergency
+  restart procedure).
+- If CONFIG_PANIC_TIMEOUT is 1 or 2 seconds, platform will reboot after 1 or 2
+  seconds delay (emergency restart procedure).
+
+SCI vs SMI
+----------
+For the moment, the TCO watchdog warning interrupt feature is only
+available on platforms that can trigger an SCI upon first expiration
+of the TCO watchdog.
+
+The SMI option is not supported yet.
+
+ACPI configuration
+------------------
+The BIOS configures the GPE associated with the warning interrupt. The
+driver uses the ACPI tables to get the GPE number.
+
+This change is intended for the Intel Cherrytrail platform. As the TCO
+watchdog is part of the lpc_ich module, its _HID is used in the driver to
+retrieve the GPE configuration from the BIOS.
+
+If the BIOS provides no GPE information, the interrupt is not
+handled and appears in the dmesg log as a warning. The second timeout is
+still able to trigger a reset.
+
+-- Francois-Nicolas Muller
+   (francois-nicolas.muller@xxxxxxxxx)
diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig
index 79d2589..41f3647 100644
--- a/drivers/watchdog/Kconfig
+++ b/drivers/watchdog/Kconfig
@@ -674,6 +674,19 @@ config ITCO_VENDOR_SUPPORT
 	  devices. At this moment we only have additional support for some
 	  SuperMicro Inc. motherboards.
 
+config ITCO_WARNING_PANIC
+	bool "Intel TCO Timer/Watchdog panic on warning interrupt"
+	depends on ITCO_WDT && ACPI
+	default y
+	---help---
+	  Force a call to panic() when TCO warning interrupt occurs.
+
+	  Warning interrupt happens if warn_irq module parameter is set and
+	  TCO timer first expires.
+
+	  If not set, only cpu backtraces are dumped; panic() is not called
+	  and panic handlers are not notified.
+
 config IT8712F_WDT
 	tristate "IT8712F (Smart Guardian) Watchdog Timer"
 	depends on X86
diff --git a/drivers/watchdog/iTCO_wdt.c b/drivers/watchdog/iTCO_wdt.c
index e802a54..a25794c 100644
--- a/drivers/watchdog/iTCO_wdt.c
+++ b/drivers/watchdog/iTCO_wdt.c
@@ -49,6 +49,8 @@
 /* Module and version information */
 #define DRV_NAME	"iTCO_wdt"
 #define DRV_VERSION	"1.11"
+#define DRV_NAME_ACPI	"iTCO_wdt_wirq"
+#define TCO_CLASS	DRV_NAME
 
 /* Includes */
 #include <linux/module.h>		/* For module specific items */
@@ -68,6 +70,11 @@
 #include <linux/pm.h>			/* For suspend/resume */
 #include <linux/mfd/core.h>
 #include <linux/mfd/lpc_ich.h>
+#include <linux/nmi.h>
+#ifdef CONFIG_ACPI
+#include <linux/acpi.h>
+#include <acpi/actypes.h>
+#endif
 
 #include "iTCO_vendor.h"
 
@@ -107,6 +114,14 @@ static struct {		/* this is private data for the iTCO_wdt device */
 	bool started;
 } iTCO_wdt_private;
 
+#ifdef CONFIG_ACPI
+static const struct acpi_device_id iTCO_wdt_ids[] = {
+	{"8086229C", 0},
+	{"", 0},
+};
+MODULE_DEVICE_TABLE(acpi, iTCO_wdt_ids);
+#endif
+
 /* module parameters */
 #define WATCHDOG_TIMEOUT 30	/* 30 sec default heartbeat */
 static int heartbeat = WATCHDOG_TIMEOUT;  /* in seconds */
@@ -126,6 +141,15 @@ module_param(turn_SMI_watchdog_clear_off, int, 0);
 MODULE_PARM_DESC(turn_SMI_watchdog_clear_off,
 	"Turn off SMI clearing watchdog (depends on TCO-version)(default=1)");
 
+static bool warn_irq;
+module_param(warn_irq, bool, 0);
+MODULE_PARM_DESC(warn_irq,
+	"Dump all cpus backtraces at first watchdog timer expiration (default=0)");
+
+#ifdef CONFIG_ACPI
+static bool warn_irq_panic = IS_ENABLED(CONFIG_ITCO_WARNING_PANIC);
+#endif
+
 /*
  * Some TCO specific functions
  */
@@ -200,6 +224,37 @@ static int iTCO_wdt_unset_NO_REBOOT_bit(void)
 	return ret; /* returns: 0 = OK, -EIO = Error */
 }
 
+#ifdef CONFIG_ACPI
+static u32 iTCO_wdt_wirq(acpi_handle gpe_device, u32 gpe, void *context)
+{
+	trigger_all_cpu_backtrace();
+	if (warn_irq_panic)
+		panic("Kernel Watchdog");
+
+	return ACPI_INTERRUPT_HANDLED;
+}
+
+static int iTCO_wdt_acpi_add(struct acpi_device *device)
+{
+	unsigned long long gpe;
+	acpi_status status;
+
+	status = acpi_evaluate_integer(device->handle, "_GPE", NULL, &gpe);
+	if (ACPI_FAILURE(status))
+		return -EINVAL;
+
+	status = acpi_install_gpe_handler(NULL, gpe, ACPI_GPE_EDGE_TRIGGERED,
+					  iTCO_wdt_wirq, NULL);
+	if (ACPI_FAILURE(status))
+		return -ENODEV;
+
+	acpi_enable_gpe(NULL, gpe);
+
+	pr_debug("interrupt=SCI GPE=0x%02llx\n", gpe);
+	return 0;
+}
+#endif
+
 static int iTCO_wdt_start(struct watchdog_device *wd_dev)
 {
 	unsigned int val;
@@ -628,6 +683,17 @@ static struct platform_driver iTCO_wdt_driver = {
 	},
 };
 
+#ifdef CONFIG_ACPI
+static struct acpi_driver iTCO_wdt_acpi_driver = {
+	.name = DRV_NAME_ACPI,
+	.class = TCO_CLASS,
+	.ids = iTCO_wdt_ids,
+	.ops = {
+		.add = iTCO_wdt_acpi_add,
+	},
+};
+#endif
+
 static int __init iTCO_wdt_init_module(void)
 {
 	int err;
@@ -638,12 +704,26 @@ static int __init iTCO_wdt_init_module(void)
 	if (err)
 		return err;
 
+#ifdef CONFIG_ACPI
+	if (warn_irq) {
+		err = acpi_bus_register_driver(&iTCO_wdt_acpi_driver);
+		if (err) {
+			platform_driver_unregister(&iTCO_wdt_driver);
+			return err;
+		}
+	}
+#endif
+
 	return 0;
 }
 
 static void __exit iTCO_wdt_cleanup_module(void)
 {
 	platform_driver_unregister(&iTCO_wdt_driver);
+#ifdef CONFIG_ACPI
+	if (warn_irq)
+		acpi_bus_unregister_driver(&iTCO_wdt_acpi_driver);
+#endif
 	pr_info("Watchdog Module Unloaded\n");
 }
 
--
1.7.9.5
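
For readers following the thread, the two-stage expiration described in the documentation file of this patch can be modelled in a few lines of user-space C. This is only a sketch, not driver code: struct tco_sim and the tco_sim_* helpers are hypothetical names, and the 0.6 s tick length is an assumption inferred from the patch text (a reload value of 04h giving a 2.4-second timeout).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: one tick is 0.6 s, inferred from "04h" == 2.4 s in the patch. */
#define TCO_TICK_MS		600
#define TCO_SECOND_RELOAD	0x04	/* reload value after the first expiry */

/* Hypothetical model of the timer; these are not driver or register names. */
struct tco_sim {
	unsigned int ticks_left;
	bool timeout;		/* models the timeout bit in TCO1_STS */
	bool second_to_sts;	/* models the second_to_sts bit */
	bool warning_fired;	/* the warning interrupt (SCI) was raised */
	bool reset;		/* platform reset was initiated */
};

/* Software "kick": reload the countdown before it reaches zero. */
static void tco_sim_reload(struct tco_sim *t, unsigned int ticks)
{
	assert(ticks > 0);
	t->ticks_left = ticks;
}

/* Advance the model by one tick and apply the two-stage expiry rules. */
static void tco_sim_tick(struct tco_sim *t)
{
	if (t->reset)
		return;
	if (t->ticks_left > 0)
		t->ticks_left--;
	if (t->ticks_left > 0)
		return;

	if (!t->timeout) {
		/* First expiry: flag it, warn, and restart with 04h ticks. */
		t->timeout = true;
		t->warning_fired = true;
		t->ticks_left = TCO_SECOND_RELOAD;
	} else {
		/* Second expiry while the timeout bit is still set. */
		t->second_to_sts = true;
		t->reset = true;
	}
}

/* Length of the second stage in milliseconds (4 * 600 = 2400). */
static unsigned int tco_sim_second_stage_ms(void)
{
	return TCO_SECOND_RELOAD * TCO_TICK_MS;
}
```

The model only tracks the hardware side. It does not model software clearing the timeout bit, so a reload after the first expiry postpones the reset but does not cancel it.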



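The three CONFIG_PANIC_TIMEOUT cases listed in the commit message can also be written as a small decision function. This is a user-space sketch under stated assumptions: enum reboot_path and classify_panic_timeout() are hypothetical names, and the 2400 ms constant is the second-stage timeout quoted in the patch. The patch text does not say what happens for exactly 3 seconds; the sketch assumes the watchdog wins whenever the panic delay is at least as long as the second stage.

```c
#include <assert.h>

/* Second-stage TCO timeout from the patch description (2.4 seconds). */
#define TCO_SECOND_STAGE_MS	2400

/* Hypothetical labels for the three outcomes in the commit message. */
enum reboot_path {
	REBOOT_BY_WATCHDOG,	/* second TCO expiration resets the platform */
	REBOOT_IMMEDIATE,	/* emergency restart right away */
	REBOOT_AFTER_DELAY,	/* emergency restart after the panic delay */
};

/* Map a CONFIG_PANIC_TIMEOUT value (in seconds) to the expected outcome. */
static enum reboot_path classify_panic_timeout(int panic_timeout_s)
{
	if (panic_timeout_s < 0)
		return REBOOT_IMMEDIATE;
	if (panic_timeout_s == 0)
		return REBOOT_BY_WATCHDOG;	/* panic() does not reboot by itself */
	if (panic_timeout_s * 1000 < TCO_SECOND_STAGE_MS)
		return REBOOT_AFTER_DELAY;	/* 1 or 2 seconds */
	return REBOOT_BY_WATCHDOG;		/* watchdog expires first */
}

/* Human-readable name for each outcome. */
static const char *reboot_path_name(enum reboot_path p)
{
	switch (p) {
	case REBOOT_BY_WATCHDOG:
		return "reset by second TCO expiration";
	case REBOOT_IMMEDIATE:
		return "immediate emergency restart";
	case REBOOT_AFTER_DELAY:
		return "emergency restart after delay";
	}
	assert(0 && "unknown reboot path");
	return "";
}
```

For example, classify_panic_timeout(2) yields REBOOT_AFTER_DELAY, while classify_panic_timeout(5) yields REBOOT_BY_WATCHDOG because the 2.4-second second stage runs out before the panic delay does.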

-----Original Message-----
From: Rafael J. Wysocki [mailto:rjw@xxxxxxxxxxxxx] 
Sent: Tuesday, January 20, 2015 4:00 PM
To: Muller, Francois-nicolas
Cc: Guenter Roeck; Darren Hart; 'platform-driver-x86@xxxxxxxxxxxxxxx'; Linux ACPI Mailing List; linux-watchdog@xxxxxxxxxxxxxxx; Wim Van Sebroeck
Subject: Re: [PATCH] TCO Watchdog warning interrupt driver creation

On Wednesday, January 14, 2015 04:38:49 PM Muller, Francois-nicolas wrote:
> From 54d2ff5e13c1b35d5019b82376dabb903ebe30d6 Mon Sep 17 00:00:00 2001
> From: Francois-Nicolas Muller <francois-nicolas.muller@xxxxxxxxx>
> Date: Wed, 14 Jan 2015 14:27:43 +0100
> Subject: [PATCH] Adding TCO watchdog warning interrupt handling.
> 
> This feature is useful to root cause watchdog expiration.
> It is activated by boot parameter 'warn_irq' (disabled by default).
> 
> Upon first expiration of the TCO watchdog, a warning interrupt is 
> fired then the interrupt handler dumps registers and call stack of all available cores.
> 
> Finally panic() is called and notifies the panic handlers if any. At 
> the same time, the TCO watchdog reloads with 2.4 seconds timeout value.
> 
> When warning interrupt is enabled, platform reboot depends on 
> CONFIG_PANIC_TIMEOUT value :
> 
> - If CONFIG_PANIC_TIMEOUT is zero or greater than 3 seconds, TCO 
> watchdog will reset the platform if second expiration happens before 
> TCO has been kicked again.
> 
> - If CONFIG_PANIC_TIMEOUT is < 0, platform will reboot immediately 
> (emergency restart procedure).
> 
> - If CONFIG_PANIC_TIMEOUT is 1 or 2 seconds, platform will reboot 
> after 1 or 2 seconds delay (emergency restart procedure).
> 
> Change-Id: I009c41f2f3dc3bd091b4d2a45b4ea0be85c8ce27
> Signed-off-by: Francois-Nicolas Muller <francois-nicolas.muller@xxxxxxxxx>
> ---
>  drivers/watchdog/iTCO_wdt.c | 64 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 64 insertions(+)
> 
> diff --git a/drivers/watchdog/iTCO_wdt.c b/drivers/watchdog/iTCO_wdt.c 
> index e802a54..a8c16d2 100644
> --- a/drivers/watchdog/iTCO_wdt.c
> +++ b/drivers/watchdog/iTCO_wdt.c
> @@ -49,6 +49,8 @@
>  /* Module and version information */
>  #define DRV_NAME	"iTCO_wdt"
>  #define DRV_VERSION	"1.11"
> +#define DRV_NAME_ACPI	"iTCO_wdt_wirq"
> +#define TCO_CLASS	DRV_NAME
>  
>  /* Includes */
>  #include <linux/module.h>		/* For module specific items */
> @@ -68,6 +70,9 @@
>  #include <linux/pm.h>			/* For suspend/resume */
>  #include <linux/mfd/core.h>
>  #include <linux/mfd/lpc_ich.h>
> +#include <linux/nmi.h>
> +#include <linux/acpi.h>
> +#include <acpi/actypes.h>
>  
>  #include "iTCO_vendor.h"
>  
> @@ -107,6 +112,12 @@ static struct {		/* this is private data for the iTCO_wdt device */
>  	bool started;
>  } iTCO_wdt_private;
>  
> +static const struct acpi_device_id iTCO_wdt_ids[] = {
> +	{"8086229C", 0},

This is not a proper ACPI or PNP device name as far as I can tell.

What is it?

> +	{"", 0},
> +};
> +MODULE_DEVICE_TABLE(acpi, iTCO_wdt_ids);
> +
>  /* module parameters */
>  #define WATCHDOG_TIMEOUT 30	/* 30 sec default heartbeat */
>  static int heartbeat = WATCHDOG_TIMEOUT;  /* in seconds */
> @@ -126,6 +137,11 @@ module_param(turn_SMI_watchdog_clear_off, int, 0);
>  MODULE_PARM_DESC(turn_SMI_watchdog_clear_off,
>  	"Turn off SMI clearing watchdog (depends on TCO-version)(default=1)");
>  
> +static bool warn_irq;
> +module_param(warn_irq, bool, 0);
> +MODULE_PARM_DESC(warn_irq,
> +	"Watchdog trigs a panic at first expiration (default=0)");
> +
>  /*
>   * Some TCO specific functions
>   */
> @@ -200,6 +216,35 @@ static int iTCO_wdt_unset_NO_REBOOT_bit(void)
>  	return ret; /* returns: 0 = OK, -EIO = Error */
>  }
>  
> +static u32 iTCO_wdt_wirq(acpi_handle gpe_device, u32 gpe, void *context)
> +{
> +	trigger_all_cpu_backtrace();
> +	panic("Kernel Watchdog");
> +
> +	/* This code should not be reached */
> +	return IRQ_HANDLED;
> +}
> +
> +static int iTCO_wdt_acpi_add(struct acpi_device *device)
> +{
> +	unsigned long long gpe;
> +	acpi_status status;
> +

The code below seems to mean: "If the device _HID returns '8086229C', there should be a _GPE object under it which then returns the number of the GPE to bind to within the FADT GPE blocks."

Where is this documented?

> +	status = acpi_evaluate_integer(device->handle, "_GPE", NULL, &gpe);
> +	if (ACPI_FAILURE(status))
> +		return -EINVAL;
> +
> +	status = acpi_install_gpe_handler(NULL, gpe, ACPI_GPE_EDGE_TRIGGERED,
> +					  iTCO_wdt_wirq, NULL);
> +	if (ACPI_FAILURE(status))
> +		return -ENODEV;
> +
> +	acpi_enable_gpe(NULL, gpe);
> +
> +	pr_debug("interrupt=SCI GPE=0x%02llx", gpe);
> +	return 0;
> +}
> +
>  static int iTCO_wdt_start(struct watchdog_device *wd_dev)
>  {
>  	unsigned int val;
> @@ -628,6 +673,15 @@ static struct platform_driver iTCO_wdt_driver = {
>  	},
>  };
>  
> +static struct acpi_driver iTCO_wdt_acpi_driver = {
> +	.name = DRV_NAME_ACPI,
> +	.class = TCO_CLASS,
> +	.ids = iTCO_wdt_ids,
> +	.ops = {
> +		.add = iTCO_wdt_acpi_add,
> +	},
> +};
> +
>  static int __init iTCO_wdt_init_module(void)
>  {
>  	int err;
> @@ -638,12 +692,22 @@ static int __init iTCO_wdt_init_module(void)
>  	if (err)
>  		return err;
>  
> +	if (warn_irq) {
> +		err = acpi_bus_register_driver(&iTCO_wdt_acpi_driver);
> +		if (err) {
> +			platform_driver_unregister(&iTCO_wdt_driver);
> +			return err;
> +		}
> +	}
> +
>  	return 0;
>  }
>  
>  static void __exit iTCO_wdt_cleanup_module(void)
>  {
>  	platform_driver_unregister(&iTCO_wdt_driver);
> +	if (warn_irq)
> +		acpi_bus_unregister_driver(&iTCO_wdt_acpi_driver);
>  	pr_info("Watchdog Module Unloaded\n");
>  }
>  
> --
> 1.9.1

And does it build for CONFIG_ACPI unset?
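
On the CONFIG_ACPI question, one common way to keep a driver building with an option unset is to provide empty inline stubs for the optional part, so that the callers need no #ifdef blocks. The sketch below shows that pattern in plain user-space C. DEMO_CONFIG_ACPI and the demo_* names are hypothetical stand-ins, not kernel symbols, and this is an alternative layout for illustration rather than what either version of the patch does.

```c
#include <assert.h>
#include <stdbool.h>

static int demo_registered;	/* number of active registrations */

#ifdef DEMO_CONFIG_ACPI
/* Real path: stands in for registering the optional ACPI driver. */
static int demo_wirq_register(void)
{
	demo_registered++;
	return 0;
}

static void demo_wirq_unregister(void)
{
	assert(demo_registered > 0);
	demo_registered--;
}
#else
/* Stubs: compile to nothing, so callers stay free of #ifdef blocks. */
static inline int demo_wirq_register(void) { return 0; }
static inline void demo_wirq_unregister(void) { }
#endif

/* Module init: register the optional part only when requested. */
static int demo_init(bool warn_irq)
{
	return warn_irq ? demo_wirq_register() : 0;
}

/* Module exit: mirror of demo_init(). */
static void demo_exit(bool warn_irq)
{
	if (warn_irq)
		demo_wirq_unregister();
}
```

Built without DEMO_CONFIG_ACPI, demo_init(true) succeeds and registers nothing; built with it, the same callers exercise the real path unchanged.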