Please find hereafter a new version of the patch with a documentation file and that builds with CONFIG_ACPI unset. >From a7135e6b4bc7c91d6ac72a4f38157f7f2308615b Mon Sep 17 00:00:00 2001 From: Francois-Nicolas Muller <francois-nicolas.muller@xxxxxxxxx> Date: Tue, 20 Jan 2015 14:55:42 +0100 Subject: [PATCH] Adding TCO watchdog warning interrupt handling. This feature is useful to root cause watchdog expiration. It is activated by boot parameter 'warn_irq' (disabled by default). Upon first expiration of the TCO watchdog, a warning interrupt is fired, then the interrupt handler dumps registers and call stack of all available cpus. TCO watchdog reloads with 2.4 seconds timeout for second expiration. If CONFIG_ITCO_WARN_PANIC is set, the warning interrupt also calls panic() which notifies the panic handlers then reboots the platform, depending on CONFIG_PANIC_TIMEOUT value : - If CONFIG_PANIC_TIMEOUT is zero or greater than 3 seconds, TCO watchdog will reset the platform if second expiration happens before TCO has been kicked again. - If CONFIG_PANIC_TIMEOUT is < 0, platform will reboot immediately (emergency restart procedure). - If CONFIG_PANIC_TIMEOUT is 1 or 2 seconds, platform will reboot after 1 or 2 seconds delay (emergency restart procedure). See Documentation/watchdog/tco-wdt-warning-interrupt.txt for more details. Change-Id: I7314a50812529423b117cf28e4a195a356da2f57 Signed-off-by: Francois-Nicolas Muller <francois-nicolas.muller@xxxxxxxxx> --- .../watchdog/tco-wdt-warning-interrupt.txt | 85 ++++++++++++++++++++ drivers/watchdog/Kconfig | 13 +++ drivers/watchdog/iTCO_wdt.c | 80 ++++++++++++++++++ 3 files changed, 178 insertions(+) create mode 100644 Documentation/watchdog/tco-wdt-warning-interrupt.txt diff --git a/Documentation/watchdog/tco-wdt-warning-interrupt.txt b/Documentation/watchdog/tco-wdt-warning-interrupt.txt new file mode 100644 index 0000000..2e4eebf --- /dev/null +++ b/Documentation/watchdog/tco-wdt-warning-interrupt.txt @@ -0,0 +1,85 @@ +Last reviewed: 02/12/2015 + + TCO watchdog warning interrupt + handled by drivers/watchdog/iTCO_wdt.c + Documentation and code by + Francois-Nicolas Muller <francois-nicolas.muller@xxxxxxxxx> + + +Introduction +------------ +Intel TCO watchdog is intended to detect and recover from locks up of the +platform. It contains a countdown timer, that should be reloaded on-time by +software before reaching zero. + +If the platform locks up and is not able to reload the timer, then when it +reaches zero: +- the timer is automatically reloaded with 04h and starts decrementing again, +- timeout bit is set in TCO1_STS register, +- SMI or SCI interrupt is generated (optional). + +If it reaches zero a second time while timeout bit is set, +- second_to_sts bit is set, +- reset of the platform is initiated. + +At first timeout, the SMI (or SCI) can be used to provide debug information +about the system state and help on fixing the cause of the hang. This is the +"warning interrupt". + +Warning interrupt +----------------- +Warning interrupt handler is called when system is hung, so it is useful to +gather maximum information about system state at this point for root-causing the +issue. + +When the interrupt occurs, +- call stacks of all running cpus are dumped, +- panic() is called (optional) + +Enabling the warning interrupt +------------------------------ +Boot parameter "warn_irq" (boolean) enables warning interrupt generation at +first timer expiration (disabled by default). + +As this is a command line option, configuration can be changed easily without +building again the code. + +Enabling panic upon warning interrupt +------------------------------------- +Warning interrupt handler can call panic() when Kconfig option +CONFIG_ITCO_WARNING_PANIC is set. + +panic() call is useful in case of some panic handlers have been registered and +need to be run at this time. + +When CONFIG_ITCO_WARN_PANIC is set, +- If CONFIG_PANIC_TIMEOUT is zero or greater than 3 seconds, TCO watchdog will + reset the platform if second expiration happens before TCO has been kicked + again. +- If CONFIG_PANIC_TIMEOUT is < 0, platform will reboot immediately (emergency + restart procedure). +- If CONFIG_PANIC_TIMEOUT is 1 or 2 seconds, platform will reboot after 1 or 2 + seconds delay (emergency restart procedure). + +SCI vs SMI +---------- +For the moment, the TCO watchdog warning interrupt feature is only available for +platforms that are able to trigger a SCI upon first expiration of TCO watchdog. + +There is no support of the SMI option yet. + +ACPI configuration +------------------ +Bios is configuring the GPE associated to the warning interrupt. The driver uses +acpi tables to get the GPE number. + +This change is intended for Intel Cherrytrail platform. As TCO watchdog is part +of lpc_ich module, its _HID is used in the driver to retrieve GPE configuration +from Bios. + +If no GPE information is provided by the Bios, the interrupt is not handled and +appears in the dmesg log as a warning. Second timeout is still able to trigger a +reset. + +-- Francois-Nicolas Muller + (francois-nicolas.muller@xxxxxxxxx) diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig index 79d2589..41f3647 100644 --- a/drivers/watchdog/Kconfig +++ b/drivers/watchdog/Kconfig @@ -674,6 +674,19 @@ config ITCO_VENDOR_SUPPORT devices. At this moment we only have additional support for some SuperMicro Inc. motherboards. +config ITCO_WARNING_PANIC + bool "Intel TCO Timer/Watchdog panic on warning interrupt" + depends on ITCO_WDT && ACPI + default y + ---help--- + Force a call to panic() when TCO warning interrupt occurs. + + Warning interrupt happens if warn_irq module parameter is set and + TCO timer first expires. + + If not set, only cpu backtraces are dumped, no call to panic() and + no notification of panic are done. + config IT8712F_WDT tristate "IT8712F (Smart Guardian) Watchdog Timer" depends on X86 diff --git a/drivers/watchdog/iTCO_wdt.c b/drivers/watchdog/iTCO_wdt.c index e802a54..a25794c 100644 --- a/drivers/watchdog/iTCO_wdt.c +++ b/drivers/watchdog/iTCO_wdt.c @@ -49,6 +49,8 @@ /* Module and version information */ #define DRV_NAME "iTCO_wdt" #define DRV_VERSION "1.11" +#define DRV_NAME_ACPI "iTCO_wdt_wirq" +#define TCO_CLASS DRV_NAME /* Includes */ #include <linux/module.h> /* For module specific items */ @@ -68,6 +70,11 @@ #include <linux/pm.h> /* For suspend/resume */ #include <linux/mfd/core.h> #include <linux/mfd/lpc_ich.h> +#include <linux/nmi.h> +#ifdef CONFIG_ACPI +#include <linux/acpi.h> +#include <acpi/actypes.h> +#endif #include "iTCO_vendor.h" @@ -107,6 +114,14 @@ static struct { /* this is private data for the iTCO_wdt device */ bool started; } iTCO_wdt_private; +#ifdef CONFIG_ACPI +static const struct acpi_device_id iTCO_wdt_ids[] = { + {"8086229C", 0}, + {"", 0}, +}; +MODULE_DEVICE_TABLE(acpi, iTCO_wdt_ids); +#endif + /* module parameters */ #define WATCHDOG_TIMEOUT 30 /* 30 sec default heartbeat */ static int heartbeat = WATCHDOG_TIMEOUT; /* in seconds */ @@ -126,6 +141,15 @@ module_param(turn_SMI_watchdog_clear_off, int, 0); MODULE_PARM_DESC(turn_SMI_watchdog_clear_off, "Turn off SMI clearing watchdog (depends on TCO-version)(default=1)"); +static bool warn_irq; +module_param(warn_irq, bool, 0); +MODULE_PARM_DESC(warn_irq, + "Dump all cpus backtraces at first watchdog timer expiration (default=0)"); + +#ifdef CONFIG_ACPI +static bool warn_irq_panic = CONFIG_ITCO_WARNING_PANIC; +#endif + /* * Some TCO specific functions */ @@ -200,6 +224,37 @@ static int iTCO_wdt_unset_NO_REBOOT_bit(void) return ret; /* returns: 0 = OK, -EIO = Error */ } +#ifdef CONFIG_ACPI +static u32 iTCO_wdt_wirq(acpi_handle gpe_device, u32 gpe, void *context) +{ + trigger_all_cpu_backtrace(); + if (warn_irq_panic) + panic("Kernel Watchdog"); + + return IRQ_HANDLED; +} + +static int iTCO_wdt_acpi_add(struct acpi_device *device) +{ + unsigned long long gpe; + acpi_status status; + + status = acpi_evaluate_integer(device->handle, "_GPE", NULL, &gpe); + if (ACPI_FAILURE(status)) + return -EINVAL; + + status = acpi_install_gpe_handler(NULL, gpe, ACPI_GPE_EDGE_TRIGGERED, + iTCO_wdt_wirq, NULL); + if (ACPI_FAILURE(status)) + return -ENODEV; + + acpi_enable_gpe(NULL, gpe); + + pr_debug("interrupt=SCI GPE=0x%02llx", gpe); + return 0; +} +#endif + static int iTCO_wdt_start(struct watchdog_device *wd_dev) { unsigned int val; @@ -628,6 +683,17 @@ static struct platform_driver iTCO_wdt_driver = { }, }; +#ifdef CONFIG_ACPI +static struct acpi_driver iTCO_wdt_acpi_driver = { + .name = DRV_NAME_ACPI, + .class = TCO_CLASS, + .ids = iTCO_wdt_ids, + .ops = { + .add = iTCO_wdt_acpi_add, + }, +}; +#endif + static int __init iTCO_wdt_init_module(void) { int err; @@ -638,12 +704,26 @@ static int __init iTCO_wdt_init_module(void) if (err) return err; +#ifdef CONFIG_ACPI + if (warn_irq) { + err = acpi_bus_register_driver(&iTCO_wdt_acpi_driver); + if (err) { + platform_driver_unregister(&iTCO_wdt_driver); + return err; + } + } +#endif + return 0; } static void __exit iTCO_wdt_cleanup_module(void) { platform_driver_unregister(&iTCO_wdt_driver); +#ifdef CONFIG_ACPI + if (warn_irq) + acpi_bus_unregister_driver(&iTCO_wdt_acpi_driver); +#endif pr_info("Watchdog Module Unloaded\n"); } -- 1.7.9.5 -----Original Message----- From: Rafael J. Wysocki [mailto:rjw@xxxxxxxxxxxxx] Sent: Tuesday, January 20, 2015 4:00 PM To: Muller, Francois-nicolas Cc: Guenter Roeck; Darren Hart; 'platform-driver-x86@xxxxxxxxxxxxxxx'; Linux ACPI Mailing List; linux-watchdog@xxxxxxxxxxxxxxx; Wim Van Sebroeck Subject: Re: [PATCH] TCO Watchdog warning interrupt driver creation On Wednesday, January 14, 2015 04:38:49 PM Muller, Francois-nicolas wrote: > From 54d2ff5e13c1b35d5019b82376dabb903ebe30d6 Mon Sep 17 00:00:00 2001 > From: Francois-Nicolas Muller <francois-nicolas.muller@xxxxxxxxx> > Date: Wed, 14 Jan 2015 14:27:43 +0100 > Subject: [PATCH] Adding TCO watchdog warning interrupt handling. > > This feature is useful to root cause watchdog expiration. > It is activated by boot parameter 'warn_irq' (disabled by default). > > Upon first expiration of the TCO watchdog, a warning interrupt is > fired then the interrupt handler dumps registers and call stack of all available cores. > > Finally panic() is called and notifies the panic handlers if any. At > the same time, the TCO watchdog reloads with 2.4 seconds timeout value. > > When warning interrupt is enabled, platform reboot depends on > CONFIG_PANIC_TIMEOUT value : > > - If CONFIG_PANIC_TIMEOUT is zero or greater than 3 seconds, TCO > watchdog will reset the platform if second expiration happens before > TCO has been kicked again. > > - If CONFIG_PANIC_TIMEOUT is < 0, platform will reboot immediately > (emergency restart procedure). > > - If CONFIG_PANIC_TIMEOUT is 1 or 2 seconds, platform will reboot > after 1 or 2 seconds delay (emergency restart procedure). > > Change-Id: I009c41f2f3dc3bd091b4d2a45b4ea0be85c8ce27 > Signed-off-by: Francois-Nicolas Muller > <francois-nicolas.muller@xxxxxxxxx> > --- > drivers/watchdog/iTCO_wdt.c | 64 > +++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 64 insertions(+) > > diff --git a/drivers/watchdog/iTCO_wdt.c b/drivers/watchdog/iTCO_wdt.c > index e802a54..a8c16d2 100644 > --- a/drivers/watchdog/iTCO_wdt.c > +++ b/drivers/watchdog/iTCO_wdt.c > @@ -49,6 +49,8 @@ > /* Module and version information */ > #define DRV_NAME "iTCO_wdt" > #define DRV_VERSION "1.11" > +#define DRV_NAME_ACPI "iTCO_wdt_wirq" > +#define TCO_CLASS DRV_NAME > > /* Includes */ > #include <linux/module.h> /* For module specific items */ > @@ -68,6 +70,9 @@ > #include <linux/pm.h> /* For suspend/resume */ > #include <linux/mfd/core.h> > #include <linux/mfd/lpc_ich.h> > +#include <linux/nmi.h> > +#include <linux/acpi.h> > +#include <acpi/actypes.h> > > #include "iTCO_vendor.h" > > @@ -107,6 +112,12 @@ static struct { /* this is private data for the iTCO_wdt device */ > bool started; > } iTCO_wdt_private; > > +static const struct acpi_device_id iTCO_wdt_ids[] = { > + {"8086229C", 0}, This is not a proper ACPI or PNP device name as far as I can say. What is it? > + {"", 0}, > +}; > +MODULE_DEVICE_TABLE(acpi, iTCO_wdt_ids); > + > /* module parameters */ > #define WATCHDOG_TIMEOUT 30 /* 30 sec default heartbeat */ > static int heartbeat = WATCHDOG_TIMEOUT; /* in seconds */ @@ -126,6 > +137,11 @@ module_param(turn_SMI_watchdog_clear_off, int, 0); > MODULE_PARM_DESC(turn_SMI_watchdog_clear_off, > "Turn off SMI clearing watchdog (depends on > TCO-version)(default=1)"); > > +static bool warn_irq; > +module_param(warn_irq, bool, 0); > +MODULE_PARM_DESC(warn_irq, > + "Watchdog trigs a panic at first expiration (default=0)"); > + > /* > * Some TCO specific functions > */ > @@ -200,6 +216,35 @@ static int iTCO_wdt_unset_NO_REBOOT_bit(void) > return ret; /* returns: 0 = OK, -EIO = Error */ } > > +static u32 iTCO_wdt_wirq(acpi_handle gpe_device, u32 gpe, void > +*context) { > + trigger_all_cpu_backtrace(); > + panic("Kernel Watchdog"); > + > + /* This code should not be reached */ > + return IRQ_HANDLED; > +} > + > +static int iTCO_wdt_acpi_add(struct acpi_device *device) { > + unsigned long long gpe; > + acpi_status status; > + The code below seems to mean: "If the device _HID returns '8086229C', there should be a _GPE object under it which then returns the number of the GPE to bind to within the FADT GPE blocks." Where is this documented? > + status = acpi_evaluate_integer(device->handle, "_GPE", NULL, &gpe); > + if (ACPI_FAILURE(status)) > + return -EINVAL; > + > + status = acpi_install_gpe_handler(NULL, gpe, ACPI_GPE_EDGE_TRIGGERED, > + iTCO_wdt_wirq, NULL); > + if (ACPI_FAILURE(status)) > + return -ENODEV; > + > + acpi_enable_gpe(NULL, gpe); > + > + pr_debug("interrupt=SCI GPE=0x%02llx", gpe); > + return 0; > +} > + > static int iTCO_wdt_start(struct watchdog_device *wd_dev) { > unsigned int val; > @@ -628,6 +673,15 @@ static struct platform_driver iTCO_wdt_driver = { > }, > }; > > +static struct acpi_driver iTCO_wdt_acpi_driver = { > + .name = DRV_NAME_ACPI, > + .class = TCO_CLASS, > + .ids = iTCO_wdt_ids, > + .ops = { > + .add = iTCO_wdt_acpi_add, > + }, > +}; > + > static int __init iTCO_wdt_init_module(void) { > int err; > @@ -638,12 +692,22 @@ static int __init iTCO_wdt_init_module(void) > if (err) > return err; > > + if (warn_irq) { > + err = acpi_bus_register_driver(&iTCO_wdt_acpi_driver); > + if (err) { > + platform_driver_unregister(&iTCO_wdt_driver); > + return err; > + } > + } > + > return 0; > } > > static void __exit iTCO_wdt_cleanup_module(void) { > platform_driver_unregister(&iTCO_wdt_driver); > + if (warn_irq) > + acpi_bus_unregister_driver(&iTCO_wdt_acpi_driver); > pr_info("Watchdog Module Unloaded\n"); } > > -- > 1.9.1 And does it build for CONFIG_ACPI unset? ��.n��������+%������w��{.n�����{�����ܨ}���Ơz�j:+v�����w����ޙ��&�)ߡ�a����z�ޗ���ݢj��w�f