Using the Xeon iTCO_wdt for debugging software lockups

Hi all.

I’m one of the LEDE developers, and we see one particular box (a Lanner FW-8771 with an E3-1225 v3 processor) hang. I can’t tell whether it’s a hardware or software issue, but the fact that it doesn’t happen on any other platform (I tried a Xeon D-1518 based Supermicro 5018D-FN8T and a Lanner FW-7568 with an Atom D-525) makes me suspect hardware. I need to be sure, though.

I’ve been using iTCO_wdt watchdog to generate resets when the processor stops tickling the watchdog from user-space, and that reboots it within 60 seconds of it becoming non-responsive.
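For reference, the user-space side is roughly the following. This is a minimal sketch using the standard Linux watchdog ioctls from <linux/watchdog.h>; the helper names wdt_open() and wdt_pet() are made up for illustration and are not from any existing daemon.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/watchdog.h>

/*
 * Open a watchdog device node and request a timeout in seconds.
 * Opening the device arms the timer. Returns the fd, or -1 on failure.
 * (Hypothetical helper, for illustration only.)
 */
int wdt_open(const char *path, int timeout_s)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;

	if (ioctl(fd, WDIOC_SETTIMEOUT, &timeout_s) < 0) {
		/*
		 * "Magic close": writing 'V' before close() asks the driver
		 * to disarm, so a failed setup does not leave the timer
		 * running (ignored by the kernel when nowayout is set).
		 */
		if (write(fd, "V", 1) != 1) {
			/* nothing more we can do here */
		}
		close(fd);
		return -1;
	}
	return fd;
}

/* Tickle the watchdog once. Returns 0 on success, -1 on failure. */
int wdt_pet(int fd)
{
	return ioctl(fd, WDIOC_KEEPALIVE, 0);
}
```

The daemon opens /dev/watchdog with a 60 second timeout and calls wdt_pet() in a loop with a sleep well below that; once user-space stops being scheduled, the calls stop and the TCO timer expires.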

What I can’t figure out is how to generate an NMI instead, so that I can force a panic and see why all the processors seem to be looping or deadlocked.

I’m using 4.9.49 and therefore the 1.11 version of the driver.

Looking at it, iTCO_wdt_start() seems to call iTCO_wdt_unset_NO_REBOOT_bit() unconditionally, so you can’t choose between an SMI reset (via RSMRST# if I’ve understood the C226/PCH databook) and an NMI.

Is this intentional?

What would a patch look like that instead allows NMIs when the watchdog expires?  Would I also need to set NMI_EN=1 and GLB_SMI_EN=1, or are these already set elsewhere?

I would have thought that iTCO_wdt_unset/set_NO_REBOOT_bit() would toggle bit 9 (NMI2SMI_EN) of TCO1_CNT, but it seems to be doing something else.
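To make clear what I mean by touching bit 9, here is a sketch of the read-modify-write I had in mind. It only models the operation on a 16-bit value; the macro and function names are hypothetical, and in the driver the value would come from inw(TCO1_CNT) and be written back with outw(), which is not shown. Whether setting or clearing this bit gives the behaviour I want is exactly what I’m unsure about.

```c
#include <stdint.h>

/* Bit 9 of TCO1_CNT, NMI2SMI_EN (name and position as given above). */
#define TCO1_CNT_NMI2SMI_EN	(1u << 9)

/*
 * Return 'val' with NMI2SMI_EN set (enable != 0) or cleared (enable == 0).
 * All other bits are left untouched. Hypothetical helper, not driver code.
 */
static inline uint16_t tco1_cnt_set_nmi2smi(uint16_t val, int enable)
{
	if (enable)
		return (uint16_t)(val | TCO1_CNT_NMI2SMI_EN);
	return (uint16_t)(val & ~TCO1_CNT_NMI2SMI_EN);
}
```

For example, tco1_cnt_set_nmi2smi(0x0a00, 0) yields 0x0800: only bit 9 changes.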

What’s a quick hack to get NMIs enabled?

I thought the following might do it, but it doesn’t touch NMI_EN, GLB_SMI_EN, or NMI2SMI_EN:

--- ./drivers/watchdog/iTCO_wdt.c.orig	2017-09-13 15:13:54.000000000 -0600
+++ ./drivers/watchdog/iTCO_wdt.c	2017-09-21 11:45:28.320904534 -0600
@@ -126,6 +126,12 @@ module_param(turn_SMI_watchdog_clear_off
 MODULE_PARM_DESC(turn_SMI_watchdog_clear_off,
 	"Turn off SMI clearing watchdog (depends on TCO-version)(default=1)");
 
+static bool use_nmi = 0;
+module_param(use_nmi, bool, 0);
+MODULE_PARM_DESC(use_nmi,
+	"Use NMI when watchdog expires (default="
+				__MODULE_STRING(0) ")");
+
 /*
  * Some TCO specific functions
  */
@@ -218,7 +224,7 @@ static int iTCO_wdt_start(struct watchdo
 	iTCO_vendor_pre_start(iTCO_wdt_private.smi_res, wd_dev->timeout);
 
 	/* disable chipset's NO_REBOOT bit */
-	if (iTCO_wdt_unset_NO_REBOOT_bit()) {
+	if (!use_nmi && iTCO_wdt_unset_NO_REBOOT_bit()) {
 		spin_unlock(&iTCO_wdt_private.io_lock);
 		pr_err("failed to reset NO_REBOOT flag, reboot disabled by hardware/BIOS\n");
 		return -EIO;


