On Mon, Nov 18, 2019 at 7:40 PM Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
>
> On 11/18/19 1:52 AM, Muni Sekhar wrote:
> > On Sun, Nov 17, 2019 at 3:12 AM Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
> >>
> >> On 11/16/19 10:34 AM, Muni Sekhar wrote:
> >>> On Sat, Nov 16, 2019 at 9:31 PM Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
> >>>>
> >>>> On 11/15/19 7:03 PM, Muni Sekhar wrote:
> >>>> [ ... ]
> >>>>>>
> >>>>>> Another possibility, of course, might be to enable a hardware watchdog
> >>>>>> in your system (assuming it supports one). I personally would not trust
> >>>>>> the NMI watchdog because to detect a system hang, after all, there are
> >>>>>> situations where even NMIs no longer work.
> >>>>>
> >>>>> From dmesg , Is it possible to know whether my system supports
> >>>>> hardware watchdog or not?
> >>>>> I assume that my system supports the hardware watchdog , then how to
> >>>>> enable the hardware watchdog to debug the system freeze issues?
> >>>>>
> >>>>
> >>>> Hardware watchdog support really depends on the board type. Most PC
> >>>> mainboards support a watchdog in the Super-IO chip, but on some it is
> >>>> not wired correctly. On embedded boards it is often built into the SoC.
> >>>> The easiest way to see if you have a watchdog would be to check for the
> >>>> existence of /dev/watchdog. However, on a PC that would most likely
> >>>> not be there because the necessary module is not auto-loaded.
> >>>> If you tell us your board type, or better the Super-IO chip on the board,
> >>>> we might be able to help.
> >>>
> >>> I’m having two same configuration systems, in one system I installed
> >>> the Vanilla kernel and I see the /dev/watchdog and /dev/watchdog0
> >>> nodes. In other system I’m running with ubuntu distribution kernel,
> >>> but I don’t see any watchdog device node. So it looks like I need to
> >>> manually load the kernel module in distro kernel. Is there a way to
> >>> know what is the corresponding kernel module for /dev/watchdog node?
> >>>
> >>> # ls -l /dev/watchdog*
> >>> crw------- 1 root root 10, 130 Nov 15 17:15 /dev/watchdog
> >>> crw------- 1 root root 248, 0 Nov 15 17:15 /dev/watchdog0
> >>>
> >>> # ps -ax | grep watchdog
> >>> 678 ? S 0:00 [watchdogd]
> >>>
> >>> Regarding Super-IO chip, how to find out the Super-IO chip model?
> >>>
> >> You could try to run sensors-detect (from the "sensors" package).
> >>
> >> If you can boot a system with /dev/watchdog0, you should see the type
> >> in /sys/class/watchdog/watchdog0/identity.
> > I could not find the /sys/class/watchdog/watchdog0/identity and
> > /sys/class/watchdog/watchdog0/timeout files.
> > $ ls -l /sys/class/watchdog/watchdog0/
> > total 0
> > -r--r--r-- 1 root root 4096 Nov 18 15:12 dev
> > lrwxrwxrwx 1 root root 0 Nov 18 15:12 device -> ../../../iTCO_wdt.0.auto
> > drwxr-xr-x 2 root root 0 Nov 18 15:12 power
> > lrwxrwxrwx 1 root root 0 Nov 18 14:53 subsystem -> ../../../../../../class/watchdog
> > -rw-r--r-- 1 root root 4096 Nov 18 14:53 uevent
> >
>
> Presumably CONFIG_WATCHDOG_SYSFS is not enabled in your configuration.
>
> >>
> >> Also, you can test if the watchdog works with "sudo cat /dev/watchdog",
> >> assuming the watchdog daemon is not running. The watchdog works if the
> >> system reboots after the watchdog times out (/sys/class/watchdog/watchdog0/timeout
> >> is the timeout in seconds).
> > sudo cat /dev/watchdog perfectly rebooted my system. I don't see
> > timeout node, how do I configure the timeout value?
>
> sudo apt-get install watchdog
> man watchdog
>
> should tell you. Alternatively, enable CONFIG_WATCHDOG_SYSFS.
>
> >>
> >>>>
> >>>> Note though that this won't help to debug the problem. A hardware
> >>>> watchdog resets the system. It helps to recover, but it is not intended
> >>>> to help with debugging.
> >>> How do I use the hardware watchdog to reset my system when system is
> >>> frozen?
> >>> It helps me to collect the crashdump and finally helps me to
> >>> find the root cause for the system frozen issue.
> >>>
> >> There won't be a crashdump. It just hard-resets the system.
> > So is there any other solution to capture the crashdump or trigger
> > soft reboot once kernel is lockedup?
>
> Not that I know of. I suspect, though, that you either have a hard lockup
> where even NMI is non-operational, or NMI doesn't work in your system
> to start with.
>
> If you have nmi_watchdog=1 in your kernel command line, /proc/interrupts
> should show a non-zero number of NMI interrupts. Do you see that in your system ?

Yes, I see a non-zero number. When is it (the NMI interrupt count)
supposed to change?

$ cat /proc/interrupts | grep NMI
NMI: 4129 4153 4192 183 Non-maskable interrupts

$ dmesg | grep NMI
[ 0.402175] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
[ 0.402199] ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
[ 0.402220] ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
[ 0.402242] ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
[ 4.636467] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[ 4.658289] | NMI testsuite:
[ 13.863284] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 9.744 msecs

Also, I enabled pstore/ramoops. While testing the hardware watchdog by
running 'sudo cat /dev/watchdog', I see that the console dump is updated
on the next boot. I see the same behavior consistently.

$ cat /sys/fs/pstore/console-ramoops-0
[ 293.462623] printk: console [pstore-1] enabled
[ 293.471026] pstore: Registered ramoops as persistent store backend
[ 293.477800] ramoops: using 0x100000@0x3ff00000, ecc: 16
[ 315.461263] systemd-journald[1665]: Sent WATCHDOG=1 notification.
[ 317.447791] watchdog: watchdog0: nowayout prevents watchdog being stopped!
[ 317.456616] watchdog: watchdog0: watchdog did not stop!
No errors detected

Next, I installed the watchdog daemon and started that service before the kernel locked up.
On triggering a few tests, the kernel locked up and the hardware watchdog
triggered the reset, but in this case I don't see the console-ramoops-0
file. The only difference is that this time the 'watchdog' daemon
triggered the hardware watchdog. I am not sure why the console dump was
not updated in this scenario.

>
> Guenter

--
Thanks,
Sekhar
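
For reference, the NMI check discussed above (looking at the NMI row of /proc/interrupts) can be repeated over time to see whether the per-CPU counters are still advancing. The following is only a minimal sketch under stated assumptions: the helper names `parse_nmi_counts`, `nmi_deltas`, and `sample` are hypothetical, and the 10-second sampling interval is an arbitrary choice for illustration, not a value taken from this thread.

```python
import time


def parse_nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts from /proc/interrupts-style text.

    Hypothetical helper: it looks for the row labelled "NMI:" and collects
    the numeric columns that follow, stopping at the description text.
    """
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                if not field.isdigit():
                    break  # reached the trailing "Non-maskable interrupts" text
                counts.append(int(field))
            return counts
    return []  # no NMI row found


def nmi_deltas(before, after):
    """Per-CPU increase in the NMI count between two samples."""
    return [b - a for a, b in zip(before, after)]


def sample(path="/proc/interrupts"):
    """Read the current per-CPU NMI counts from the given file."""
    with open(path) as f:
        return parse_nmi_counts(f.read())


if __name__ == "__main__":
    first = sample()
    time.sleep(10)  # arbitrary interval, chosen only for illustration
    second = sample()
    print("NMI deltas per CPU:", nmi_deltas(first, second))
```

Running the script on the affected machine prints one delta per CPU; a delta of zero only means no NMI was counted on that CPU during the interval, and says nothing by itself about why.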