Re: watchdog: how to enable?

On Mon, Nov 18, 2019 at 7:40 PM Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
>
> On 11/18/19 1:52 AM, Muni Sekhar wrote:
> > On Sun, Nov 17, 2019 at 3:12 AM Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
> >>
> >> On 11/16/19 10:34 AM, Muni Sekhar wrote:
> >>> On Sat, Nov 16, 2019 at 9:31 PM Guenter Roeck <linux@xxxxxxxxxxxx> wrote:
> >>>>
> >>>> On 11/15/19 7:03 PM, Muni Sekhar wrote:
> >>>> [ ... ]
> >>>>>>
> >>>>>> Another possibility, of course, might be to enable a hardware watchdog
> >>>>>> in your system (assuming it supports one). I personally would not trust
> >>>>>> the NMI watchdog to detect a system hang because, after all, there are
> >>>>>> situations where even NMIs no longer work.
> >>>>>
> >>>>> From dmesg, is it possible to know whether my system supports a
> >>>>> hardware watchdog or not?
> >>>>> Assuming my system supports a hardware watchdog, how do I
> >>>>> enable it to debug the system freeze issues?
> >>>>>
> >>>>
> >>>> Hardware watchdog support really depends on the board type. Most PC
> >>>> mainboards support a watchdog in the Super-IO chip, but on some it is
> >>>> not wired correctly. On embedded boards it is often built into the SoC.
> >>>> The easiest way to see if you have a watchdog would be to check for the
> >>>> existence of /dev/watchdog. However, on a PC that would most likely
> >>>> not be there because the necessary module is not auto-loaded.
> >>>> If you tell us your board type, or better the Super-IO chip on the board,
> >>>> we might be able to help.
> >>>
> >>> I have two systems with the same configuration. On one system I installed
> >>> a vanilla kernel and I see the /dev/watchdog and /dev/watchdog0
> >>> nodes. On the other system I’m running the Ubuntu distribution kernel,
> >>> and I don’t see any watchdog device node. So it looks like I need to
> >>> manually load the kernel module in the distro kernel. Is there a way to
> >>> find out which kernel module corresponds to the /dev/watchdog node?
> >>>
> >>> # ls -l /dev/watchdog*
> >>> crw------- 1 root root  10, 130 Nov 15 17:15 /dev/watchdog
> >>> crw------- 1 root root 248,   0 Nov 15 17:15 /dev/watchdog0
> >>>
> >>> # ps -ax | grep watchdog
> >>>     678 ?        S      0:00 [watchdogd]
> >>>
> >>> Regarding the Super-IO chip, how do I find out the Super-IO chip model?
> >>>
> >> You could try to run sensors-detect (from the "sensors" package).
> >>
> >> If you can boot a system with /dev/watchdog0, you should see the type
> >> in /sys/class/watchdog/watchdog0/identity.
> > I could not find the /sys/class/watchdog/watchdog0/identity and
> > /sys/class/watchdog/watchdog0/timeout files.
> > $ ls -l /sys/class/watchdog/watchdog0/
> > total 0
> > -r--r--r-- 1 root root 4096 Nov 18 15:12 dev
> > lrwxrwxrwx 1 root root    0 Nov 18 15:12 device -> ../../../iTCO_wdt.0.auto
> > drwxr-xr-x 2 root root    0 Nov 18 15:12 power
> > lrwxrwxrwx 1 root root    0 Nov 18 14:53 subsystem -> ../../../../../../class/watchdog
> > -rw-r--r-- 1 root root 4096 Nov 18 14:53 uevent
> >
>
> Presumably CONFIG_WATCHDOG_SYSFS is not enabled in your configuration.
>
> >>
> >> Also, you can test if the watchdog works with "sudo cat /dev/watchdog",
> >> assuming the watchdog daemon is not running. The watchdog works if the
> >> system reboots after the watchdog times out (/sys/class/watchdog/watchdog0/timeout
> >> is the timeout in seconds).
> > 'sudo cat /dev/watchdog' rebooted my system as expected. I don't see
> > a timeout node, so how do I configure the timeout value?
>
> sudo apt-get install watchdog
> man watchdog
>
> should tell you. Alternatively, enable CONFIG_WATCHDOG_SYSFS.
>
> >>
> >>>>
> >>>> Note though that this won't help to debug the problem. A hardware
> >>>> watchdog resets the system. It helps to recover, but it is not intended
> >>>> to help with debugging.
> >>> How do I use the hardware watchdog to reset my system when the system is
> >>> frozen? It would help me collect a crashdump and, finally, find the
> >>> root cause of the system freeze.
> >>>
> >> There won't be a crashdump. It just hard-resets the system.
> > So is there any other solution to capture a crashdump or trigger
> > a soft reboot once the kernel is locked up?
>
> Not that I know of. I suspect, though, that you either have a hard lockup
> where even NMI is non-operational, or NMI doesn't work in your system
> to start with.
>
> If you have nmi_watchdog=1 in your kernel command line, /proc/interrupts
> should show a non-zero number of NMI interrupts. Do you see that in your system ?

Yes, I see a non-zero number. When is it (the NMI interrupt count) supposed to change?

$ cat /proc/interrupts | grep NMI
 NMI:       4129       4153       4192        183   Non-maskable interrupts
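
To see whether the count is actually moving, the per-CPU NMI counters can be
summed and sampled twice. The following is only a rough sketch: the helper
names nmi_total and nmi_delta are made up for illustration, and it assumes the
NMI row has the layout shown above (label, one number per CPU, description).

```shell
#!/bin/sh
# Sum the per-CPU NMI counters from a /proc/interrupts-style file.
# The file argument defaults to /proc/interrupts (hypothetical helper).
nmi_total() {
    awk '$1 == "NMI:" {
        s = 0
        # Add up numeric columns; stop at the text description.
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) s += $i
        print s
    }' "${1:-/proc/interrupts}"
}

# Print how much the summed NMI count grew over an interval in seconds
# (default 10). Purely read-only; it never touches /dev/watchdog.
nmi_delta() {
    file=${1:-/proc/interrupts}
    secs=${2:-10}
    before=$(nmi_total "$file")
    sleep "$secs"
    after=$(nmi_total "$file")
    echo $((after - before))
}
```

Running "nmi_delta /proc/interrupts 30" on a busy system and getting 0 would
suggest the counters are not advancing. An idle system may legitimately show
little or no growth, so the result is a hint rather than proof.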

$ dmesg | grep NMI
[    0.402175] ACPI: LAPIC_NMI (acpi_id[0x01] high edge lint[0x1])
[    0.402199] ACPI: LAPIC_NMI (acpi_id[0x02] high edge lint[0x1])
[    0.402220] ACPI: LAPIC_NMI (acpi_id[0x03] high edge lint[0x1])
[    0.402242] ACPI: LAPIC_NMI (acpi_id[0x04] high edge lint[0x1])
[    4.636467] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[    4.658289] | NMI testsuite:
[   13.863284] INFO: NMI handler (kgdb_nmi_handler) took too long to run: 9.744 msecs

Also, I enabled pstore/ramoops. While testing the hardware watchdog by
running 'sudo cat /dev/watchdog', I see that the console dump is updated
after the next boot. I see the same behavior consistently.

$ cat /sys/fs/pstore/console-ramoops-0
[  293.462623] printk: console [pstore-1] enabled
[  293.471026] pstore: Registered ramoops as persistent store backend
[  293.477800] ramoops: using 0x100000@0x3ff00000, ecc: 16
[  315.461263] systemd-journald[1665]: Sent WATCHDOG=1 notification.
[  317.447791] watchdog: watchdog0: nowayout prevents watchdog being stopped!
[  317.456616] watchdog: watchdog0: watchdog did not stop!
No errors detected
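
To compare what pstore holds from one boot to the next, the records can be
copied out right after each reboot. This is a minimal sketch, not a tested
procedure: the helper name save_pstore and the default destination
/var/tmp/pstore-saved are assumptions chosen for illustration.

```shell
#!/bin/sh
# Copy every record in a pstore directory into a timestamped folder
# and print that folder's path. Returns 1 if the store is empty.
save_pstore() {
    src=${1:-/sys/fs/pstore}
    dest=${2:-/var/tmp/pstore-saved}/$(date +%Y%m%d-%H%M%S)
    # Expand the record list; an unmatched glob stays literal.
    set -- "$src"/*
    if [ ! -e "$1" ]; then
        echo "no pstore records in $src" >&2
        return 1
    fi
    mkdir -p "$dest" && cp -p "$@" "$dest"/ && echo "$dest"
}
```

Calling "sudo sh -c '. ./save_pstore.sh; save_pstore'" after each reset (the
script file name is again just an example) would leave one folder per boot, so
a missing console-ramoops-0 shows up as an empty or absent folder.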

Now I installed the watchdog daemon and started that service before
the kernel locks up. After triggering a few tests, the kernel locked up and
the hardware watchdog triggered the reset, but in this case I don't see the
console-ramoops-0 file. The only difference is that this time the hardware
watchdog was armed by the 'watchdog' daemon. Any idea why the console dump
is not updated in this scenario?


>
> Guenter



-- 
Thanks,
Sekhar



