Re: "default" watchdog device - ?

Nir Soffer <nsoffer@xxxxxxxxxx> · Fri, 8 Apr 2022 23:35:51 +0300

On Tue, Apr 5, 2022 at 7:27 PM lejeczek <peljasz@xxxxxxxxxxx> wrote:
>
>
>
> On 29/03/2022 20:25, Nir Soffer wrote:
> > On Wed, Mar 16, 2022 at 1:55 PM lejeczek <peljasz@xxxxxxxxxxx> wrote:
> >>
> >>
> >> On 15/03/2022 11:21, Daniel P. Berrangé wrote:
> >>> On Tue, Mar 15, 2022 at 10:39:50AM +0000, lejeczek wrote:
> >>>> Hi guys.
> >>>>
> >>>> Without explicitly, manually using watchdog device for a VM, the VM (centOS
> >>>> 8 Stream 4.18.0-365.el8.x86_64) shows '/dev/watchdog' exists.
> >>>> To double check - 'dumpxml' does not show any such device - what kind of a
> >>>> 'watchdog' that is?
> >>> The kernel can always provide a pure software watchdog IIRC. It can be
> >>> useful if a userspace app wants a watchdog. The limitation is that it
> >>> relies on the kernel remaining functional, as there's no hardware
> >>> backing it up.
> >>>
> >>> Regards,
> >>> Daniel
> >> On a related note - with 'i6300esb' watchdog which I tested
> >> and I believe is working.
> >> I get often in my VMs from 'dmesg':
> >> ...
> >> watchdog: BUG: soft lockup - CPU#0 stuck for xxxs! [swapper/0:0]
> >> rcu: INFO: rcu_sched self-detected stall on CPU
> >> ...
> >> This above is from Ubuntu and CentOS alike and when this
> >> happens, console via VNC responds until the first 'enter',
> >> then is non-responsive.
> >> This happens after VM(s) was migrated between hosts, but
> >> anyway..
> >> I do not see what I expected from 'watchdog' - there is no
> >> action whatsoever, which should be 'reset'. VM remains in
> >> such 'frozen' state forever.
> >>
> >> any & all shared thoughts much appreciated.
> >> L.
> > You need to run some userspace tool that will open the watchdog
> > device, and pet it periodically, telling the kernel that userspace is alive.
> >
> > If this tool will stop petting the watchdog, maybe because of a soft lockup
> > or other trouble, the watchdog device will reset the VM.
> >
> > watchdog(8) may be the tool you need.
> >
> > See also
> > https://www.kernel.org/doc/Documentation/watchdog/watchdog-api.rst
> >
> > Nir
> >
> I do not think that 'i6300esb' watchdog works under those
> soft-lockups, whether it's qemu or OS end I cannot say.
> With:
>      <watchdog model='i6300esb' action='reset'/>
> in dom xml OS sees:
> -> $ llr /dev/watchdog*
> crw-------. 1 root root  10, 130 Apr  5 16:59 /dev/watchdog
> crw-------. 1 root root 248,   0 Apr  5 16:59 /dev/watchdog0
> crw-------. 1 root root 248,   1 Apr  5 16:59 /dev/watchdog1
> and
> -> $ wdctl
> Device:        /dev/watchdog
> Identity:      i6300ESB timer [version 0]
> Timeout:       30 seconds
> Pre-timeout:    0 seconds
> FLAG           DESCRIPTION               STATUS BOOT-STATUS
> KEEPALIVEPING  Keep alive ping reply          1           0
> MAGICCLOSE     Supports magic close char      0           0
> SETTIMEOUT     Set timeout (in seconds)       0           0
>
> If it worked, the HW watchdog, then 'i6300esb' should reset
> the VM if nothing is pinging the watchdog - I read that it's
> possible to exit 'software' watchdog and not to cause HW
> watchdog to take action. I do not know if that's happening here
> when I just 'systemctl stop watchdog'.
> In '/etc/watchdog.conf' I do not point to any specific
> device, which I believe makes watchdogd do its things.
> Simple test:
> -> $ cat >> /dev/watchdog
> & 'Enter' press twice
> does invoke 'reset' action and I was to believe 'wdctl' that
> is HW watchdog working. But!...
> The main issue I have are those "soft lockups" where VM's OS
> becomes frozen, but nothing from the watchdog, no action -
> though, as VM is in such frozen state host shows high CPU
> for the VM.
>
> I do not do anything fancy so I really wonder if what I see is
> that rare.
> Soft-lockup occur I think usually, cannot say that uniquely
> though, during or after VM live-migration.
>
> thanks, L.

On my Fedora 35 VM, wdctl reports /dev/watchdog0 as the i6300ESB device:

# wdctl
Device:        /dev/watchdog0
Identity:      i6300ESB timer [version 0]
Timeout:       30 seconds
Pre-timeout:    0 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0
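
Since your guest shows /dev/watchdog, /dev/watchdog0 and /dev/watchdog1, it
may help to check which node is backed by which driver. Here is a minimal
sketch, assuming the guest kernel was built with CONFIG_WATCHDOG_SYSFS so
that /sys/class/watchdog/watchdogN/identity exists. The helper name
list_watchdogs is made up for illustration; reading sysfs does not open or
arm any watchdog device.

```python
import os


def list_watchdogs(sysfs_root="/sys/class/watchdog"):
    """Map each watchdog device name to its identity string.

    The value is None when the identity attribute is missing, for example
    on kernels built without CONFIG_WATCHDOG_SYSFS (an assumption here).
    """
    result = {}
    if not os.path.isdir(sysfs_root):
        return result
    for name in sorted(os.listdir(sysfs_root)):
        identity_path = os.path.join(sysfs_root, name, "identity")
        try:
            with open(identity_path) as f:
                result[name] = f.read().strip()
        except OSError:
            # Attribute not present or not readable.
            result[name] = None
    return result
```

Printing list_watchdogs() inside the guest should show which watchdogN
reports the i6300ESB identity; that is the node to put in
/etc/watchdog.conf.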

I tested this script:

# cat watchdog-test.py
import os
import time

# Opening the device arms the watchdog; this script never pets it.
fd = os.open("/dev/watchdog0", os.O_WRONLY)

print("Opened /dev/watchdog0")

# Count the seconds until the watchdog fires and resets the VM.
for i in range(1, 120):
    time.sleep(1)
    print(i)

# python3 watchdog-test.py
Opened /dev/watchdog0
1
2
3
...
30

The VM was reset after 30 seconds, showing that the hardware watchdog works.
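
For contrast, here is a rough sketch of the opposite case, a process that
does pet the watchdog. It relies on two things from the kernel watchdog API
document linked earlier in the thread: any write to the device counts as a
keepalive, and writing the character 'V' just before closing disarms the
watchdog on drivers that support magic close (wdctl lists MAGICCLOSE for
this device). The function name, the interval and the count are arbitrary
choices for illustration, not what watchdog(8) does internally.

```python
import os
import time


def pet_watchdog(device="/dev/watchdog0", interval=10, count=6):
    """Open the watchdog, pet it `count` times, then disarm it and close."""
    # Opening the device arms the watchdog.
    fd = os.open(device, os.O_WRONLY)
    try:
        for _ in range(count):
            # Any write is treated as a keepalive ping.
            os.write(fd, b"\0")
            time.sleep(interval)
        # Magic close: ask the driver to disarm the watchdog on close.
        os.write(fd, b"V")
    finally:
        # If an error skipped the 'V' write, the watchdog stays armed.
        os.close(fd)
```

Running pet_watchdog() as root should keep the VM alive for about a minute
and leave it running afterwards. If the process is killed or suspended
midway, the 'V' is never written and the VM should be reset once the
timeout expires.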

I also tested the watchdog package, with this configuration:

# cat /etc/watchdog.conf
...
watchdog-device = /dev/watchdog0

Then I started the service and checked its status:

# systemctl status watchdog
● watchdog.service - watchdog daemon
     Loaded: loaded (/usr/lib/systemd/system/watchdog.service; enabled; vendor preset: disabled)
     Active: active (running) since Fri 2022-04-08 23:23:54 IDT; 7min ago
    Process: 757 ExecStart=/usr/sbin/watchdog (code=exited, status=0/SUCCESS)
   Main PID: 759 (watchdog)
      Tasks: 1 (limit: 2310)
     Memory: 616.0K
        CPU: 101ms
     CGroup: /system.slice/watchdog.service
             └─759 /usr/sbin/watchdog

Apr 08 23:23:54 fedora35 watchdog[759]:  interface: no interface to check
Apr 08 23:23:54 fedora35 watchdog[759]:  temperature: no sensors to check
Apr 08 23:23:54 fedora35 watchdog[759]:  no test binary files
Apr 08 23:23:54 fedora35 watchdog[759]:  no repair binary files
Apr 08 23:23:54 fedora35 watchdog[759]:  error retry time-out = 60 seconds
Apr 08 23:23:54 fedora35 watchdog[759]:  repair attempts = 1
Apr 08 23:23:54 fedora35 watchdog[759]:  alive=/dev/watchdog0 heartbeat=[none] to=root no_act=no force=no
Apr 08 23:23:54 fedora35 watchdog[759]: watchdog now set to 60 seconds
Apr 08 23:23:54 fedora35 watchdog[759]: hardware watchdog identity: i6300ESB timer
Apr 08 23:23:54 fedora35 systemd[1]: Started watchdog daemon.

Finally, I suspended the watchdog daemon with SIGSTOP, so it stops petting
the device but keeps it open:

# kill -STOP 759

The VM was reset in about 60 seconds, matching the timeout the daemon set.

So I think it can work for your use case.

To test this, you can try to find a way to trigger a soft lockup, or maybe
crash the kernel.
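
For the crash variant, one option is the magic SysRq interface: writing "c"
to /proc/sysrq-trigger as root crashes the kernel immediately. A minimal
sketch follows; the function name is made up, and I have not verified how
the guest behaves afterwards. It may depend on kdump and on the kernel
panic settings whether the watchdog or something else restarts the VM.
Only run this in a throwaway VM.

```python
def crash_kernel(sysrq_trigger="/proc/sysrq-trigger"):
    """Crash the running kernel via magic SysRq 'c'. Destructive!"""
    # Requires root. The write does not return on a real system,
    # because the kernel panics while handling it.
    with open(sysrq_trigger, "w") as f:
        f.write("c")
```

With the watchdog daemon running, call crash_kernel() in the guest and
watch from the host whether the VM is reset after the watchdog timeout.
This does not reproduce a soft lockup, only a dead kernel that no longer
pets the watchdog.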

Nir