On Tue, Apr 5, 2022 at 7:27 PM lejeczek <peljasz@xxxxxxxxxxx> wrote:
>
>
>
> On 29/03/2022 20:25, Nir Soffer wrote:
> > On Wed, Mar 16, 2022 at 1:55 PM lejeczek <peljasz@xxxxxxxxxxx> wrote:
> >>
> >>
> >> On 15/03/2022 11:21, Daniel P. Berrangé wrote:
> >>> On Tue, Mar 15, 2022 at 10:39:50AM +0000, lejeczek wrote:
> >>>> Hi guys.
> >>>>
> >>>> Without explicitly, manually using watchdog device for a VM, the VM (centOS
> >>>> 8 Stream 4.18.0-365.el8.x86_64) shows '/dev/watchdog' exists.
> >>>> To double check - 'dumpxml' does not show any such device - what kind of a
> >>>> 'watchdog' that is?
> >>> The kernel can always provide a pure software watchdog IIRC. It can be
> >>> useful if a userspace app wants a watchdog. The limitation is that it
> >>> relies on the kernel remaining functional, as there's no hardware
> >>> backing it up.
> >>>
> >>> Regards,
> >>> Daniel
> >> On a related note - with 'i6300esb' watchdog which I tested
> >> and I believe is working.
> >> I get often in my VMs from 'dmesg':
> >> ...
> >> watchdog: BUG: soft lockup - CPU#0 stuck for xxxs! [swapper/0:0]
> >> rcu: INFO: rcu_sched self-detected stall on CPU
> >> ...
> >> This above is from Ubuntu and CentOS alike and when this
> >> happens, console via VNC responds to until first 'enter'
> >> then is non-resposive.
> >> This happens after VM(s) was migrated between hosts, but
> >> anyway..
> >> I do not see what I expected from 'watchdog' - there is no
> >> action whatsoever, which should be 'reset'. VM remains in
> >> such 'frozen' state forever.
> >>
> >> any & all shared thoughts much appreciated.
> >> L.
> > You need to run some userspace tool that will open the watchdog
> > device, and pet it periodically, telling the kernel that userspace is alive.
> >
> > If this tool will stop petting the watchdog, maybe because of a soft lockup
> > or other trouble, the watchdog device will reset the VM.
> >
> > watchdog(8) may be the tool you need.
> >
> > See also
> > https://www.kernel.org/doc/Documentation/watchdog/watchdog-api.rst
> >
> > Nir
> >
> I do not think that 'i6300esb' watchog works under those
> soft-lockups, whether it's qemu or OS end I cannot say.
> With:
> <watchdog model='i6300esb' action='reset'/>
> in dom xml OS sees:
> -> $ llr /dev/watchdog*
> crw-------. 1 root root 10, 130 Apr 5 16:59 /dev/watchdog
> crw-------. 1 root root 248, 0 Apr 5 16:59 /dev/watchdog0
> crw-------. 1 root root 248, 1 Apr 5 16:59 /dev/watchdog1
> and
> -> $ wdctl
> Device:        /dev/watchdog
> Identity:      i6300ESB timer [version 0]
> Timeout:       30 seconds
> Pre-timeout:    0 seconds
> FLAG           DESCRIPTION               STATUS BOOT-STATUS
> KEEPALIVEPING  Keep alive ping reply          1           0
> MAGICCLOSE     Supports magic close char      0           0
> SETTIMEOUT     Set timeout (in seconds)       0           0
>
> If it worked, the HW watchdog, then 'i6300esb' should reset
> the VM if nothing is pinging the watchdog - I read that it's
> possible to exit 'software' watchdog and not to cause HW
> watchdog take action. I do not know it that's happening here
> when I just 'systemclt stop watchdog'
> In '/etc/watchdog.conf' I do not point to any specific
> device, which I believe makes watchdogd do its things.
> Simple test:
> -> $ cat >> /dev/watchdog
> & 'Enter' press twice
> does invoke 'reset' action and I was to believe 'wdctl' that
> is HW watchdog working. But!...
> The main issue I have are those "soft lockups" where VM's OS
> becomes frozen, but nothing from the watchdog, no action -
> though, as VM is in such frozen state host shows high CPU
> for the VM.
>
> I do not anything fancy so I really wonder if what I see is
> that rare.
> Soft-lockup occur I think usually, cannot say that uniquely
> though, during or after VM live-migration.
>
> thanks, L.

On my Fedora 35 VM, I see that /dev/watchdog0 is the right device:

# wdctl
Device:        /dev/watchdog0
Identity:      i6300ESB timer [version 0]
Timeout:       30 seconds
Pre-timeout:    0 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0

I tested this script:

# cat watchdog-test.py
import os
import time

# Opening the device arms the watchdog. The script never writes to it,
# so the timer is expected to expire.
fd = os.open("/dev/watchdog0", os.O_WRONLY)
print("Opened /dev/watchdog0")

for i in range(1, 120):
    time.sleep(1)
    print(i)

# python3 watchdog-test.py
Opened /dev/watchdog0
1
2
3
...
30

The VM was reset after 30 seconds, showing that the hardware watchdog works.

I also tested the watchdog package, with this configuration:

# cat /etc/watchdog.conf
...
watchdog-device = /dev/watchdog0

Then I started the service:

# systemctl status watchdog
● watchdog.service - watchdog daemon
     Loaded: loaded (/usr/lib/systemd/system/watchdog.service; enabled; vendor preset: disabled)
     Active: active (running) since Fri 2022-04-08 23:23:54 IDT; 7min ago
    Process: 757 ExecStart=/usr/sbin/watchdog (code=exited, status=0/SUCCESS)
   Main PID: 759 (watchdog)
      Tasks: 1 (limit: 2310)
     Memory: 616.0K
        CPU: 101ms
     CGroup: /system.slice/watchdog.service
             └─759 /usr/sbin/watchdog

Apr 08 23:23:54 fedora35 watchdog[759]: interface: no interface to check
Apr 08 23:23:54 fedora35 watchdog[759]: temperature: no sensors to check
Apr 08 23:23:54 fedora35 watchdog[759]: no test binary files
Apr 08 23:23:54 fedora35 watchdog[759]: no repair binary files
Apr 08 23:23:54 fedora35 watchdog[759]: error retry time-out = 60 seconds
Apr 08 23:23:54 fedora35 watchdog[759]: repair attempts = 1
Apr 08 23:23:54 fedora35 watchdog[759]: alive=/dev/watchdog0 heartbeat=[none] to=root no_act=no force=no
Apr 08 23:23:54 fedora35 watchdog[759]: watchdog now set to 60 seconds
Apr 08 23:23:54 fedora35 watchdog[759]: hardware watchdog identity: i6300ESB timer
Apr 08 23:23:54 fedora35 systemd[1]: Started watchdog daemon.

Finally, I paused the watchdog daemon with SIGSTOP, so it stopped
petting the device:

# kill -STOP 759

The VM was reset after about 60 seconds.

So I think it can work for your use case. To test it, you can try to find
a way to trigger a soft lockup, or maybe crash the kernel.

Nir
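
P.S. The test script above only opens the device and never pets it. For
comparison, here is a minimal sketch of the petting side, based on the
kernel watchdog API document linked earlier in the thread: any write to
the device counts as a keepalive, and writing the character 'V' just
before closing asks the driver to disarm (magic close). The helper name
pet_watchdog and its parameters are made up for illustration, and I have
not run this against the i6300esb device.

```python
import os
import time


def pet_watchdog(device, interval, count):
    """Open the watchdog device and send `count` keepalives.

    Hypothetical helper for illustration. `interval` is the pause in
    seconds between keepalives and should be well below the watchdog
    timeout.
    """
    # Opening the device arms the watchdog.
    fd = os.open(device, os.O_WRONLY)
    try:
        for _ in range(count):
            # Any write is treated as a keepalive ping.
            os.write(fd, b"\0")
            time.sleep(interval)
        # Magic close: 'V' must be the last byte written before close.
        # It only disarms the timer if the driver supports MAGICCLOSE
        # and nowayout is not set.
        os.write(fd, b"V")
    finally:
        os.close(fd)


# Example call (as root, inside the VM):
#   pet_watchdog("/dev/watchdog0", interval=10, count=6)
```

If the loop is interrupted before the final write, the device is closed
without the magic character and the watchdog should still fire, which is
the behavior you want when userspace dies unexpectedly.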