On Thu, Oct 08, 2020 at 06:25:32PM +0200, Lentes, Bernd wrote:
>
>
> ----- On Oct 7, 2020, at 7:26 PM, Peter Crowther peter.crowther@xxxxxxxxxxxx wrote:
>
> > Bernd, another option would be a mismatch between the message that "virsh
> > destroy" issues and the message that force_stop() in the pacemaker agent
> > expects to receive. Pacemaker is trying to determine the success or failure of
> > the destroy based on the concatenation of the text of the exit code and the
> > text output by virsh; if either of those have changed between virsh versions,
> > and especially if virsh destroy ever exits with a status other than zero, then
> > you'll get that OCF error.
>
> > Do you know what $VIRSH_OPTIONS ends up as in your Pacemaker config,
> > particularly whether --graceful is specified?
>
> > Cheers,
>
> > - Peter
>
> that means in the end that with "virsh destroy" i can't be 100% sure
> that a domain is stopped.

Assuming you do *NOT* use the --graceful flag, then libvirt will end up
sending SIGKILL to QEMU if SIGTERM didn't cause it to quit. It is
possible that QEMU will not die immediately even with SIGKILL, but you
should get an error code back from virsh destroy in this scenario at
least. On highly overcommitted hosts, the kernel may not reap the QEMU
process quickly enough, but libvirt will definitely have delivered
SIGKILL by the time the command returns.

The only reason SIGKILL won't eventually work is if the process is
stuck in an uninterruptible sleep in kernel space. This is typically
seen, for example, when the VM is doing I/O to a disk on NFS, the NFS
server is dead, and the NFS mount is set with "hard,nointr".

There's nothing any app can do in this case really. If the host has a
dead NFS mount you really need to be fencing the entire host.
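
To make the above concrete, here is a rough Python sketch of how a
caller might tell the cases apart. It is not what the pacemaker agent
does, only an illustration. The names parse_proc_state, classify and
force_destroy, and the four result strings, are made up for this
example, and it assumes the caller already knows the QEMU PID for the
domain. The only facts it relies on are that "virsh destroy" exits
non-zero on failure and that the state field in /proc/<pid>/stat reads
"D" for uninterruptible sleep.

```python
import subprocess
from pathlib import Path
from typing import Optional


def parse_proc_state(stat_line: str) -> str:
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # The comm field is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the last ')' instead of on whitespace.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def classify(virsh_rc: int, proc_state: Optional[str]) -> str:
    """Map a 'virsh destroy' exit status and QEMU state to a next step.

    The returned labels are hypothetical, chosen for this sketch only.
    """
    if virsh_rc == 0:
        return "stopped"        # virsh reported success
    if proc_state is None:
        return "not-running"    # virsh failed, but no process is left
    if proc_state == "D":
        return "fence-host"     # uninterruptible sleep: SIGKILL cannot land
    return "retry"              # still alive but killable; try again


def force_destroy(domain: str, qemu_pid: int) -> str:
    """Run 'virsh destroy' (no --graceful) and classify the outcome."""
    result = subprocess.run(["virsh", "destroy", domain])
    try:
        stat_line = Path(f"/proc/{qemu_pid}/stat").read_text()
        state: Optional[str] = parse_proc_state(stat_line)
    except FileNotFoundError:
        state = None            # process has already gone away
    return classify(result.returncode, state)
```

The decision is keyed on the exit status and the process state, not on
the text virsh prints, so it would not be affected by message changes
between virsh versions.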

Regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|