On Wed, Mar 30, 2022 at 01:11:50AM +0000, Tobias Hofmann (tohofman) wrote: > Hello all, > > I have a system with one VM running. After some time the VM needs to be > temporarily stopped and started again. This start operation fails and > from that point on any virsh command is hanging and does not execute. > This issue is reproducible and I have already figured out that restarting > libvirtd resolves this issue. However, I’m now trying to understand why > it’s getting stuck in the first place. > > I try not to get too much into detail because I think this would be more > confusing than it would actually help to understand the problem. In > general, I’m wondering what approach you should follow to debug why > libvirt gets stuck. > Online I’ve read that you should run this command: `# gdb -batch -p > $(pidof libvirtd) -ex 't a a bt f'`. I’ve run that command and attached > the output to this mail. However, I have to admit that I have no idea > what to do with it. That's perfect. In that trace we can see two threads of interest One thread is attempting to start the QEMU instance: Thread 18 (Thread 0x7f58bcf3e700 (LWP 19010)): #0 0x00007f58c9e3a9f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007f58ccc79e26 in virCondWait () from /lib64/libvirt.so.0 #2 0x00007f58a046f70b in qemuMonitorSend () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so #3 0x00007f58a04811d0 in qemuMonitorJSONCommandWithFd () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so #4 0x00007f58a0482a01 in qemuMonitorJSONSetCapabilities () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so #5 0x00007f58a044f567 in qemuConnectMonitor () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so #6 0x00007f58a04506f8 in qemuProcessWaitForMonitor () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so #7 0x00007f58a0456a52 in qemuProcessLaunch () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so #8 0x00007f58a045a4b2 in qemuProcessStart () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so #9 0x00007f58a04bd5c6 in qemuDomainObjStart.constprop.50 () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so #10 0x00007f58a04bdd26 in qemuDomainCreateWithFlags () from /usr/lib64/libvirt/connection-driver/libvirt_driver_qemu.so #11 0x00007f58cce1524c in virDomainCreate () from /lib64/libvirt.so.0 #12 0x000055e7499a3da3 in remoteDispatchDomainCreateHelper () #13 0x00007f58ccd461d5 in virNetServerProgramDispatch () from /lib64/libvirt.so.0 #14 0x00007f58ccd4c9ad in virNetServerHandleJob () from /lib64/libvirt.so.0 #15 0x00007f58ccc7a831 in virThreadPoolWorker () from /lib64/libvirt.so.0 #16 0x00007f58ccc79bb8 in virThreadHelper () from /lib64/libvirt.so.0 #17 0x00007f58c9e36e65 in start_thread () from /lib64/libpthread.so.0 #18 0x00007f58c9b5f92d in clone () from /lib64/libc.so.6 This shows we have started QEMU and are trying to connect to the monitor. We have queued some data to send and are waiting for the event loop to send it and get the reply back. The next thread of interest is Thread 1 (Thread 0x7f58cd74c8c0 (LWP 19008)): #0 0x00007f58c9e3a35e in pthread_rwlock_wrlock () from /lib64/libpthread.so.0 #1 0x00007f58a0b779e7 in nwfilterStateReload () from /usr/lib64/libvirt/connection-driver/libvirt_driver_nwfilter.so #2 0x00007f58a0b77aa5 in nwfilterFirewalldDBusFilter () from /usr/lib64/libvirt/connection-driver/libvirt_driver_nwfilter.so #3 0x00007f58ca8f6c5e in dbus_connection_dispatch () from /lib64/libdbus-1.so.3 #4 0x00007f58ccc0e421 in virDBusWatchCallback () from /lib64/libvirt.so.0 #5 0x00007f58ccc193a7 in virEventPollRunOnce () from /lib64/libvirt.so.0 #6 0x00007f58ccc17a72 in virEventRunDefaultImpl () from /lib64/libvirt.so.0 #7 0x00007f58ccd4c22d in virNetDaemonRun () from /lib64/libvirt.so.0 #8 0x000055e7499758b9 in main () This is the main event loop thread. Here we see that it is running a reload operation for the nwfilter driver. This is triggered when something restarts the firewalld daemon, or requests that it flushes & reloads its rules. This is where the problem is. The thread starting the QEMU process is likely holding a read lock on the nwfilter driver, and is waiting for the event loop thread to finish monitor I/O. The event loop thread though is blocked waiting for a write lock on the nwfilter driver, and this will never come available because the othre thread holds a read lock and is waiting for this thread. So the root problem here is that we should not be running the reload code for the nwfilter driver in the main event loop thread, because it blocks the entire event loop preventing othre threads making process, and is also dependant on those othuer threads releasing locks. I'm really surprised no one has reported this problem to us before since its existed forever AFAICT. With current libvirt it should be harmless, since each QEMU process now gets its own dedicated event loop, so would not be blocked by the main event loop being stalled. I think we should none the less fix this by moving the nwfilter reload code to a background thread, as anything running in the main event thread is expected to be completing very fast. Back to why you get into this scenario. I guess something on your system is triggering a firewall reload or restart at exactly the same time as you are starting the VM. This is pretty strange, unless you've somehow got this triggered from a hook related to the VM startup/shutdown ? With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|