On Tue, Jan 25, 2022 at 02:18:29PM -0500, Masayoshi Mizuma wrote:
Hello, I found an issue that libvirt isn't close an unix domain socket to connect to the qemu monitor even after the VM is destroyed. This issue happens since commit 695bdb3841 ("src: ensure GSource background unref happens in correct event loop") on the system whose glib version is 2.56. I would appreciate it if you could give any ideas to solve the issue.
My very rough and fast guess is that this happens due to our workaround while running on a system on which the fix was actually backported. You can remove the workaround (which is implemented for glib < 2.64.0), keep only the new code and see if it reproduces. I know it's not very scientific, but we already faced so many workarounds for old buggy glib that it's tiresome to try and fix all that when it goes away with update of glib. If the above does not help, then I would look at what could actually happening (maybe the unref does not get scheduled because the event loop it is requested on does not run any more, but that's just a guess).
The socket is allocated in qemuMonitorOpenUnix(), and used by the monitor->socket and monitor->watch: qemuMonitorOpen qemuMonitorOpenUnix if ((monfd = socket(AF_UNIX, SOCK_STREAM, 0)) < 0) { qemuMonitorOpenInternal mon->socket = g_socket_new_from_fd(fd, &gerr); qemuMonitorRegister mon->watch = g_socket_create_source(mon->socket, Usually, the socket is closed when the reference counter of the glib object (mon->socket and mon->watch) gets 0: qemuMonitorClose qemuMonitorUnregister vir_g_source_unref(mon->watch, mon->context); g_source_set_callback(idle, virEventGLibSourceUnrefIdle, src, NULL); virEventGLibSourceUnrefIdle g_source_unref(src); <== unref monitor->watch g_object_unref(mon->socket); <== unref monitor->socket It seems that the callback virEventGLibSourceUnrefIdle() to unref the monitor->watch doesn't work when qemuMonitorUnregister() is called via qemuProcessStop(), so the socket isn't closed. I'm not sure why the callback doesn't work at that time. I suspect that the VM is closing so the main loop of the monitor doesn't work any more. We can close the socket to add g_socket_close() before unref the mon->socket, however, it may remain a memory leak because of mon->watch (GSource object), so probably it isn't a good idea to close the socket. We can unref the mon->watch to set NULL to the second argument of vir_g_source_unref() because the default main loop still works at that time, however, I'm not sure it's an appropriate way to avoid the gobject issue which the commit solves... I found this issue on the qemu monitor, and probably the qemu agent has the same issue because the socket procedure is almost the same as the monitor. I would appreciate it if you could give any ideas to solve this issue. Following is to observe the callback working: --- diff --git a/src/util/glibcompat.c b/src/util/glibcompat.c index eb6dcc0111..b8b1770424 100644 --- a/src/util/glibcompat.c +++ b/src/util/glibcompat.c @@ -24,6 +24,9 @@ #include "glibcompat.h" +#include "virlog.h" + +VIR_LOG_INIT("util.glibcompat"); /* * Note that because of the GLIB_VERSION_MAX_ALLOWED constant in * config-post.h, allowing use of functions from newer GLib via @@ -244,6 +247,7 @@ virEventGLibSourceUnrefIdle(gpointer data) GSource *src = data; g_source_unref(src); + VIR_DEBUG("unrefed: %p", src); return FALSE; } @@ -257,6 +261,7 @@ void vir_g_source_unref(GSource *src, GMainContext *ctx) g_source_attach(idle, ctx); g_source_unref(idle); + VIR_DEBUG("unref registered: %p ctx: %p", src, ctx); } #endif --- Case the mon->watch (0xffff28008af0) is unreffed correctly (via qemuMonitorUpdateWatch()): 18:54:15.403+0000: 16845: debug : qemuMonitorEmitResume:1159 : mon=0xffff683ac020 18:54:15.403+0000: 16845: debug : qemuProcessHandleResume:713 : Transitioned guest test-vm into running state, reason 'booted', event detail 0 18:54:15.404+0000: 16845: debug : vir_g_source_unref:264 : unref registered: 0xffff28008af0 ctx: 0xffff780169a0 18:54:15.404+0000: 16845: debug : qemuMonitorJSONIOProcessLine:222 : Line [{"return": {}, "id": "libvirt-10"}] 18:54:15.404+0000: 16845: info : qemuMonitorJSONIOProcessLine:242 : QEMU_MONITOR_RECV_REPLY: mon=0xffff683ac020 reply={"return": {}, "id": "libvirt-10"} 18:54:15.404+0000: 16845: debug : vir_g_source_unref:264 : unref registered: 0xffff2819a260 ctx: 0xffff780169a0 18:54:15.404+0000: 16845: debug : virEventGLibSourceUnrefIdle:250 : unrefed: 0xffff28008af0 Case the mon->watch (0xffff7802bb30) isn't unreffed (via qemuProcessStop()): 18:54:15.642+0000: 16589: debug : qemuProcessStop:8008 : Shutting down vm=0xaaaad40edec0 name=test-vm id=3 pid=16842, reason=destroyed, asyncJob=none, flags=0x0 18:54:15.642+0000: 16589: debug : qemuDomainLogAppendMessage:6733 : Append log message (vm='test-vm' message='2022-01-25 18:54:15.642+0000: shutting down, reason=destroyed) stdioLogD=1 18:54:15.643+0000: 16589: info : qemuMonitorClose:834 : QEMU_MONITOR_CLOSE: mon=0xffff683ac020 18:54:15.643+0000: 16589: debug : vir_g_source_unref:264 : unref registered: 0xffff7802bb30 ctx: 0xffff780169a0 18:54:15.643+0000: 16845: debug : qemuMonitorJSONIOProcessEvent:209 : handle SHUTDOWN handler=0xffff4ef057c0 data=0xffff28007da0 18:54:15.643+0000: 16845: debug : qemuMonitorEmitShutdown:1132 : mon=0xffff683ac020 guest=2 18:54:15.643+0000: 16845: debug : qemuProcessHandleShutdown:572 : vm=0xaaaad40edec0 18:54:15.643+0000: 16845: debug : qemuProcessHandleShutdown:592 : Transitioned guest test-vm to shutdown state 18:54:15.643+0000: 16845: debug : qemuProcessKill:7921 : vm=0xaaaad40edec0 name=test-vm pid=16842 flags=0x2 18:54:15.643+0000: 16845: debug : qemuMonitorDispose:216 : mon=0xffff683ac020 18:54:15.643+0000: 16589: debug : qemuProcessKill:7921 : vm=0xaaaad40edec0 name=test-vm pid=16842 flags=0x5 18:54:15.643+0000: 16589: debug : qemuDomainCleanupRun:7313 : driver=0xffff281441b0, vm=test-vm 18:54:15.644+0000: 16589: debug : qemuProcessAutoDestroyRemove:8395 : vm=test-vm 18:54:15.645+0000: 16589: debug : qemuRemoveCgroup:1221 : Failed to terminate cgroup for test-vm 18:54:15.647+0000: 16589: debug : qemuDomainObjEndJob:1141 : Stopping job: destroy (async=none vm=0xaaaad40edec0 name=test-vm) Thanks! Masa
Attachment:
signature.asc
Description: PGP signature