On 21.12.2017 15:57, John Ferlan wrote:
> [...]
>
>>>
>>> So short story made really long, I think the best course of action will
>>> be to add this patch and reorder the Unref()'s (adminProgram thru srv,
>>> but not dmn). It seems to resolve these corner cases, but I'm also open
>>> to other suggestions. Still need to think about it some more too before
>>> posting any patches.
>>>
>>>
>> Hi.
>>
>> I haven't grasped the whole picture yet, but I've managed to find out what
>> triggered the crash. It is not 2f3054c22, where you reordered unrefs, but
>> 1fd1b766105, which moves event unregistering from netserver client closing
>> to netserver client disposing. Before 1fd1b766105 we don't have the crash
>> because clients simply do not get disposed.
>
> Oh yeah, that one... But considering Erik's most recent response in
> this overall thread vis-a-vis the separation of "close" vs. "dispose"
> and the timing of each w/r/t Unref and Free, I think having the call to
> remoteClientFreePrivateCallbacks in remoteClientCloseFunc is perhaps
> better than in remoteClientFreeFunc.
>
>>
>> As to this patch fixing the crash, I think it is a coincidence. I want to
>> dispose of netservers early to join the RPC threads, and it turns out that
>> disposing also closes the clients, and this fixes the problem.
>>
>> Nikolay
>>
>
> With Cedric's patch in place, the virt-manager client issue is fixed. So
> that's goodness.
>
> If I then add the sleep (or usleep) into qemuConnectGetAllDomainStats as
> noted in what started this all, then I can either get libvirtd to crash
> dereferencing a NULL driver pointer or (my favorite) hang with two
> threads stuck waiting:
>
> (gdb) t a a bt
>
> Thread 5 (Thread 0x7fffe535b700 (LWP 15568)):
> #0  0x00007ffff3dc909d in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007ffff3dc1e23 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #2  0x00007ffff7299a15 in virMutexLock (m=<optimized out>)
>     at util/virthread.c:89
> #3  0x00007fffc760621e in qemuDriverLock (driver=0x7fffbc190510)
>     at qemu/qemu_conf.c:100
> #4  virQEMUDriverGetConfig (driver=driver@entry=0x7fffbc190510)
>     at qemu/qemu_conf.c:1002
> #5  0x00007fffc75dfa89 in qemuDomainObjBeginJobInternal (
>     driver=driver@entry=0x7fffbc190510, obj=obj@entry=0x7fffbc3bcd60,
>     job=job@entry=QEMU_JOB_QUERY,
>     asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_NONE)
>     at qemu/qemu_domain.c:4690
> #6  0x00007fffc75e3b2b in qemuDomainObjBeginJob (
>     driver=driver@entry=0x7fffbc190510, obj=obj@entry=0x7fffbc3bcd60,
>     job=job@entry=QEMU_JOB_QUERY) at qemu/qemu_domain.c:4842
> #7  0x00007fffc764f744 in qemuConnectGetAllDomainStats (conn=0x7fffb80009a0,
>     doms=<optimized out>, ndoms=<optimized out>, stats=<optimized out>,
>     retStats=0x7fffe535aaf0, flags=<optimized out>)
>     at qemu/qemu_driver.c:20219
> #8  0x00007ffff736430a in virDomainListGetStats (doms=0x7fffa8000950,
>     stats=0, retStats=retStats@entry=0x7fffe535aaf0, flags=0)
>     at libvirt-domain.c:11595
> #9  0x000055555557948d in remoteDispatchConnectGetAllDomainStats (
>     server=<optimized out>, msg=<optimized out>, ret=0x7fffa80008e0,
>     args=0x7fffa80008c0, rerr=0x7fffe535abf0, client=<optimized out>)
>     at remote.c:6538
> #10 remoteDispatchConnectGetAllDomainStatsHelper (server=<optimized out>,
>     client=<optimized out>, msg=<optimized out>, rerr=0x7fffe535abf0,
>     args=0x7fffa80008c0, ret=0x7fffa80008e0) at remote_dispatch.h:615
> #11 0x00007ffff73bf59c in virNetServerProgramDispatchCall (msg=0x55555586cdd0,
>     client=0x55555586bea0, server=0x55555582ed90, prog=0x555555869190)
>     at rpc/virnetserverprogram.c:437
> #12 virNetServerProgramDispatch (prog=0x555555869190,
>     server=server@entry=0x55555582ed90, client=0x55555586bea0,
>     msg=0x55555586cdd0) at rpc/virnetserverprogram.c:307
> #13 0x00005555555a9318 in virNetServerProcessMsg (msg=<optimized out>,
>     prog=<optimized out>, client=<optimized out>, srv=0x55555582ed90)
>     at rpc/virnetserver.c:148
> #14 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x55555582ed90)
>     at rpc/virnetserver.c:169
> #15 0x00007ffff729a521 in virThreadPoolWorker (
>     opaque=opaque@entry=0x55555583aa40) at util/virthreadpool.c:167
> #16 0x00007ffff7299898 in virThreadHelper (data=<optimized out>)
>     at util/virthread.c:206
> #17 0x00007ffff3dbf36d in start_thread () from /lib64/libpthread.so.0
> #18 0x00007ffff3af3e1f in clone () from /lib64/libc.so.6
>
> Thread 1 (Thread 0x7ffff7ef9d80 (LWP 15561)):
> #0  0x00007ffff3dc590b in pthread_cond_wait@@GLIBC_2.3.2 ()
>     from /lib64/libpthread.so.0
> #1  0x00007ffff7299af6 in virCondWait (c=<optimized out>, m=<optimized out>)
>     at util/virthread.c:154
> #2  0x00007ffff729a760 in virThreadPoolFree (pool=<optimized out>)
>     at util/virthreadpool.c:290
> #3  0x00005555555a8ec2 in virNetServerDispose (obj=0x55555582ed90)
>     at rpc/virnetserver.c:767
> #4  0x00007ffff727923b in virObjectUnref (anyobj=<optimized out>)
>     at util/virobject.c:356
> #5  0x00007ffff724f069 in virHashFree (table=<optimized out>)
>     at util/virhash.c:318
> #6  0x00007ffff73b8295 in virNetDaemonDispose (obj=0x55555582eb10)
>     at rpc/virnetdaemon.c:105
> #7  0x00007ffff727923b in virObjectUnref (anyobj=<optimized out>)
>     at util/virobject.c:356
> #8  0x000055555556f2eb in main (argc=<optimized out>, argv=<optimized out>)
>     at libvirtd.c:1524
> (gdb)
>
>
> Of course this could be a red herring because sleep/usleep and the
> condition handling nature of these jobs could be interfering with one
> another.
>
> Still, adding the "virHashRemoveAll(dmn->servers);" into
> virNetDaemonClose doesn't help the situation as I can still either crash
> randomly or hang, so I'm less convinced this would really fix anything.
> It does change the "nature" of the hung thread stack trace though, as
> the second thread is now:

virHashRemoveAll is not enough now. Due to the unref reordering, the last
ref to @srv is unrefed after virStateCleanup. So we need to
virObjectUnref(srv|srvAdm) before virStateCleanup. Or we can call
virThreadPoolFree from virNetServerClose (as in the first version of the
patch and as Erik suggests) instead of virHashRemoveAll. Rough sketches of
both options are at the end of this mail.

> Thread 1 (Thread 0x7ffff7ef9d80 (LWP 20159)):
> #0  0x00007ffff3dc590b in pthread_cond_wait@@GLIBC_2.3.2 ()
>     from /lib64/libpthread.so.0
> #1  0x00007ffff7299b06 in virCondWait (c=<optimized out>, m=<optimized out>)
>     at util/virthread.c:154
> #2  0x00007ffff729a770 in virThreadPoolFree (pool=<optimized out>)
>     at util/virthreadpool.c:290
> #3  0x00005555555a8ec2 in virNetServerDispose (obj=0x55555582ed90)
>     at rpc/virnetserver.c:767
> #4  0x00007ffff727924b in virObjectUnref (anyobj=<optimized out>)
>     at util/virobject.c:356
> #5  0x000055555556f2e3 in main (argc=<optimized out>, argv=<optimized out>)
>     at libvirtd.c:1523
> (gdb)
>
>
> So we still haven't found the "root cause", but I think Erik is on to
> something in the other part of this thread. I'll go there.
>
>
> John
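
To make the first option concrete, here is a minimal sketch of the tail of
libvirtd's cleanup path with the unrefs moved ahead of virStateCleanup. It
only uses names already mentioned in this thread (dmn, srv, srvAdm,
virNetDaemonClose, virHashRemoveAll, virStateCleanup); the remaining
Unref()s and shutdown steps of the real main() are elided, so read it as an
illustration of the intended ordering rather than a patch:

 cleanup:
    /* Close the daemon first; with the virHashRemoveAll(dmn->servers)
     * call added to virNetDaemonClose(), this also drops the daemon's
     * own references to the servers. */
    virNetDaemonClose(dmn);

    /* Drop main()'s references while the drivers are still alive, so
     * virNetServerDispose() -> virThreadPoolFree() joins the RPC worker
     * threads before any driver state goes away. */
    virObjectUnref(srvAdm);
    virObjectUnref(srv);

    /* Only now tear the drivers down: no worker can still be running a
     * job (e.g. qemuConnectGetAllDomainStats) that touches them. */
    virStateCleanup();

    virObjectUnref(dmn);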
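
And a sketch of the second option: join the worker pool already in
virNetServerClose() instead of waiting for virNetServerDispose(), so that
virNetDaemonClose(dmn) alone is enough and the later unref order stops
mattering. The body below is my guess at the shape of such a change (it
assumes the srv->workers pool and the existing virNetServerServiceClose()
helper, and leaves out the client handling discussed elsewhere in this
series); the real function in rpc/virnetserver.c will differ in detail:

void
virNetServerClose(virNetServerPtr srv)
{
    size_t i;

    if (!srv)
        return;

    virObjectLock(srv);

    /* Stop accepting new work: close the listening services (closing
     * the clients as well is left out of this sketch). */
    for (i = 0; i < srv->nservices; i++)
        virNetServerServiceClose(srv->services[i]);

    virObjectUnlock(srv);

    /* Join the RPC workers without holding the server lock, so a worker
     * that still needs it to finish its current job can take it.
     * Clearing the pointer turns the later virThreadPoolFree() call in
     * virNetServerDispose() into a harmless no-op. */
    virThreadPoolFree(srv->workers);
    srv->workers = NULL;
}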