[...] >> >> Now you've lost me. What are the back traces? and now does one >> reasonably reproduce? Are you trying to advocate here for [2] to be >> reviewed/accepted? > > Sorry for that. Unfortunately back trace was only emailed in the first version > of series [3]. The reproducer is there too. > > I'm not here for [2]. It addresses a different issue. However I want to > make some progress on that series too. > > [3] https://www.redhat.com/archives/libvir-list/2017-September/msg01006.html > Ahh, sigh... I should have been more "proactive" in ensuring that I had gone back through the above test when two of the patches I had posted related to the changes in this code were not ACK'd. Those two patches are: https://www.redhat.com/archives/libvir-list/2017-November/msg00296.html and https://www.redhat.com/archives/libvir-list/2017-November/msg00297.html What those did was perform the Unref from libvirtd as soon as the AddServer or AddProgram added to the respective hash table (and of course incr the refcnt). The result is the Unref for each was done *prior* to the virNetDaemonClose and prior to the virStateCleanup which solves the late run after Close when Dispose runs. What jiggled my memory on this was today there was an OFTC IRC posting regarding a virt-manager crash (see pastebin from cbosdonnat at http://paste.opensuse.org/44522138) as a result of incorrect ordering of daemon unref and running virStateCleanup. A copy/paste of the subsequent conversation is: <cbosdonnat> I'm seeing a crash here when libvirtd is stopping and a client (namely virt-manager) calls networkConnectNetworkEventDeregisterAny().... is it OK to just goto cleanup if the driver is NULL in that function? <danpb> cbosdonnat: you mean the driver is being torn down in virStateCleanup ? and an API is still allowed to be run ? <cbosdonnat> danpb, looks like that indeed <cbosdonnat> where would the API call be rejected otherwise? <danpb> that seems a much more general problem - we should not call virStateCleanup until we've killed off all the API worker threads <danpb> i kind of thought we already did that <danpb> is it perhaps that we're killing off active clients and as part of that we're cleaning up events they have registered ? <cbosdonnat> danpb, here is the backtrace of the crash: http://paste.opensuse.org/44522138 <cbosdonnat> so seems like that indeed <danpb> oh yeah, we're killing off clients <danpb> we need to explicitly be able to kill off clients before shutting down drivers <danpb> surprised we've not seen this before to be honest <danpb> i wonder if some refactoring has introduced this problem recently <cbosdonnat> are those done in separate threads? <danpb> i can't rememeber to be honest <cbosdonnat> I'll chase the order of these then <cbosdonnat> danpb, the state cleanup happens before the daemon unref... looks like we'll need to purge the daemon before that <cbosdonnat> danpb, would that be completely silly to move the client closing code to virNetServerClose()? <danpb> that seems pretty sensible to me actually <danpb> logically they feel related actions <cbosdonnat> we can keep the unref of them where it is though if it makes more sense <cbosdonnat> yep So yes, the "refactoring" that caused this is commit id '2f3054c22'. As a test, I altered the code to perform the all the Unref's before the virStateCleanup and that "resolved" the virt-manager crash; however, that would seem to conflict with commit id '85c3a1820' which noted that performing the last Unref(dmn) prior to virStateCleanup would cause issues for daemonInhibitCallback. So I moved the Unref(dmn) after virStateCleanup... That of course led to the virt-manager issue. So I figured I'd give adding this patch in and that then fixed the virt-manager crash (obviously as we'd be Unref'ing all those servers we added). So after all that if I add that sleep into domstats as shown in the [3] link from above - I don't get a crash, but it also seems to cause the SIGTERM to be ignored at least the first time through. Running a second SIGTERM does the proper kill. So short story made really long, I think the best course of action will be to add this patch and reorder the Unref()'s (adminProgram thru srv, but not dmn). It seems to resolve these corner cases, but I'm also open to other suggestions. Still need to think about it some more too before posting any patches. John [...] -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list