On 27.04.2020 18:54, Daniel P. Berrangé wrote: > We got a new BZ filed about a libvirtd crash in shutdown > > https://bugzilla.redhat.com/show_bug.cgi?id=1828207 > > We can see from the stack trace that the "interface" driver is in > the middle of servicing an RPC call for virConnectListAllInterfaces() > > Meanwhile the libvirtd daemon is doing virObjectUnref(dmn) on the > virNetDaemonPtr object. > > The fact that it is doing this unref, means that it must have already > call virStateCleanup(), given the code sequence: > > > /* Run event loop. */ > virNetDaemonRun(dmn); > > ret = 0; > > virHookCall(VIR_HOOK_DRIVER_DAEMON, "-", VIR_HOOK_DAEMON_OP_SHUTDOWN, > 0, "shutdown", NULL, NULL); > > cleanup: > /* Keep cleanup order in inverse order of startup */ > virNetDaemonClose(dmn); > > virNetlinkEventServiceStopAll(); > > if (driversInitialized) { > /* NB: Possible issue with timing window between driversInitialized > * setting if virNetlinkEventServerStart fails */ > driversInitialized = false; > virStateCleanup(); > } > > virObjectUnref(adminProgram); > virObjectUnref(srvAdm); > virObjectUnref(qemuProgram); > virObjectUnref(lxcProgram); > virObjectUnref(remoteProgram); > virObjectUnref(srv); > virObjectUnref(dmn); > > > Unless I'm missing something non-obvious, this cleanup code path is > inherantly broken & racy. When virNetDaemonRun() returns the RPC > worker threads are all still active. They are all liable to still > be executing RPC calls, which means any of the drivers may be in > use. So calling virStateCleanup() is an inherantly dangerous > thing to do. There is the further complication that once we have > exitted the main loop we may prevent the RPC calls from ever > completing, as they may be waiting on an event to be dispatched. > > I know we're had various patch proposals in the past to improve the > robustness of shutdown cleanup but I can't remember the outcome of the > reviews. Hopefully people involved in those threads can jump in here... Hi, all! Yeah I and John Ferlan attempted to address the issue in the past. The last reference I found is [1]. One can dig down the history of the issue thru the links inside. IIRC the latest approach was similar to what you suggested below namely run event loop for a while after quit is seen in order to let RPC worker threads waiting for event to finish. [1] https://www.redhat.com/archives/libvir-list/2018-July/msg00382.html > > > IMHO the key problem here is the virNetDeamonRun() method which just > looks at the "quit" flag and immediately returns if it is set. > This needs to be changed so that when it sees quit == true, it takes > the following actions > > 1. Call virNetDaemonClose() to drop all RPC clients and thus prevent > new RPC calls arriving > 2. Flush any RPC calls which are queued but not yet assigned to a > worker thread > 3. Tell worker threads to exit after finishing their current job > 4. Wait for all worker threads to exit > 5. Now virNetDaemonRun may return > > At this point we can call virStateCleanup and the various other > things, as we know no drivers are still active in RPC calls. > > Having said that, there could be background threads in the the > drivers which are doing work that uses the event loop thread. > > So we probably need a virStateClose() method that we call from > virNetDaemonRun, *after* all worker threads are gone, which would > cleanup any background threads while the event loop is still > running. I guess RPC worker threads may need some signal to finish in time too. For example in case of migration. Nikolay > > > The issue is that step 4 above ("Wait for all worker threads to exit") > may take too long, or indeed never complete. To deal with this, it > will need a timeout. In the remote_daemon.c cleanup code path, if > there are still worker threads present, then we need to skip all > cleanup and simply call _exit(0) to terminate the process with no > attempt at cleanup, since it would be unsafe to try anything else. > > Regards, > Daniel >