Re: Races / crashes in shutdown of libvirtd daemon

Nikolay Shirokovskiy <nshirokovskiy@xxxxxxxxxxxxx> · Tue, 28 Apr 2020 18:18:50 +0300

On 27.04.2020 18:54, Daniel P. Berrangé wrote:
> We got a new BZ filed about a libvirtd crash in shutdown
> 
>   https://bugzilla.redhat.com/show_bug.cgi?id=1828207
> 
> We can see from the stack trace that the "interface" driver is in
> the middle of servicing an RPC call for virConnectListAllInterfaces()
> 
> Meanwhile the libvirtd daemon is doing virObjectUnref(dmn) on the
> virNetDaemonPtr object.
> 
> The fact that it is doing this unref, means that it must have already
> call virStateCleanup(), given the code sequence:
> 
> 
>     /* Run event loop. */
>     virNetDaemonRun(dmn);
> 
>     ret = 0;
> 
>     virHookCall(VIR_HOOK_DRIVER_DAEMON, "-", VIR_HOOK_DAEMON_OP_SHUTDOWN,
>                 0, "shutdown", NULL, NULL);
> 
>  cleanup:
>     /* Keep cleanup order in inverse order of startup */
>     virNetDaemonClose(dmn);
> 
>     virNetlinkEventServiceStopAll();
> 
>     if (driversInitialized) {
>         /* NB: Possible issue with timing window between driversInitialized
>          * setting if virNetlinkEventServerStart fails */
>         driversInitialized = false;
>         virStateCleanup();
>     }
> 
>     virObjectUnref(adminProgram);
>     virObjectUnref(srvAdm);
>     virObjectUnref(qemuProgram);
>     virObjectUnref(lxcProgram);
>     virObjectUnref(remoteProgram);
>     virObjectUnref(srv);
>     virObjectUnref(dmn);
> 
> 
> Unless I'm missing something non-obvious, this cleanup code path is
> inherantly broken & racy.   When virNetDaemonRun() returns the RPC
> worker threads are all still active. They are all liable to still
> be executing RPC calls, which means any of the drivers may be in
> use. So calling  virStateCleanup() is an inherantly dangerous
> thing to do. There is the further complication that once we have
> exitted the main loop we may prevent the RPC calls from ever
> completing, as they may be waiting on an event to be dispatched.
> 
> I know we're had various patch proposals in the past to improve the
> robustness of shutdown cleanup but I can't remember the outcome of the
> reviews. Hopefully people involved in those threads can jump in here...

Hi, all!

Yeah I and John Ferlan attempted to address the issue in the past.

The last reference I found is [1]. One can dig down the history
of the issue thru the links inside.

IIRC the latest approach was similar to what you suggested below namely
run event loop for a while after quit is seen in order to let
RPC worker threads waiting for event to finish.

[1] https://www.redhat.com/archives/libvir-list/2018-July/msg00382.html

> 
> 
> IMHO the key problem here is the virNetDeamonRun() method which just
> looks at the "quit" flag and immediately returns if it is set.
> This needs to be changed so that when it sees quit == true, it takes
> the following actions
> 
>   1. Call virNetDaemonClose() to drop all RPC clients and thus prevent
>      new RPC calls arriving
>   2. Flush any RPC calls which are queued but not yet assigned to a
>      worker thread
>   3. Tell worker threads to exit after finishing their current job
>   4. Wait for all worker threads to exit
>   5. Now virNetDaemonRun may return
> 
> At this point we can call virStateCleanup and the various other
> things, as we know no drivers are still active in RPC calls.
> 
> Having said that, there could be background threads in the the
> drivers which are doing work that uses the event loop thread.
> 
> So we probably need a virStateClose() method that we call from
> virNetDaemonRun, *after* all worker threads are gone, which would
> cleanup any background threads while the event loop is still
> running.

I guess RPC worker threads may need some signal to finish in time too.
For example in case of migration.

Nikolay

> 
> 
> The issue is that step 4 above ("Wait for all worker threads to exit")
> may take too long, or indeed never complete.  To deal with this, it
> will need a timeout. In the remote_daemon.c cleanup code path, if
> there are still worker threads present, then we need to skip all
> cleanup and simply call _exit(0) to terminate the process with no
> attempt at cleanup, since it would be unsafe to try anything else.
> 
> Regards,
> Daniel
>