On Wed, Feb 14, 2018 at 11:36 PM +0100, John Ferlan <jferlan@xxxxxxxxxx> wrote:
> On 01/26/2018 03:47 AM, Nikolay Shirokovskiy wrote:
>> On 19.01.2018 20:23, John Ferlan wrote:
>>> RFC:
>>> https://www.redhat.com/archives/libvir-list/2018-January/msg00318.html
>>>
>>> Adjustments since RFC...
>>>
>>> Patches 1&2: No change, were already R-B'd
>>> Patch 3: Removed code as noted in code review, updated the commit
>>> message
>>> Patch 4: Removed from the old series, see below for more details
>>> Patch 9: No change
>>> NB: Patches 5-8 and 10 from Nikolay Shirokovskiy
>>> <nshirokovskiy@xxxxxxxxxxxxx> are removed as they seemed not to be
>>> necessary
>>>
>>> Replaced the former patch 4 with a series of patches to (slowly)
>>> provide support to disable new connections, handle removing waiting
>>> jobs, cause the waiting workers to quit, and allow any running jobs
>>> to complete.
>>>
>>> As it turns out, waiting for running jobs to complete cannot be done
>>> from the virNetServerClose callbacks, because that means the event
>>> loop processing done during virNetServerRun will not give any
>>> currently long-running worker job thread a means to complete.
>>
>> Hi, John.
>
> Sorry - been busy in other areas lately - trying to get back to this...
>
>> So you suggest a different approach. Instead of introducing means to
>> help RPC threads finish after the event loop has finished (the
>> stateShutdown hook in my series), you suggest blocking new RPC
>> processing and then waiting for the running RPC calls to finish,
>> keeping the event loop running for that purpose.
>
> Somewhere along the way in one of the reviews it was suggested to give
> the workers perhaps a few extra cycles to complete anything they might
> be doing. It was also suggested a mechanism to do that - which is what
> I tried to accomplish here.
>
> Based on that, your mechanism was more of an "immediate separation" -
> IIRC closing the monitor connection and letting that close handler do
> whatever magic happens there.
>
> This is not to say your mechanism is the wrong way to go about it, but
> it also hasn't garnered support, nor have you actively attempted to
> champion it. So I presented an alternative approach.
>
>> This could lead to libvirtd never finishing its termination if, for
>> example, one of the qemu processes does not respond. In the case of a
>> long-running operation such as migration, finishing can take a lot of
>> time. On the other hand, if we finish the RPC threads abruptly, as in
>> the case of the stateShutdown hook, we will deal with all possible
>> error paths on daemon shutdown. Maybe a good approach is to abort
>> long-running operations, then give the RPC threads some time to
>> finish as you suggest, and only after that finish them abruptly to
>> handle a hanging qemu etc.
>
> True, but it's also easy enough to add something to the last and
> updated patch to add the QuitTimer and force an exit without going
> through the rest of the "friendly" quit (IOW: more or less what Dan
> has suggested). [see the first sketch below]
>
> There's an incredible number of cycles being spent on what amounts to
> trying to be nice about a SIGTERM, SIGQUIT, or SIGINT - IOW: eventual
> death.
>
>> Also, in this approach we keep the event loop running for the RPC
>> threads only, but there can be other threads that depend on the loop.
>> For example, if qemu is rebooted, we spawn a thread that executes
>> qemuProcessFakeReboot, which sends commands to qemu (i.e. depends on
>> the event loop). We need to wait for such threads to finish too while
>> keeping the event loop running. In the stateShutdown hook approach we
>> finish all threads that use the qemu monitor by closing the monitor.
>>
>> As for the hack with a timeout in a loop... I have a different
>> approach in mind for waiting for the RPC threads to finish their
>> work. First, let's make the drain function only flush the job queue
>> and take a callback which will be called when all workers are free
>> from work (let's keep the worker threads, as Dan suggested). In this
>> callback we can use the same technique as in
>> virNetDaemonSignalHandler: that is, do some I/O to set a flag in the
>> event loop thread, causing virEventRunDefaultImpl to finish, and then
>> test this flag in virNetDaemonRun. [see the second sketch below]
>
> Please feel free to spend as many cycles as you can making adjustments
> to what I have. I'm working through some alterations and will post a
> v2 of what I have, and we'll see where it takes me/us.
>
> John
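Regarding the QuitTimer John mentions: here is a minimal sketch of how
I read that idea, built only on the public virEventAddTimeout() API.
To be clear, this is not code from the series - QUIT_GRACE_MS,
quitTimer, and armQuitTimer are names I made up, and the 30-second
grace period is an arbitrary assumption.

  #include <libvirt/libvirt.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define QUIT_GRACE_MS (30 * 1000)   /* assumed 30s grace period */

  /* Fires once the grace period has elapsed: stop being friendly and
   * exit immediately instead of waiting on the workers any longer.
   * (virEventAddTimeout timers are recurring, but since the callback
   * exits the process it can only ever fire once.) */
  static void quitTimer(int timer, void *opaque)
  {
      _exit(EXIT_FAILURE);
  }

  /* Hypothetical hook for the shutdown path: arm the timer as soon as
   * the "friendly" quit (drain + wait) begins. */
  static int armQuitTimer(void)
  {
      return virEventAddTimeout(QUIT_GRACE_MS, quitTimer, NULL, NULL);
  }

Something like this would also address the hanging-qemu concern above:
the friendly drain gets a bounded amount of time, after which the
daemon dies anyway.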
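And for the wake-up technique Nikolay describes (the drain callback
doing some I/O to nudge the event loop): below is a minimal,
self-contained sketch against the public event API. This is not the
daemon's actual code - virNetDaemonRun and the worker pool are
simplified away, and quitRequested, wakeupPipe, onWakeup,
onWorkersIdle, and fakeDrain are names I invented for illustration.

  #include <libvirt/libvirt.h>
  #include <pthread.h>
  #include <stdbool.h>
  #include <unistd.h>

  static bool quitRequested;   /* the flag virNetDaemonRun would test */
  static int wakeupPipe[2];    /* self-pipe, written from other threads */

  /* Runs in the event loop thread once the pipe becomes readable. */
  static void onWakeup(int watch, int fd, int events, void *opaque)
  {
      char c;
      if (read(fd, &c, 1) == 1)
          quitRequested = true; /* seen once virEventRunDefaultImpl returns */
  }

  /* Hypothetical drain callback: called, from whatever thread releases
   * the last busy worker, once the job queue is flushed and all workers
   * are idle.  The pipe write is what pushes the loop out of poll(). */
  static void onWorkersIdle(void *opaque)
  {
      char c = 'q';
      ssize_t rc = write(wakeupPipe[1], &c, 1);
      (void)rc;
  }

  /* Simulate the pool draining two seconds after startup. */
  static void *fakeDrain(void *opaque)
  {
      sleep(2);
      onWorkersIdle(NULL);
      return NULL;
  }

  int main(void)
  {
      pthread_t tid;

      if (virEventRegisterDefaultImpl() < 0 || pipe(wakeupPipe) < 0)
          return 1;
      if (virEventAddHandle(wakeupPipe[0], VIR_EVENT_HANDLE_READABLE,
                            onWakeup, NULL, NULL) < 0)
          return 1;
      if (pthread_create(&tid, NULL, fakeDrain, NULL) != 0)
          return 1;

      /* Stand-in for virNetDaemonRun(): iterate until the drain
       * callback has signalled that all workers are idle. */
      while (!quitRequested)
          virEventRunDefaultImpl();

      pthread_join(tid, NULL);
      return 0;
  }

Since onWakeup runs in the event loop thread, quitRequested is only
ever touched there, so the flag test between loop iterations needs no
extra locking.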
Is there any update so far? I'm asking because I'm still getting
segmentation faults and hang-ups on termination of libvirtd (using the
newest version of libvirt). An example backtrace for a hang-up:

➤ bt
#0  0x000003fffca8df84 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x000003fffdac29ca in virCondWait (c=<optimized out>, m=<optimized out>) at ../../src/util/virthread.c:154
#2  0x000003fffdac381c in virThreadPoolFree (pool=<optimized out>) at ../../src/util/virthreadpool.c:290
#3  0x000003fffdbb21ae in virNetServerDispose (obj=0x1000cc640) at ../../src/rpc/virnetserver.c:803
#4  0x000003fffda97286 in virObjectUnref (anyobj=<optimized out>) at ../../src/util/virobject.c:350
#5  0x000003fffda97a5a in virObjectFreeHashData (opaque=<optimized out>, name=<optimized out>) at ../../src/util/virobject.c:591
#6  0x000003fffda66576 in virHashFree (table=<optimized out>) at ../../src/util/virhash.c:305
#7  0x000003fffdbaff82 in virNetDaemonDispose (obj=0x1000cc3c0) at ../../src/rpc/virnetdaemon.c:105
#8  0x000003fffda97286 in virObjectUnref (anyobj=<optimized out>) at ../../src/util/virobject.c:350
#9  0x0000000100026cd6 in main (argc=<optimized out>, argv=<optimized out>) at ../../src/remote/remote_daemon.c:1487

That is, virThreadPoolFree() waits forever in virCondWait() for a
worker that never quits. And the segmentation faults happen for RPC
jobs that are still running.

[…snip]

--
Beste Grüße / Kind regards
   Marc Hartmayer

IBM Deutschland Research & Development GmbH