Re: [PATCH v2 10/13] qemu: avoid deadlock in qemuDomainObjStopWorker

Nikolay Shirokovskiy <nshirokovskiy@xxxxxxxxxxxxx> · Tue, 11 Aug 2020 09:39:47 +0300

On 10.08.2020 20:40, Daniel Henrique Barboza wrote:
> 
> 
> On 7/23/20 7:14 AM, Nikolay Shirokovskiy wrote:
>> We are dropping the only reference here so that the event loop thread
>> is going to be exited synchronously. In order to avoid deadlocks we
>> need to unlock the VM so that any handler being called can finish
>> execution and thus even loop thread be finished too.
>>
>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@xxxxxxxxxxxxx>
>> ---
>>   src/qemu/qemu_domain.c | 18 ++++++++++++++----
>>   1 file changed, 14 insertions(+), 4 deletions(-)
>>
>> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
>> index 5b22eb2..82b3d11 100644
>> --- a/src/qemu/qemu_domain.c
>> +++ b/src/qemu/qemu_domain.c
>> @@ -1637,11 +1637,21 @@ void
>>   qemuDomainObjStopWorker(virDomainObjPtr dom)
>>   {
>>       qemuDomainObjPrivatePtr priv = dom->privateData;
>> +    virEventThread *eventThread;
>>   -    if (priv->eventThread) {
>> -        g_object_unref(priv->eventThread);
>> -        priv->eventThread = NULL;
>> -    }
>> +    if (!priv->eventThread)
>> +        return;
>> +
>> +    /*
>> +     * We are dropping the only reference here so that the event loop thread
>> +     * is going to be exited synchronously. In order to avoid deadlocks we
>> +     * need to unlock the VM so that any handler being called can finish
>> +     * execution and thus even loop thread be finished too.
>> +     */
>> +    eventThread = g_steal_pointer(&priv->eventThread);
>> +    virObjectUnlock(dom);
>> +    g_object_unref(eventThread);
>> +    virObjectLock(dom);
> 
> I understand the intention of the code thanks to the comment just before
> it, but this is not robust. This unlocking -> do stuff -> lock will only
> work if the caller of qemuDomainObjStopWorker() is holding the mutex.
> 
> I see that this is the case for qemuDomainObjStopWorkerIter(), but
> qemuDomainObjStopWorker() is also called by qemuProcessStop(). qemuProcessStop()
> does not make any mutex lock/unlock, so you'll be assuming that all callers of
> qemuProcessStop() will hold the mutex before calling it. One of them is qemuProcessInit(),
> which calls qemuProcessStop() in the 'stop' jump, and there is no mutex lock
> beforehand as well.
> 
> Now we're assuming that all callers of qemuProcessInit() are holding the mutex
> lock beforehand. In a quick glance in the code I saw at least 2 instances that
> this isn't the case, then we'll need to assume that the callers of those functions
> are holding the mutex lock. So on and so forth.
> 
> 
> Given that this code only makes sense when called from qemuDomainObjStopWorkerIter(),
> I'd suggest removing the lock/unlock of this function (turning it into just a call
> to qemuDomainObjStopWorker()) and move them inside the body of qemuDomainObjStopWorker(),
> locking and unlocking the mutex inside the scope of the same function.
> 

Hi, Daniel.

Actually all callers of qemuProcessStop hold the lock. Moreover they even hold  
job condition. And calling qemuProcessStop without lock/job condition would be  
a mistake AFIU (qemuConnectDomainXMLToNative is the only exception I see that   
holds the lock but not the job condition. But this function creates new vm      
object that is not shared with other threads) I understand you concern but      
there are precedents. Take a look for example virDomainObjListRemove. The       
lock requirements are documented for virDomainObjListRemove and I can do it for 
qemuDomainObjStopWorker too but it looks a bit over documenting to me.   

Nikolay