On 2015/10/23 23:32, John Ferlan wrote:
>
>
> On 10/13/2015 02:36 AM, Wang Yufei wrote:
>> On 2015/10/2 20:17, John Ferlan wrote:
>>
>>>
>>>
>>> On 09/26/2015 08:18 AM, Wang Yufei wrote:
>>>> We shut down a VM A via the qemu agent; meanwhile an agent EOF
>>>> for VM A happened, and there's a chance that a deadlock occurred:
>>>>
>>>> qemuProcessHandleAgentEOF in the main thread
>>>> A) priv->agent = NULL;  // A happened before B
>>>>
>>>>    // deadlock when we try to take the agent lock, which is held
>>>>    // by the worker thread
>>>>    qemuAgentClose(agent);
>>>>
>>>> qemuDomainObjExitAgent called by qemuDomainShutdownFlags in the worker thread
>>>> B) hasRefs = virObjectUnref(priv->agent);  // priv->agent is NULL, returns false
>>>>
>>>>    if (hasRefs)
>>>>        virObjectUnlock(priv->agent);  // the agent lock is not released here
>>>>
>>>> So I close the agent first, then set priv->agent to NULL, to fix the deadlock.
>>>>
>>>> Signed-off-by: Wang Yufei <james.wangyufei@xxxxxxxxxx>
>>>> Reviewed-by: Ren Guannan <renguannan@xxxxxxxxxx>
>>>> ---
>>>>  src/qemu/qemu_process.c | 3 +--
>>>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>>>
>>>
>>> Interesting - this is the exact opposite of commit id '1020a504' from
>>> Michal over 3 years ago.
>>>
>>> However, a bit of digging into the claim from the commit message drove
>>> me to commit id '362d04779c', which removes the domain object lock that
>>> was the basis of the initial patch.
>>>
>>> While I'm not an expert in the vmagent code, I do note that the
>>> HandleAgentEOF code checks for !priv->agent and priv->beingDestroyed,
>>> while the ExitAgent code doesn't necessarily (or directly) check whether
>>> priv->agent is still valid (IOW: that nothing else has removed it
>>> already, like the EOF).
>>>
>>> So, while I don't discount that the patch works - I'm wondering whether
>>> the smarts/logic should be built into ExitAgent to check for
>>> !priv->agent (and/or something else) that would indicate we're already
>>> on the path of shutdown.
>>>
>>> John
>>>
>>>
>>>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>>>> index f2586a1..8c9622e 100644
>>>> --- a/src/qemu/qemu_process.c
>>>> +++ b/src/qemu/qemu_process.c
>>>> @@ -150,11 +150,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent,
>>>>          goto unlock;
>>>>      }
>>>>
>>>> +    qemuAgentClose(agent);
>>>>      priv->agent = NULL;
>>>>
>>>>      virObjectUnlock(vm);
>>>> -
>>>> -    qemuAgentClose(agent);
>>>>      return;
>>>>
>>>>  unlock:
>>>>
>>>
>>> .
>>>
>>
>> Thank you for your reply.
>>
>> First, we should settle on the right logic. In my opinion, we should call
>> qemuAgentClose first and then set priv->agent to NULL, just like the logic
>> in qemuProcessStop:
>>
>>     if (priv->agent) {
>>         qemuAgentClose(priv->agent);
>>         priv->agent = NULL;
>>         priv->agentError = false;
>>     }
>>
>> Based on that logic, we then work out the right locking, which is what I
>> have shown in my patch.
>>
>> Should the smarts/logic be built into ExitAgent to check for !priv->agent
>> (and/or something else) that would indicate we're already on the path of
>> shutdown?
>>
>> The answer is yes, we have to, because in ExitAgent we use the agent lock,
>> which may have been released by another thread (such as one calling
>> qemuAgentClose), so we have to check the agent first to make sure it is
>> safe to access.
>>
>>
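To make that hazard concrete outside of libvirt, here is a minimal, self-contained
pthreads sketch (invented names, illustration only, not libvirt code). The global
shared_agent plays the role of priv->agent: once another thread has cleared it, the
"unlock only if the pointer is still set" exit path skips the unlock, the mutex taken
at enter time stays held, and a qemuAgentClose()-style caller that needs that lock
would block forever.

    /* Illustration only -- plain pthreads, not libvirt code. */
    #include <pthread.h>
    #include <stdio.h>

    struct agent {
        pthread_mutex_t lock;
    };

    static struct agent the_agent = { PTHREAD_MUTEX_INITIALIZER };
    static struct agent *shared_agent = &the_agent;    /* stands in for priv->agent */

    static void enter_agent(void)
    {
        pthread_mutex_lock(&shared_agent->lock);        /* like EnterAgent */
    }

    static void exit_agent(void)
    {
        /* Like the ExitAgent path above: the unlock is skipped once the
         * shared pointer has been cleared, so the mutex stays held. */
        if (shared_agent)
            pthread_mutex_unlock(&shared_agent->lock);
    }

    int main(void)
    {
        enter_agent();
        shared_agent = NULL;   /* the EOF handler won the race and cleared it */
        exit_agent();          /* unlock skipped -> the agent lock is leaked */

        /* A closer that now needs the lock (like qemuAgentClose) would block. */
        if (pthread_mutex_trylock(&the_agent.lock) != 0)
            printf("agent lock still held: a closer would deadlock here\n");
        return 0;
    }
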
>
> It doesn't feel right that moving qemuAgentClose to inside where the vm
> lock is held will fix the issue.  Or does the virCondSignal (e.g.
> pthread_cond_signal or pthread_cond_broadcast) release the mutex that
> would be held by the thread that did the EnterAgent?
>
> So let's go back to your scenario again where A happens before B...  As
> I see it, the bug is in B, which accesses and unreferences priv->agent
> *without* first getting the vm lock (like other instances do), assuming
> that priv->agent is still valid (or the same) between the time the vm
> was unlocked in EnterAgent and the point we are at in ExitAgent.
>
> So the bug would then seem to be in the EnterAgent, run Agent command,
> ExitAgent sequence.  Since we release the vm lock during EnterAgent,
> "anything" that could alter the priv->agent setting while the lock is
> not held needs to be handled.
>
> Thus perhaps the fix should be (in all callers):
>
>     agent = priv->agent;
>     EnterAgent(vm);
>     *Agent* Command
>     ExitAgent(vm, agent)
>
> where ExitAgent then has:
>
>     virObjectLock(obj);
>     if (priv->agent != agent) {
>         VIR_DEBUG("priv->agent=%p differs from agent=%p",
>                   priv->agent, agent);
>         virObjectUnlock(agent);
>         return;
>     }
>
>     if ((hasRefs = virObjectUnref(priv->agent)))
>         virObjectUnlock(priv->agent);
>
>     VIR_DEBUG("Exited agent (agent=%p vm=%p name=%s hasRefs=%d)",
>               priv->agent, obj, obj->def->name, hasRefs);
>
>     priv->agentStart = 0;
>     if (!hasRefs)
>         priv->agent = NULL;
>
> Does this seem reasonable?
>
> John
>
> .

Let's look at the lock ordering if we modify it your way:

Thread B
    acquires the agent lock (qemuDomainObjEnterAgent)
    acquires the vm lock (qemuDomainObjExitAgent, still holding the agent lock)

Thread A
    acquires the vm lock (qemuProcessHandleAgentEOF)
    acquires the agent lock (qemuAgentClose, still holding the vm lock)

Boom, deadlock! Am I right? We must acquire the locks in the same order on
every path.
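That ordering argument can be shown with a second small, self-contained pthreads
sketch (again invented names, not libvirt code, and it deadlocks by design): thread B
takes the agent lock and then the vm lock, while thread A takes them in the opposite
order, so once both threads have grabbed their first lock neither can make progress.
The sleeps only make the bad interleaving happen reliably; the point is that every
path has to acquire the two locks in the same order.

    /* Illustration only -- this program deadlocks by design. */
    #include <pthread.h>
    #include <unistd.h>

    static pthread_mutex_t vm_lock    = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t agent_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *thread_b(void *arg)   /* like EnterAgent .. ExitAgent */
    {
        (void)arg;
        pthread_mutex_lock(&agent_lock);  /* EnterAgent: take the agent lock */
        usleep(100 * 1000);               /* let thread A take the vm lock */
        pthread_mutex_lock(&vm_lock);     /* ExitAgent: vm lock while still
                                             holding the agent lock -- blocks */
        pthread_mutex_unlock(&vm_lock);
        pthread_mutex_unlock(&agent_lock);
        return NULL;
    }

    static void *thread_a(void *arg)   /* like HandleAgentEOF .. AgentClose */
    {
        (void)arg;
        pthread_mutex_lock(&vm_lock);     /* EOF handler: take the vm lock */
        usleep(100 * 1000);               /* let thread B take the agent lock */
        pthread_mutex_lock(&agent_lock);  /* AgentClose: agent lock while still
                                             holding the vm lock -- blocks */
        pthread_mutex_unlock(&agent_lock);
        pthread_mutex_unlock(&vm_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&b, NULL, thread_b, NULL);
        pthread_create(&a, NULL, thread_a, NULL);
        pthread_join(b, NULL);            /* never returns: ABBA deadlock */
        pthread_join(a, NULL);
        return 0;
    }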