On 10/23/2015 10:17 PM, Wang Yufei wrote:
> On 2015/10/23 23:32, John Ferlan wrote:
>>
>> On 10/13/2015 02:36 AM, Wang Yufei wrote:
>>> On 2015/10/2 20:17, John Ferlan wrote:
>>>>
>>>> On 09/26/2015 08:18 AM, Wang Yufei wrote:
>>>>> We shut down a VM A via the qemu agent; meanwhile an agent EOF
>>>>> for VM A happened, and there was a chance that a deadlock occurred:
>>>>>
>>>>> qemuProcessHandleAgentEOF in the main thread:
>>>>> A) priv->agent = NULL;  // A happened before B
>>>>>
>>>>>    // deadlock when we try to get the agent lock, which is held
>>>>>    // by the worker thread
>>>>>    qemuAgentClose(agent);
>>>>>
>>>>> qemuDomainObjExitAgent called by qemuDomainShutdownFlags in the worker thread:
>>>>> B) hasRefs = virObjectUnref(priv->agent);  // priv->agent is NULL, returns false
>>>>>
>>>>>    if (hasRefs)
>>>>>        virObjectUnlock(priv->agent);  // agent lock is never released here
>>>>>
>>>>> So I close the agent first, then set priv->agent to NULL, to fix the deadlock.
>>>>>
>>>>> Signed-off-by: Wang Yufei <james.wangyufei@xxxxxxxxxx>
>>>>> Reviewed-by: Ren Guannan <renguannan@xxxxxxxxxx>
>>>>> ---
>>>>>  src/qemu/qemu_process.c | 3 +--
>>>>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>>>>
>>>>
>>>> Interesting - this is the exact opposite of commit id '1020a504' from
>>>> Michal over 3 years ago.
>>>>
>>>> However, a bit of digging into the claim from the commit message drove
>>>> me to commit id '362d04779c', which removes the domain object lock that
>>>> was the basis of the initial patch.
>>>>
>>>> While I'm not an expert in the vm agent code, I do note that the
>>>> HandleAgentEOF code checks for !priv->agent and priv->beingDestroyed,
>>>> while the ExitAgent code doesn't necessarily (or directly) check whether
>>>> priv->agent is still valid (IOW: that nothing else has removed it
>>>> already, like the EOF handler).
>>>>
>>>> So, while I don't discount that the patch works - I'm wondering whether
>>>> the smarts/logic should be built into ExitAgent to check for
>>>> !priv->agent (and/or something else) that would indicate we're already
>>>> on the path of shutdown.
>>>>
>>>> John
>>>>
>>>>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>>>>> index f2586a1..8c9622e 100644
>>>>> --- a/src/qemu/qemu_process.c
>>>>> +++ b/src/qemu/qemu_process.c
>>>>> @@ -150,11 +150,10 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent,
>>>>>          goto unlock;
>>>>>      }
>>>>>
>>>>> +    qemuAgentClose(agent);
>>>>>      priv->agent = NULL;
>>>>>
>>>>>      virObjectUnlock(vm);
>>>>> -
>>>>> -    qemuAgentClose(agent);
>>>>>      return;
>>>>>
>>>>>  unlock:
>>>>
>>>
>>> Thank you for your reply.
>>>
>>> First, we should consider the right logic. In my opinion, we should call
>>> qemuAgentClose first and then set priv->agent to NULL, just like the
>>> logic in qemuProcessStop:
>>>
>>>     if (priv->agent) {
>>>         qemuAgentClose(priv->agent);
>>>         priv->agent = NULL;
>>>         priv->agentError = false;
>>>     }
>>>
>>> Based on that logic, we then consider the right locking, which I have
>>> shown in my patch.
>>>
>>> Should the smarts/logic be built into ExitAgent to check for
>>> !priv->agent (and/or something else) that would indicate we're already
>>> on the path of shutdown?
>>>
>>> The answer is yes, we have to, because in ExitAgent we use the agent
>>> lock, which may be released by another thread (e.g. via qemuAgentClose),
>>> so we have to check the agent first to make sure it's safe to access.
>>>
>>
>> It doesn't feel right that moving qemuAgentClose to inside where the vm
>> lock is held will fix the issue. Or does the virCondSignal (e.g.
>> pthread_cond_signal or pthread_cond_broadcast) release the mutex that
>> would be held by the thread that did the EnterAgent?
>>
>> So let's go back to your scenario again, where A happens before B...
>> As I see it, the bug is in B, which accesses and unreferences
>> priv->agent *without* first getting the vm lock (like other instances
>> do), assuming that priv->agent could still be valid (or the same)
>> between the time the vm was unlocked in EnterAgent and the point at
>> which we are in ExitAgent.
>>
>> So the bug would then seem to be in the EnterAgent, run agent command,
>> ExitAgent sequence. Since we release the vm lock during EnterAgent,
>> "anything" that could alter the priv->agent setting while the lock is
>> not held needs to be handled.
>>
>> Thus perhaps the fix should be (in all callers):
>>
>>     agent = priv->agent;
>>     EnterAgent(vm);
>>     *Agent* Command
>>     ExitAgent(vm, agent);
>>
>> where ExitAgent then has:
>>
>>     virObjectLock(obj);
>>     if (priv->agent != agent) {
>>         VIR_DEBUG("priv->agent=%p differs from agent=%p",
>>                   priv->agent, agent);
>>         virObjectUnlock(agent);
>>         return;
>>     }
>>
>>     if ((hasRefs = virObjectUnref(priv->agent)))
>>         virObjectUnlock(priv->agent);
>>
>>     VIR_DEBUG("Exited agent (agent=%p vm=%p name=%s hasRefs=%d)",
>>               priv->agent, obj, obj->def->name, hasRefs);
>>
>>     priv->agentStart = 0;
>>     if (!hasRefs)
>>         priv->agent = NULL;
>>
>> Does this seem reasonable?
>>
>> John
>>
>
> Let's look at the lock order if we modify it your way:
>
> Thread B
>     acquires the agent lock (qemuDomainObjEnterAgent)
>     acquires the vm lock (qemuDomainObjExitAgent, still holding the agent lock)
>
> Thread A
>     acquires the vm lock (qemuProcessHandleAgentEOF)
>     acquires the agent lock (qemuAgentClose, still holding the vm lock)
>
> Boom, deadlock! Am I right? We must acquire the locks in the same order.

First off - I've removed the fix you proposed from my environment...

If Thread B has the agent lock and gets the vm lock, then Thread A cannot
get the vm lock until Thread B releases it. So how/when does Thread A get
the agent lock (qemuAgentClose)? Well, once Thread B is done with it during
virDomainObjEndAPI. So where's the deadlock?
By that time, Thread B has also released the agent lock.

Conversely, if Thread A (EOF) gets the vm lock, then Thread B cannot get it
until Thread A releases it. Once Thread A releases it and Thread B gets it,
B can determine whether priv->agent is still the same (or valid).

In the EOF code you "could" have the agent lock on entrance, but you need
the vm lock in order to clear out the agent from the vm, before calling
qemuAgentClose with the agent passed in by the caller (either a shutdown or
some sort of catastrophic failure).

So without any changes we currently have the following:

EnterAgent
    w/ vm-lock
    get agent-lock
    vm-unlock

ExitAgent
    w/ agent-lock
    unref/check agent refs
    if there are refs
        unlock-agent
    get vm-lock
    adjust vm agent related fields

EOF
    (may have agent-lock)
    acquire vm-lock
    check vm agent fields
    clear vm->priv->agent
    unlock-vm
    AgentClose(agent)

I think the current ordering is correct; however, the problem is that
ExitAgent *assumes* priv->agent is valid without getting the vm-lock. As
you pointed out in your commit message, if EOF runs, it clears priv->agent
(with the vm-lock held). If we drop back even further: once we release the
vm-lock after EnterAgent, we should not 'assume' priv->agent is valid...

Probably easier if I post some patches to continue the debate from there.

John

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list