On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote: > Usually on domain shutdown event comes first from qemu > and the flag is unset in processSerialChangedEvent. However > there is a race between this function and qemuProcessHandleAgentEOF > because the former is executed in a thread pool and the latter > executed synchronously. qemuProcessHandleAgentEOF do not unset > the flag thus if it was triggered before it remains set after domain > shutdown/start and thus qemu agent becomes unavailable. What error occurred that would be avoided by clearing the agentError flag unconditionally? Considering that agentError being set is primarily used by "new connection" request processing (e.g. callers to qemuDomainAgentAvailable) in order to ascertain whether they should actually attempt the connection or fail because of some condition that would eventually lead to failure. I would think EOF from qemu_monitor for the qemu_agent would be one of those conditions *before* the domain state is not RUNNING or the priv->agent has been fully removed. I think what I'm missing is what sequence of events led to this condition? How was it reproduced. > --- > src/qemu/qemu_process.c | 1 + > 1 file changed, 1 insertion(+) > So I understand "eventually" your goal is to remove agentError in patch 6, but you need to lead down the path to get there. So let's see if I can follow along... The issue you chased is related to domain shutdown processing "order". The purpose up to this point in time for 'agentError' is to ensure callers of qemuDomainAgentAvailable() will fail before trying to use the agent. The qemuDomainAgentAvailable checks domain active, agentError, and whether the priv->agent is available. So the contention is during shutdown the EOF event firing at about the same time as the processSerialChangedEvent has caused an issue, but it's not 100% clear just from the description the sequence of events and the issue. So let's consider why these things happen: 1. The 'eofNotify' comes from the qemuMonitorIO function - ostensibly if the guest is shutting down 2. The 'processSerialChangedEvent' seems to occur for multiple reasons - startup, shutdown, some sort of channel failure (unix socket or pty). If I'm reading things correctly, this path is only involved when VSERPORT is being used. So, is this a case of guest startup where the initial agent connect/startup occurs (VSERPORT) and a guest shutdown racing? Or is this purely a shutdown ordering thing? > diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c > index 31c8453..a581f96 100644 > --- a/src/qemu/qemu_process.c > +++ b/src/qemu/qemu_process.c > @@ -151,6 +151,7 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent, > > qemuAgentClose(agent); > priv->agent = NULL; Because this is cleared, then qemuDomainAgentAvailable will fail because the agent is not present instead of failing with there was some error. That would seem to make the following clearing "safe" since we know we'll have a failure any way, but... > + priv->agentError = false; How does this alone solve the problem you saw. AFAICT it just changes *where* the error occurs and the error message displayed. Conversely, if for some reason for either 1. qemuConnectAgent failed and is being "flagged" in 1a. processSerialChangedEvent (for initial CONNECTED failure when QEMU_CAPS_VSERPORT_CHANGE) 1b. qemuMigrationFinish (for migration target) 1c. qemuProcessReconnect (for libvirtd restart) 1d. qemuProcessLaunch (initial domain startup before QEMU_CAPS_VSERPORT_CHANGE) 1e. qemuProcessAttach (for qemu-attach to existing qemu process) or 2. qemuProcessHandleAgentError - some sort of error and we're stopping any further agent communication. then you're potentially clearing agentError and could conceivably cause "extra" debugging steps to be taken only to figure out the reason it was because of some error rather than the "next" connection failing and indicating the failure was because of some previous error (it's a startup race for some other thread desiring to use the agent). > > virObjectUnlock(vm); > return; > So given all that - I'm not 100% clear how this really fixes the issue you say. Although, I have to wonder if the bug you're chasing is perhaps that qemuConnectAgent doesn't clear agentError at the right time as patch 3 attempts to fix. After reading through patch 4 - I think Patch 4 and a slightly modified patch 3 will fix what you're seeing, but I'd only know for sure given more details about what was seen. John -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list