On 27.10.2016 11:21, Nikolay Shirokovskiy wrote: > > > On 26.10.2016 22:48, John Ferlan wrote: >> >> >> On 10/04/2016 02:56 AM, Nikolay Shirokovskiy wrote: >>> Usually on domain shutdown event comes first from qemu >>> and the flag is unset in processSerialChangedEvent. However >>> there is a race between this function and qemuProcessHandleAgentEOF >>> because the former is executed in a thread pool and the latter >>> executed synchronously. qemuProcessHandleAgentEOF do not unset >>> the flag thus if it was triggered before it remains set after domain >>> shutdown/start and thus qemu agent becomes unavailable. >> >> >> What error occurred that would be avoided by clearing the agentError >> flag unconditionally? Considering that agentError being set is primarily > > In short qemu agent becomes unavailable after shutdown. > Long error case description is next: > > 1. agent error occurred and error flag is set. > 2. domain is shutted down. Shutdown follows the path on which error flag stays set. > (Actually error flag can be set in the process of shutdown, > which is I guess the case that I observed.) > > 3. domain is started, on any attempt to use agent you got > "QEMU guest agent is not available due to an error". > > As nobody clears error flag on start up it stays set. Obviously > we don't need to clear on the first domain start after daemon start > because qemuDomainObjPrivate is allocated and cleared to zeros. > But restarts rely upon assumption that this flag is cleared on shutdown. > And this is not true. There are shutdown paths where flags is not cleared. > This patch addresses one of this paths: > > > 1. shutdown started, in the process VSERPORT_CHANGE is delivered in > IO loop thread. However it is handled in a different thread ( > from driver->workerPool). > 2. in the same IO loop thread we got EOF from agent and synchronously > execute qemuProcessHandleAgentEOF. It closes agent, set priv->agent > to NULL but forgets to clear agentError. > 3. Now we execute VSERPORT_CHANGE handler(processSerialChangedEvent). > It is delayed because it is asynchronous to VSERPORT_CHANGE itself. > As priv->agent is NULL handler does not touch agentError flag. > > I guess usually time gap between VSERPORT_CHANGE and EOF is enough > and step 3 is executed before step 2 and agentError is cleared > upon shutdown. > > Basically it is enough to clear the flag on start up (patch 3) but I > thought just for the case we would want to clear the flag on agent cleanup > too. > >> used by "new connection" request processing (e.g. callers to >> qemuDomainAgentAvailable) in order to ascertain whether they should >> actually attempt the connection or fail because of some condition that >> would eventually lead to failure. I would think EOF from qemu_monitor >> for the qemu_agent would be one of those conditions *before* the domain >> state is not RUNNING or the priv->agent has been fully removed. > > Don't really understand the part 'f 'before state in not RUNNING'. >>From what I've seen in code agentError has 2 meanings, they described > in patch 5. > >> >> I think what I'm missing is what sequence of events led to this >> condition? How was it reproduced. >> >>> --- >>> src/qemu/qemu_process.c | 1 + >>> 1 file changed, 1 insertion(+) >>> >> >> So I understand "eventually" your goal is to remove agentError in patch >> 6, but you need to lead down the path to get there. So let's see if I >> can follow along... > > My original intention was to fix the bug that I tried to > explain above. But in the process I found the other path how > similar error can occur again (patch 2). Then I realized that > we can protect ourselfs from similar bugs just in one place (patch 3). > At this point I look into the code long enough to realize > one don't need the flag altogether and decided that detail explanation > of patch 1 and 2 helps me convince others to get rid of the flag. > >> >> The issue you chased is related to domain shutdown processing "order". >> >> The purpose up to this point in time for 'agentError' is to ensure >> callers of qemuDomainAgentAvailable() will fail before trying to use the >> agent. The qemuDomainAgentAvailable checks domain active, agentError, >> and whether the priv->agent is available. >> >> So the contention is during shutdown the EOF event firing at about the >> same time as the processSerialChangedEvent has caused an issue, but it's >> not 100% clear just from the description the sequence of events and the >> issue. So let's consider why these things happen: >> >> 1. The 'eofNotify' comes from the qemuMonitorIO function - ostensibly if >> the guest is shutting down >> >> 2. The 'processSerialChangedEvent' seems to occur for multiple reasons - >> startup, shutdown, some sort of channel failure (unix socket or pty). If >> I'm reading things correctly, this path is only involved when VSERPORT >> is being used. >> >> So, is this a case of guest startup where the initial agent >> connect/startup occurs (VSERPORT) and a guest shutdown racing? Or is >> this purely a shutdown ordering thing? > > Purely shutdown. And error in io loop from agent before shutdown > or in the process. > >> >>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c >>> index 31c8453..a581f96 100644 >>> --- a/src/qemu/qemu_process.c >>> +++ b/src/qemu/qemu_process.c >>> @@ -151,6 +151,7 @@ qemuProcessHandleAgentEOF(qemuAgentPtr agent, >>> >>> qemuAgentClose(agent); >>> priv->agent = NULL; >> >> Because this is cleared, then qemuDomainAgentAvailable will fail because >> the agent is not present instead of failing with there was some error. >> That would seem to make the following clearing "safe" since we know >> we'll have a failure any way, but... >> >>> + priv->agentError = false; >> >> How does this alone solve the problem you saw. AFAICT it just changes >> *where* the error occurs and the error message displayed. > > The place that is affected is the following domain start up and agent use. > >> >> Conversely, if for some reason for either >> >> 1. qemuConnectAgent failed and is being "flagged" in >> 1a. processSerialChangedEvent (for initial CONNECTED failure when >> QEMU_CAPS_VSERPORT_CHANGE) >> 1b. qemuMigrationFinish (for migration target) >> 1c. qemuProcessReconnect (for libvirtd restart) >> 1d. qemuProcessLaunch (initial domain startup before >> QEMU_CAPS_VSERPORT_CHANGE) >> 1e. qemuProcessAttach (for qemu-attach to existing qemu process) >> >> or >> >> 2. qemuProcessHandleAgentError - some sort of error and we're stopping >> any further agent communication. >> >> then you're potentially clearing agentError and could conceivably cause >> "extra" debugging steps to be taken only to figure out the reason it was >> because of some error rather than the "next" connection failing and >> indicating the failure was because of some previous error (it's a >> startup race for some other thread desiring to use the agent). > > As to case 1. AFAIU you talking about next case: > > 1. agent started (qemuAgentOpen is ok) > 2. for some reason we have to fail qemuConnectAgent ( > virSecurityManagerClearSocketLabel failed for example) > and set error flag. > 3. EOF is triggered and clears the error flag (in new version). > > This is not possible. qemuConnectAgent will eventually remove > agent handle from the io loop before setting the flag on failure and > EOF can not be delivered. I was not correct. io loop event dispatching drops io loop lock on event delivering thus EOF can be delivered. So we have 2 cases for the race when agent is started and immediately gets EOF. A. qemuConnectAgent wins 1. virSecurityManagerClearSocketLabel fails, priv->agent stays NULL, error is set. 2. qemuProcessHandleAgentEOF exits early on priv->agent check. B. qemuProcessHandleAgentEOF wins 1. qemuProcessHandleAgentEOF exits early on priv->agent check. (Looks like debug message will be misleading in this case) 2. qemuConnectAgent proceeds as usual. In both cases clearing the error flag in qemuProcessHandleAgentEOF does not changes anything. > > This subtle case is another reason why we should throw away this flag. > It adds too much complexity to analysis. Patch 5 unbinds the flag from this case, > case of agent failed to start up. > > As to case 2. > > This is one of the patch targets. If there is an error set by qemuProcessHandleAgentError > we need to clean it up when agent is cleaned up. > >> >>> >>> virObjectUnlock(vm); >>> return; >>> >> >> So given all that - I'm not 100% clear how this really fixes the issue >> you say. Although, I have to wonder if the bug you're chasing is >> perhaps that qemuConnectAgent doesn't clear agentError at the right time >> as patch 3 attempts to fix. > > Yep. The flag is not cleared. And it can be addressed both ways: > clear on initialization to protect from "unclean" clean up, > second - make clean up "clean". > >> >> After reading through patch 4 - I think Patch 4 and a slightly modified >> patch 3 will fix what you're seeing, but I'd only know for sure given >> more details about what was seen. > > Patch 4 is pure refactoring IIRC and doesn't address the issue. > > Nikolay > > -- > libvir-list mailing list > libvir-list@xxxxxxxxxx > https://www.redhat.com/mailman/listinfo/libvir-list > -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list