On 05/03/2018 05:26 AM, Nikolay Shirokovskiy wrote: > > > On 01.05.2018 01:03, John Ferlan wrote: >> >> >> On 04/18/2018 10:44 AM, Nikolay Shirokovskiy wrote: >>> Clearing beingDestroyed right after acquiring job condition is racy. >>> It is not known when EOF will be delivired. Let's keep this flag >> >> delivered >> >>> set. This makes possible to make a clear distinction between monitor >>> errors/eofs and domain being destroyed in qemuDomainObjWait. >>> >>> Also let's move setting destroyed flag out of qemuProcessBeginStopJob >>> as the function is called from eof handler too. >>> >>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@xxxxxxxxxxxxx> >>> --- >>> src/qemu/qemu_domain.c | 4 ++-- >>> src/qemu/qemu_domain.h | 2 +- >>> src/qemu/qemu_driver.c | 8 +++++++- >>> src/qemu/qemu_process.c | 24 ++++++++++++------------ >>> 4 files changed, 22 insertions(+), 16 deletions(-) >>> >> >> This one didn't git am -3 apply as well, but I see you're changing >> DomainObjWait so that makes sense as to why. >> >> I do recall looking at this code at one time, but I cannot recall my >> exact conclusion given how qemuDomainObjBeginJobInternal allows the >> QEMU_JOB_DESTROY to be run, but then waits for whatever is taking place >> now to finish. >> >>> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c >>> index 1f40ff1..431901c 100644 >>> --- a/src/qemu/qemu_domain.c >>> +++ b/src/qemu/qemu_domain.c >>> @@ -11913,9 +11913,9 @@ qemuDomainObjWait(virDomainObjPtr vm, unsigned long long until) >>> return -1; >>> } >>> >>> - if (!virDomainObjIsActive(vm)) { >>> + if (priv->destroyed) { >>> virReportError(VIR_ERR_OPERATION_FAILED, "%s", >>> - _("domain is not running")); >>> + _("domain is destroyed")); >>> return -1; >>> } >>> >>> diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h >>> index 494ed35..69112ea 100644 >>> --- a/src/qemu/qemu_domain.h >>> +++ b/src/qemu/qemu_domain.h >>> @@ -269,7 +269,7 @@ struct _qemuDomainObjPrivate { >>> bool agentError; >>> >>> bool gotShutdown; >>> - bool beingDestroyed; >>> + bool destroyed; >>> char *pidfile; >>> >>> virDomainPCIAddressSetPtr pciaddrs; >>> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c >>> index 03969d8..4356c0d 100644 >>> --- a/src/qemu/qemu_driver.c >>> +++ b/src/qemu/qemu_driver.c >>> @@ -2227,7 +2227,13 @@ qemuDomainDestroyFlags(virDomainPtr dom, >>> state = virDomainObjGetState(vm, &reason); >>> starting = (state == VIR_DOMAIN_PAUSED && >>> reason == VIR_DOMAIN_PAUSED_STARTING_UP && >>> - !priv->beingDestroyed); >>> + !priv->destroyed); >>> + >>> + /* We need to prevent monitor EOF callback from doing our work (and >>> + * sending misleading events) while the vm is unlocked inside >>> + * BeginJob/ProcessKill API >>> + */ >>> + priv->destroyed = true; >> >> would this be the right place anyway? especially since you don't clear >> it in error paths after setting it. Once the job starts and we either >> goto cleanup or endjob on failure - how does a future call distinguish? > > We send SIGTERM/SIGKILL right away after this flag is set so even if > this API fails eventually keeping destroyed flag set looks good because we > send signal to qemu and good chances are qemu will die because of that. > > I guess it will be nicer then to move setting the flag to qemuProcessBeginStopJob > function to keep setting the flag and killing domain together. > > Anyway we can clear the flag on failure too. > >> >> Not sure this works conceptually. At least with the current model if >> DestroyFlags finally gets the job, it checks the domain active state, >> and goes to endjob after printing a message. So if while waiting, EOF >> occurred, there's no qemuProcessStop > > Not true. After sending signal to qemu to terminate EOF will occur but > handler will return right away because of beingDestroyed/destroyed flag > check and then after destroyFlags gets the job it will call qemuProcessStop. > > This is not the place I'm addressing with the patch. It is rather if destroyFlags > is called when we are migrating for example then the interrupted > API call can report 'domain is not active'/'some monitor error' rather > then much nicer 'domain is destroyed' without this patch. See the > scenario below. > >> >> Perhaps the only thing I recall wondering is whether we clear >> beingDestroyed too soon... If we're successful, we're still destroying >> but not until destroy is successful should the flag then be cleared >> while still holding the job. > > It does not matter if we clear the flag at the begin or the end of time > we holding the job because during the job nobody else who needs the job > too can progress. > > I propose to prolong setting the flag after destroy job is finished (and > thus rename flag too). Consider next scenario: > > - migrate is called and wait for migrationg complete event from qemu > - we call destroy and as destroy can run concurrently with migration > destroy gets the job and executes domain stop > - migrate awaiks on monitor EOF, beingDestroyed is cleared back at this > point. We can check for domain is active and give 'domain is not active' > error but 'domain is destroyed' is nicer thus we need 'destroyed' flag > > Nikolay > Honestly - hoping things will be clearer for the round of this. John [...] -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list