On 09/12/2018 08:55 AM, Jiri Denemark wrote:
> The only place where VIR_DOMAIN_EVENT_RESUMED is generated is the RESUME

I assume you meant "should be generated"

> event handler to make sure we don't generate duplicate events or state
> changes. In the worse case the duplicity can revert or cover changes
> done by other event handlers.
>
> For example, after QEMU sent RESUME, BLOCK_IO_ERROR, and STOP events
> we could happily mark the domain as running and report
> VIR_DOMAIN_EVENT_RESUMED to registered clients.
>

This patch removes the non-QEMU-event resume processing, such as:

 1. User-initiated domain resume
 2. Resume after revert from snapshot
 3. Resume on the migration source after a failure is detected
 4. Resume on the migration destination once finished
 5. Resume after Shutdown/Reboot or Panic when the guest is configured
    to process a fake reboot

deferring completely to the RESUME event QEMU will send as a result of
the CPUs being successfully started.

...

My caveats:

 1. I think that list is correct - I may have mis-worded #5; I forget
    exactly how that code works until I'm debugging it ;-)
 2. By "the CPUs being successfully started" I assume we really mean
    the "cont" command succeeding.

> https://bugzilla.redhat.com/show_bug.cgi?id=1612943
>
> Signed-off-by: Jiri Denemark <jdenemar@xxxxxxxxxx>
> ---
>  src/qemu/qemu_driver.c    | 13 -------------
>  src/qemu/qemu_migration.c | 36 ++++++------------------------------
>  src/qemu/qemu_process.c   | 10 ++++------
>  3 files changed, 10 insertions(+), 49 deletions(-)
>

What I typed above could be "massaged" into the commit message - so it's
"easier" at a glance to understand what's going on. It's of course my
wording and my way of making sure I've properly understood what's
happening without being able to stop by your cube and ask directly ;-).

[...]

> diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
> index 825a9d399b..4771f26938 100644
> --- a/src/qemu/qemu_migration.c
> +++ b/src/qemu/qemu_migration.c

[...]

>
> @@ -5074,13 +5060,8 @@ qemuMigrationDstFinish(virQEMUDriverPtr driver,
>              goto endjob;
>          }
>
> -        if (inPostCopy) {
> +        if (inPostCopy)
>              doKill = false;
> -            event = virDomainEventLifecycleNewFromObj(vm,
> -                                                      VIR_DOMAIN_EVENT_RESUMED,
> -                                                      VIR_DOMAIN_EVENT_RESUMED_POSTCOPY);
> -            virObjectEventStateQueue(driver->domainEventState, event);
> -        }
>      }
>
>      if (mig->jobInfo) {
> @@ -5111,11 +5092,6 @@ qemuMigrationDstFinish(virQEMUDriverPtr driver,
>
>      dom = virGetDomain(dconn, vm->def->name, vm->def->uuid, vm->def->id);
>
> -    event = virDomainEventLifecycleNewFromObj(vm,
> -                                              VIR_DOMAIN_EVENT_RESUMED,
> -                                              VIR_DOMAIN_EVENT_RESUMED_MIGRATED);
> -    virObjectEventStateQueue(driver->domainEventState, event);
> -
>      if (virDomainObjGetState(vm, NULL) == VIR_DOMAIN_PAUSED) {
>          virDomainObjSetState(vm, VIR_DOMAIN_PAUSED, VIR_DOMAIN_PAUSED_USER);
>          event = virDomainEventLifecycleNewFromObj(vm,

For this path, 2 events were generated if @inPostCopy == true - one
right after the qemuProcessStartCPUs for RESUMED_POSTCOPY and then one
for SUSPENDED_PAUSED once qemuMigrationDstWaitForCompletion occurred.

IIUC, the changes here when @inPostCopy == true would only generate the
POSTCOPY event, but not the MIGRATED event. Or is there something
subtle I'm missing?
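To make sure we're picturing the same model, here's roughly how I
imagine the detail gets chosen now. This is entirely my own sketch -
the function name and the exact mapping are my guesses, not code
lifted from this series:

    /* Illustrative only, not from the patch: one QEMU RESUME event
     * yields exactly one RESUMED lifecycle event, whose detail is
     * derived from the reason the CPUs were started with. */
    static int
    resumedDetailForReason(virDomainRunningReason reason)
    {
        switch (reason) {
        case VIR_DOMAIN_RUNNING_MIGRATED:
            return VIR_DOMAIN_EVENT_RESUMED_MIGRATED;
        case VIR_DOMAIN_RUNNING_POSTCOPY:
            return VIR_DOMAIN_EVENT_RESUMED_POSTCOPY;
        default:
            return VIR_DOMAIN_EVENT_RESUMED_UNPAUSED;
        }
    }

If that's the shape of it, the RESUME handler would queue a single
virDomainEventLifecycleNewFromObj(vm, VIR_DOMAIN_EVENT_RESUMED,
resumedDetailForReason(reason)), so one "cont" on the destination can
only ever carry one detail.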
Will the post-copy processing generate two resume events? If so, we
perhaps should document that.

If not, then perhaps this second event that's being deleted needs to be
moved inside that inPostCopy condition just above, with a note that
this one case is "abnormal" (or you could just do the *StartCPUs again
with the RESUMED_MIGRATED and close your eyes ;-)).

Everything else is fine - it's just this one case.

John

[...]

Curious, for the future: would it be useful to change the FakeReboot
path to generate a different detail reason? It's hard to distinguish
someone using virsh resume (or recovery from a Save, Dump, or Snapshot
failure) from that fake reboot path - at least from the event alone. I
guess VIR_DOMAIN_EVENT_RESUMED_UNPAUSED just seems too generic now that
it covers more than a "normal resume due to an admin unpause", but I
would think that's extra work given the "scope" of this particular
problem.

Second curiosity: would it be useful/easy to generate an event on the
failure scenario for StartCPUs? Knowing from whence we were called, we
should be able to generate a suspended event there rather than having
the callers decide - something along the lines of the sketch below.
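Purely hypothetical sketch of what I mean - the helper name, where it
would live, and the detail value are all mine, not anything in the
tree today:

    /* Hypothetical: if the CPU-start failure path queued the event
     * itself, individual callers would not each have to remember to.
     * The SUSPENDED detail used here is only an example. */
    static void
    exampleQueueSuspendedOnStartFailure(virQEMUDriverPtr driver,
                                        virDomainObjPtr vm)
    {
        virObjectEventPtr event;

        event = virDomainEventLifecycleNewFromObj(vm,
                                                  VIR_DOMAIN_EVENT_SUSPENDED,
                                                  VIR_DOMAIN_EVENT_SUSPENDED_API_ERROR);
        virObjectEventStateQueue(driver->domainEventState, event);
    }

That is, whatever wraps "cont" could call something like that on its
own error path instead of leaving it to each caller.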