On Wed, Jun 01, 2022 at 14:50:21 +0200, Jiri Denemark wrote:
> QEMU keeps guest CPUs running even in postcopy-paused migration state
> so that processes that already have all memory pages they need migrated
> to the destination can keep running. However, this behavior might bring
> unexpected delays in interprocess communication as some processes will
> be stopped until migration is recovered and their memory pages
> migrated. So let's make sure all guest CPUs are paused while postcopy
> migration is paused.
> ---
>
> Notes:
>     Version 2:
>     - new patch
>
>     - this patch does not currently work as QEMU cannot handle the
>       "stop" QMP command while in postcopy-paused state... the monitor
>       just hangs (see https://gitlab.com/qemu-project/qemu/-/issues/1052 )

Does it then somehow self-heal? Because if not ...

>     - an ideal solution of the QEMU bug would be if QEMU itself paused
>       the CPUs for us and we just got notified about it via QMP events
>     - but Peter Xu thinks this behavior is actually worse than keeping
>       vCPUs running
>     - so let's take this patch as a base for discussing what we should
>       be doing with vCPUs in postcopy-paused migration state
>
>  src/qemu/qemu_domain.c    |  1 +
>  src/qemu/qemu_domain.h    |  1 +
>  src/qemu/qemu_driver.c    | 30 +++++++++++++++++++++++++
>  src/qemu/qemu_migration.c | 47 +++++++++++++++++++++++++++++++++++++++
>  src/qemu/qemu_migration.h |  6 +++++
>  src/qemu/qemu_process.c   | 32 ++++++++++++++++++++++++++
>  6 files changed, 117 insertions(+)

[...]

> diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
> index 0314fb1148..58d7009363 100644
> --- a/src/qemu/qemu_migration.c
> +++ b/src/qemu/qemu_migration.c
> @@ -6831,6 +6831,53 @@ qemuMigrationProcessUnattended(virQEMUDriver *driver,
>  }
>
>
> +void
> +qemuMigrationUpdatePostcopyCPUState(virDomainObj *vm,
> +                                    virDomainState state,
> +                                    int reason,
> +                                    int asyncJob)
> +{
> +    virQEMUDriver *driver = QEMU_DOMAIN_PRIVATE(vm)->driver;
> +    int current;
> +
> +    if (state == VIR_DOMAIN_PAUSED) {
> +        VIR_DEBUG("Post-copy migration of domain '%s' was paused, stopping guest CPUs",
> +                  vm->def->name);
> +    } else {
> +        VIR_DEBUG("Post-copy migration of domain '%s' was resumed, starting guest CPUs",
> +                  vm->def->name);
> +    }
> +
> +    if (virDomainObjGetState(vm, &current) == state) {
> +        int eventType = -1;
> +        int eventDetail = -1;
> +
> +        if (current == reason) {
> +            VIR_DEBUG("Guest CPUs are already in the right state");
> +            return;
> +        }
> +
> +        VIR_DEBUG("Fixing domain state reason");
> +        if (state == VIR_DOMAIN_PAUSED) {
> +            eventType = VIR_DOMAIN_EVENT_SUSPENDED;
> +            eventDetail = qemuDomainPausedReasonToSuspendedEvent(reason);
> +        } else {
> +            eventType = VIR_DOMAIN_EVENT_RESUMED;
> +            eventDetail = qemuDomainRunningReasonToResumeEvent(reason);
> +        }
> +        virDomainObjSetState(vm, state, reason);
> +        qemuDomainSaveStatus(vm);
> +        virObjectEventStateQueue(driver->domainEventState,
> +                                 virDomainEventLifecycleNewFromObj(vm, eventType,
> +                                                                   eventDetail));
> +    } else if (state == VIR_DOMAIN_PAUSED) {
> +        qemuProcessStopCPUs(driver, vm, reason, asyncJob);

Then this will obviously break our ability to control qemu. If that
hang is permanent, then we certainly should not be doing this.

In which case, if we want to go ahead with pausing the guest ourselves,
then once qemu fixes the issue you've mentioned above it also needs to
add a 'feature' flag into QMP which we can probe, so that we avoid
knowingly breaking qemu.
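To make that concrete, a rough sketch of what the probing could look
like on our side is below. Note that QEMU_CAPS_MIGRATION_POSTCOPY_PAUSED_STOP
and the helper name are made up for this sketch; the real flag would
exist only once qemu adds it to the QMP schema and we wire it up in our
capability probing code:

/* Hypothetical sketch: stop the guest CPUs only when the running QEMU
 * advertises that "stop" is safe in postcopy-paused state.  The
 * QEMU_CAPS_MIGRATION_POSTCOPY_PAUSED_STOP flag does not exist yet;
 * it would be set from the usual QMP schema probing. */
static void
qemuMigrationStopCPUsIfSafe(virQEMUDriver *driver,
                            virDomainObj *vm,
                            int reason,
                            int asyncJob)
{
    qemuDomainObjPrivate *priv = vm->privateData;

    if (!virQEMUCapsGet(priv->qemuCaps,
                        QEMU_CAPS_MIGRATION_POSTCOPY_PAUSED_STOP)) {
        VIR_DEBUG("QEMU would hang on 'stop' in postcopy-paused state, "
                  "leaving guest CPUs of domain '%s' running",
                  vm->def->name);
        return;
    }

    if (qemuProcessStopCPUs(driver, vm, reason, asyncJob) < 0)
        VIR_WARN("Unable to pause guest CPUs of domain '%s'",
                 vm->def->name);
}

That way an old qemu simply keeps the current behaviour (vCPUs left
running) instead of us hanging its monitor.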
> +    } else {
> +        qemuProcessStartCPUs(driver, vm, reason, asyncJob);
> +    }
> +}
> +
> +
>  /* Helper function called while vm is active.  */
>  int
>  qemuMigrationSrcToFile(virQEMUDriver *driver, virDomainObj *vm,