On Tue, Feb 11, 2025 at 14:20:27 +0100, Peter Krempa wrote:
> On incoming migration qemu doesn't activate the block graph nodes right
> away. This is to properly facilitate locking of the images.
>
> The block nodes are normally re-activated when starting the CPUs after
> migration, but in cases (e.g. when a paused VM was migrated) when the VM
> is left paused the block nodes are not re-activated by qemu.
>
> This means that blockjobs which would want to write to an existing
> backing chain member would fail. Generally read-only jobs would succeed
> with older qemu's but this was not intended.
>
> Instead with new qemu you'll always get an error if attempting to access
> an inactive node:
>
>  error: internal error: unable to execute QEMU command 'blockdev-mirror': Inactive 'libvirt-1-storage' can't be a backing child of active '#block052'
>
> This is the case for explicit blockjobs (virsh blockcopy) but also for
> non shared-storage migration (virsh migrate --copy-storage-all).
>
> Since qemu now provides 'blockdev-set-active' QMP command which can
> on-demand re-activate the nodes we can re-activate them in similar cases
> as when we'd be starting vCPUs if the VM weren't left paused.
>
> The only exception is on the source in case of a failed post-copy
> migration as the VM already ran on destination so it won't ever run on
> the source even when recovered.
>
> Resolves: https://issues.redhat.com/browse/RHEL-78398
> Signed-off-by: Peter Krempa <pkrempa@xxxxxxxxxx>
> ---
>  src/qemu/qemu_migration.c | 46 +++++++++++++++++++++++++++++++++++----
>  1 file changed, 42 insertions(+), 4 deletions(-)
>
> diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
> index 53bbbee629..159521fbf5 100644
> --- a/src/qemu/qemu_migration.c
> +++ b/src/qemu/qemu_migration.c

...

> @@ -238,11 +271,14 @@ qemuMigrationSrcRestoreDomainState(virQEMUDriver *driver, virDomainObj *vm)
>
>      if (preMigrationState != VIR_DOMAIN_RUNNING ||
>          state != VIR_DOMAIN_PAUSED ||
> -        reason == VIR_DOMAIN_PAUSED_POSTCOPY_FAILED)
> -        return;
> +        reason == VIR_DOMAIN_PAUSED_POSTCOPY_FAILED) {
> +        if (reason == VIR_DOMAIN_PAUSED_IOERROR)
> +            VIR_DEBUG("Domain is paused due to I/O error, skipping resume");
> +
> +        /* Don't reactivate disks on post-copy failure */
> +        if (reason != VIR_DOMAIN_PAUSED_POSTCOPY_FAILED)
> +            qemuMigrationBlockNodesReactivate(vm, VIR_ASYNC_JOB_MIGRATION_OUT);
>
> -    if (reason == VIR_DOMAIN_PAUSED_IOERROR) {
> -        VIR_DEBUG("Domain is paused due to I/O error, skipping resume");
>          return;
>      }
>

The new condition is incorrect. Previously the
reason == VIR_DOMAIN_PAUSED_IOERROR part was effectively in the "else"
branch of the first condition, while you moved it to the "then" part.

So when expanded it was similar to

    if (preMigrationState == VIR_DOMAIN_RUNNING &&
        state == VIR_DOMAIN_PAUSED &&
        reason == VIR_DOMAIN_PAUSED_IOERROR)
        ...

This combination is not handled by the new condition and the domain
would be incorrectly resumed instead of staying paused.

> @@ -6795,6 +6831,8 @@ qemuMigrationDstFinishFresh(virQEMUDriver *driver,

The attempt to resume the domain may fail in the context elided above;
shouldn't we call Reactivate in such a case as well? In fact, the same
applies to qemuMigrationSrcRestoreDomainState, where resuming might
fail too.

>          if (*inPostCopy)
>              *doKill = false;
> +    } else {
> +        qemuMigrationBlockNodesReactivate(vm, VIR_ASYNC_JOB_MIGRATION_IN);
>      }
>
>      if (mig->jobData) {

Jirka
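
One possible shape for the restructured check in
qemuMigrationSrcRestoreDomainState, as a rough, untested sketch of the
point raised above (it reuses only the names visible in the quoted
patch): keep the IOERROR check as a separate branch on the resume path
and add the reactivation call only to the early-return path, e.g.

    /* Sketch only: the first condition keeps its original meaning (the
     * domain will not be resumed), and block nodes are reactivated there
     * unless the pause is due to a failed post-copy migration. */
    if (preMigrationState != VIR_DOMAIN_RUNNING ||
        state != VIR_DOMAIN_PAUSED ||
        reason == VIR_DOMAIN_PAUSED_POSTCOPY_FAILED) {
        /* Don't reactivate disks on post-copy failure */
        if (reason != VIR_DOMAIN_PAUSED_POSTCOPY_FAILED)
            qemuMigrationBlockNodesReactivate(vm, VIR_ASYNC_JOB_MIGRATION_OUT);
        return;
    }

    /* Evaluated only when the domain would otherwise be resumed, as in
     * the original code, so the IOERROR case still skips the resume. */
    if (reason == VIR_DOMAIN_PAUSED_IOERROR) {
        VIR_DEBUG("Domain is paused due to I/O error, skipping resume");
        /* The domain stays paused here too; whether block nodes should
         * also be reactivated in this case is left open. */
        return;
    }

This keeps the (running before migration, now paused, IOERROR)
combination out of the resume path, which is the case the original
second if handled.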