Re: [libvirt PATCH 06/80] qemu: Keep domain running on dst on failed post-copy migration

Daniel P. Berrangé <berrange@xxxxxxxxxx> · Wed, 11 May 2022 11:39:29 +0100

On Wed, May 11, 2022 at 10:48:10AM +0200, Peter Krempa wrote:
> On Tue, May 10, 2022 at 17:20:27 +0200, Jiri Denemark wrote:
> > There's no need to artificially pause a domain when post-copy fails. The
> > virtual CPUs may continue running, only the guest tasks that decide to
> > read a page which has not been migrated yet will get blocked.
> 
> IMO not pausing the VM is a policy decision (same way as pausing it was
> though) and should be user-configurable at migration start.
> 
> I can see that users might want to prevent a half-broken VM from
> executing until it gets attention needed to fix it, even when it's safe
> from a "theoretical" standpoint.

It isn't even safe from a theoretical standpoint though.

Consider 2 processes in a guest that are communicating with each
other. One gets blocked on a page read due to broken post-copy, but
we leave the guest running.  The other process sees no progress
from the blocked process and/or hits a timeout and throws an
error. As a result the guest application workload ends up
completely dead, even if we later recover the post-copy
migration.
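
To make that scenario concrete, here is a toy model in Python. It is
only a sketch: threads stand in for the two guest processes, an Event
stands in for the page that has not been migrated yet, and all names
(run_workload, stall_seconds, peer_timeout) are made up for
illustration. Nothing here is libvirt or QEMU API.

```python
import queue
import threading


def run_workload(stall_seconds, peer_timeout):
    """Return "ok" if the peer got its reply in time, else "failed"."""
    replies = queue.Queue()
    # Stands in for the guest page that has not been migrated yet.
    page_available = threading.Event()

    def blocked_process():
        # Blocks like a task reading a page that is still on the source.
        page_available.wait()
        replies.put("reply")

    worker = threading.Thread(target=blocked_process, daemon=True)
    worker.start()

    # Post-copy recovery only makes the page available after stall_seconds.
    recovery = threading.Timer(stall_seconds, page_available.set)
    recovery.daemon = True
    recovery.start()

    try:
        # The peer keeps running and waits for progress, with a timeout.
        replies.get(timeout=peer_timeout)
        result = "ok"
    except queue.Empty:
        # The peer gave up; a later recovery does not undo this error.
        result = "failed"

    # Clean up so the helper threads exit.
    recovery.cancel()
    page_available.set()
    worker.join()
    return result
```

When the stall outlasts the peer's timeout, the workload has already
failed by the time the page arrives, which is the point above: the
recovery succeeds but the application does not come back.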

Migration needs to strive to be transparent to the guest workload
and IMHO having the guest workload selectively executing and
selectively blocked is not transparent enough. It should require
a user opt-in for this kind of behaviour.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|