On Wed, May 11, 2022 at 11:39:29 +0100, Daniel P. Berrangé wrote:
> On Wed, May 11, 2022 at 10:48:10AM +0200, Peter Krempa wrote:
> > On Tue, May 10, 2022 at 17:20:27 +0200, Jiri Denemark wrote:
> > > There's no need to artificially pause a domain when post-copy fails. The
> > > virtual CPUs may continue running, only the guest tasks that decide to
> > > read a page which has not been migrated yet will get blocked.
> >
> > IMO not pausing the VM is a policy decision (same way as pausing it was
> > though) and should be user-configurable at migration start.
> >
> > I can see that users might want to prevent a half-broken VM from
> > executing until it gets attention needed to fix it, even when it's safe
> > from a "theoretical" standpoint.
>
> It isn't even safe from a theoretical standpoint though.
>
> Consider 2 processes in a guest that are communicating with each
> other. 1 gets blocked on a page read due to broken post copy, but
> we leave the guest running. The other process sees no progress
> from the blocked process and/or hits a timeout and throws an
> error. As a result the guest application workload ends up
> completely dead, even if we later recover the postcopy
> migration.

IMO you have to deal with this scenario in a reduced scope anyways when
opting into using post-copy. Each page transfer is vastly slower than
the comparable access into memory, so if the 'timeout' portion is
implied to be on the same order of magnitude as memory access latency
then your software is going to have a very bad time when being migrated
in post-copy mode. If the link gets congested, it's even worse.

Obviously when the migration link breaks and you are getting an unbounded
wait for a page access, it's worse even for other types of apps.

Anyways, does qemu support pausing the destination if the connection
breaks? If not, the second best thing we can do is to pause it by
libvirt, but it will still have caveats, e.g. when libvirt is not around
to pause it.