On Wed, May 11, 2022 at 01:03:43PM +0200, Peter Krempa wrote:
> On Wed, May 11, 2022 at 11:39:29 +0100, Daniel P. Berrangé wrote:
> > On Wed, May 11, 2022 at 10:48:10AM +0200, Peter Krempa wrote:
> > > On Tue, May 10, 2022 at 17:20:27 +0200, Jiri Denemark wrote:
> > > > There's no need to artificially pause a domain when post-copy fails. The
> > > > virtual CPUs may continue running, only the guest tasks that decide to
> > > > read a page which has not been migrated yet will get blocked.
> > >
> > > IMO not pausing the VM is a policy decision (same way as pausing it was
> > > though) and should be user-configurable at migration start.
> > >
> > > I can see that users might want to prevent a half-broken VM from
> > > executing until it gets attention needed to fix it, even when it's safe
> > > from a "theoretical" standpoint.
> >
> > It isn't even safe from a theoretical standpoint though.
> >
> > Consider 2 processes in a guest that are communicating with each
> > other. 1 gets blocked on a page read due to broken post copy, but
> > we leave the guest running. The other process sees no progress
> > from the blocked process and/or hits a timeout and throws an
> > error. As a result the guest application workload ends up
> > completely dead, even if we later recover the postcopy
> > migration.
>
> IMO you have to deal with this scenario in a reduced scope anyways when
> opting into using post-copy.
>
> Each page transfer is vastly slower than the comparable access into
> memory, so if the 'timeout' portion is implied to be on the same order
> of magnitude of memory access latency then your software is going to have
> a very bad time when being migrated in post-copy mode. If the link gets
> congested ... then it's even worse.

That's likely a very different order of magnitude though. A "slow" page
access in post-copy is $LOW seconds. A blocked process due to a broken
post-copy connection is potentially $HIGH minutes long if the infra
takes a long time to fix.

A page access taking seconds rather than microseconds isn't going to
trip up many app level timeouts IMHO. A process blocked for many minutes
is highly likely to trigger app level timeouts.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|