[copy Dave, for real]

On Mon, Jun 06, 2022 at 10:32:03AM -0400, Peter Xu wrote:
> [copy Dave]
>
> On Mon, Jun 06, 2022 at 12:29:39PM +0100, Daniel P. Berrangé wrote:
> > On Wed, Jun 01, 2022 at 02:50:21PM +0200, Jiri Denemark wrote:
> > > QEMU keeps guest CPUs running even in the postcopy-paused migration
> > > state so that processes that already have all the memory pages they
> > > need on the destination can keep running. However, this behavior
> > > might bring unexpected delays in interprocess communication, as some
> > > processes will be stopped until the migration is recovered and their
> > > memory pages are migrated. So let's make sure all guest CPUs are
> > > paused while postcopy migration is paused.
> > > ---
> > >
> > > Notes:
> > >     Version 2:
> > >     - new patch
> > >
> > >     - this patch does not currently work as QEMU cannot handle the
> > >       "stop" QMP command while in the postcopy-paused state... the
> > >       monitor just hangs
> > >       (see https://gitlab.com/qemu-project/qemu/-/issues/1052 )
> > >     - an ideal solution to the QEMU bug would be if QEMU itself
> > >       paused the CPUs for us and we just got notified about it via
> > >       QMP events
> > >     - but Peter Xu thinks this behavior is actually worse than
> > >       keeping vCPUs running
> >
> > I'd like to know what the rationale is here?
>
> I think the wording here is definitely stronger than what I meant. :-)
>
> My understanding was that stopping the VM may or may not help the guest,
> depending on the guest behavior at the point of migration failure. And if
> we're not 100% sure of that, doing nothing is the best we have, as
> explicitly stopping the VM is something extra we do, and it's not part of
> the requirements for either postcopy itself or the recovery routine.
>
> Some examples below.
>
> 1) If many of the guest threads are doing CPU-intensive work, and the
> pages they need are already migrated, then stopping the vcpu threads
> means they could have been running during this "downtime" but we forced
> them not to. Actually, if the postcopy didn't pause immediately after the
> switchover, we could very possibly have migrated the workload's pages
> already if the working set is not very large.
>
> 2) If we're reaching the end of the postcopy phase when it pauses, most
> of the pages could have been migrated already. So maybe only a few
> threads, or even none, will be stopped due to remote page faults.
>
> 3) Think about KVM async page fault: that's a feature the guest can use
> to yield the faulting thread when there's a page fault. It means that
> even if some of the page-faulted threads get stuck for a long time due
> to postcopy pausing, the guest is "smart" enough to know it'll take a
> long time (a userfaultfd fault is a major fault, and as long as KVM's
> gup won't get the page we put the page fault into the async PF queue),
> so the guest vcpu can explicitly schedule() away the faulted context and
> run some other threads that may not need to be blocked.
>
> What I wanted to say is that I don't know whether the assumption
> "stopping the VM will be better than not doing so" will always be true
> here. If it's case by case, I feel the better thing to do is nothing
> special.
>
> >
> > We've got a long history of knowing the behaviour and impact of
> > pausing a VM as a whole. Of course, some apps may have timeouts that
> > are hit if the paused time is too long, but overall this scenario is
> > not that different from a bare-metal machine doing suspend-to-RAM.
> > Application impact is limited, predictable and generally well
> > understood.
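>
> (For reference, the command that hangs in Jiri's case is the plain QMP
> "stop" command, nothing special on libvirt's side. Roughly, the exchange
> libvirt would normally expect is
>
>     -> { "execute": "stop" }
>     <- { "return": {} }
>
> followed by a STOP event, but in the postcopy-paused state the reply
> never arrives, so the monitor looks stuck from libvirt's point of view.)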
>
> My other question is: even if we stopped the VM, won't many of those
> timeout()s trigger anyway right after we resume it? I think I asked a
> similar question to Jiri, and the answer at that time was that we could
> avoid calling the timeout() function; however, I don't find that
> persuasive enough, as timeout() is the function that should take up the
> major time, so at least we're not sure whether we'll already be in it.
>
> My understanding is that a VM can work properly after a migration
> because the guest timekeeping will gradually sync up with real-world
> time, so if a major downtime is triggered we can hardly keep it from
> affecting the guest. What we can do is, if we know a piece of software
> runs in a VM context, make it robust against timeouts (and that's at
> least what I do in programs even on bare metal, because I'd assume the
> program may run on an extremely busy host).
>
> But I could be all wrong on that, because I don't know enough about the
> whole rationale behind the importance of stopping the VM in the past.
>
> >
> > I don't think we can say the same about the behaviour & impact on the
> > guest OS if we selectively block execution of random CPUs. An OS where
> > a certain physical CPU simply stops executing is not a normal scenario
> > that any application or OS is designed to expect. I think the chance
> > of the guest OS or application breaking in a non-recoverable way is
> > high. IOW, we might perform post-copy recovery and all might look well
> > from the host POV, but the guest OS/app is nonetheless broken.
> >
> > The overriding goal for migration has to be to minimize the danger to
> > the guest OS and its applications, and I think that's only viable if
> > the guest OS is running either all CPUs or no CPUs.
>
> I agree.
>
> >
> > The length of outage for a CPU when the post-copy transport is broken
> > is potentially orders of magnitude larger than the temporary blockage
> > while fetching a memory page asynchronously. The latter is obviously
> > not good for real-time-sensitive apps, but most apps and OSes will
> > cope with CPUs being stalled for hundreds of milliseconds. That isn't
> > the case if CPUs get stalled for minutes, or even hours, at a time due
> > to a broken network link needing admin recovery work in the host
> > infra.
>
> So let me also look at the issue of vm stop hanging: no matter whether
> we want an explicit vm_stop, that hang had better be avoided from
> libvirt's POV.
>
> Ideally it could be avoided, but I need to look into it. I think it may
> be that vm_stop was waiting for the other vcpus to exit to userspace,
> but those didn't really come alive after the SIG_IPI sent to them (in
> reality that's SIGUSR1; and I'm pretty sure all vcpu threads can handle
> SIGKILL.. so maybe I need to figure out where it got blocked in the
> kernel).
>
> I'll update either here or in the bug that Jiri opened when I get more
> clues out of it.
>
> Thanks,
>
> --
> Peter Xu

--
Peter Xu