On 24.06.2011 07:19, Daniel Veillard wrote:
> On Wed, Jun 22, 2011 at 11:26:27AM -0600, Eric Blake wrote:
>> On 06/22/2011 11:05 AM, Jiri Denemark wrote:
>>> On Wed, Jun 22, 2011 at 16:47:18 +0100, Daniel P. Berrange wrote:
>>>> If the QEMU process has been stopped (kill -STOP/gdb), or the
>>>> QEMU process has live-locked itself, then we will never get a
>>>> reply from the monitor. We should not wait forever in this
>>>> case, but instead time out after a reasonable amount of time.
>>>>
>>>> NB if the host has high CPU load, or a single monitor command
>>>> intentionally takes a long time, then this will cause bogus
>>>> failures. In the case of high CPU load, arguably the guest
>>>> should have been migrated elsewhere, since you can't effectively
>>>> manage guests on a host if QEMU is taking > 30 seconds to reply
>>>> to simple commands. Since we use background migration, there
>>>> should not be any commands which take significant time to
>>>> execute any more
>>>
>>> The thing I'm most concerned about is that it is far too easy to get into such
>>> situations, especially since the disk cache subsystem in the Linux kernel is not the
>>> best thing in the world. While I agree that running guests on a loaded host is
>>> not very clever and guests should rather be migrated elsewhere, such a situation
>>> doesn't have to be intentional. In other words, in case of a malfunction of
>>> some kind (some processes go crazy, network disruptions, ...) QEMU may require
>>> more than the timeout to respond and we will penalize an innocent QEMU
>>> process because we won't be able to control it anymore even though the issues
>>> get fixed.
>>
>> Is there any way to measure time spent by the child process, rather than
>> just relying on wall-time elapsed? That is, when libvirt hits 30
>> seconds of wall time in waiting for a monitor, can it then check whether
>> the child process has accumulated any execution time (likely hung) vs.
>> no execution time (likely a starved system situation), and only give up
>> in the former case?
>
> Well a STOP'ed child process won't accumulate any execution time,
> and you won't be able to discriminate just based on this, but I think
> we should be able to poke Linux to see if the process is in D state for
> example and if we do mark the guest as not responding then being able
> to provide useful error information upon the associated API failure
> like
>   "Failed to contact domain: process stopped"
>   "Failed to contact domain: blocked on I/O"
>   "Failed to contact domain: process looping"
>
> would be a really good thing. That probing and reporting can be done
> as a separate step though
>
> Daniel
>

To me this looks like solving the Halting problem. That means that for
some cases we might be able to tell that qemu will not answer anymore,
but for others we will not. I agree that if qemu (and thus the libvirt
API call) does not return in ~30 seconds, users get anxious, but it
would be nice if we could at least send destroy to an unresponsive
domain.

Michal
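
For illustration only, here is a rough sketch of the kind of probing Daniel
describes, assuming a Linux host where /proc/<pid>/stat is readable. It is
not libvirt code: the function names and the fourth "cause unknown" message
are made up for this example. It samples the process state and accumulated
CPU ticks twice and maps the result onto the three error strings proposed
above. As noted, this is only a heuristic: a healthy QEMU whose vCPU threads
are busy also accumulates CPU time, so "process looping" is a guess at best.

```python
import time


def parse_proc_stat(text):
    """Return (state, cpu_ticks) from the contents of /proc/<pid>/stat."""
    # Field 2 (comm) is wrapped in parentheses and may itself contain
    # spaces or ')', so cut everything up to the last ')' first.
    after_comm = text[text.rindex(")") + 2:].split()
    state = after_comm[0]                                    # field 3
    cpu_ticks = int(after_comm[11]) + int(after_comm[12])    # utime + stime
    return state, cpu_ticks


def classify(before, after):
    """Guess why the monitor is silent from two (state, cpu_ticks) samples."""
    _, ticks_before = before
    state, ticks_after = after
    if state in ("T", "t"):        # stopped by a signal, or stopped by a tracer
        return "process stopped"
    if state == "D":               # uninterruptible sleep, usually disk I/O
        return "blocked on I/O"
    if ticks_after > ticks_before:
        return "process looping"   # burning CPU yet not answering
    return "no reply, cause unknown"   # hypothetical fallback message


def sample(pid):
    """Read and parse /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_proc_stat(f.read())


def diagnose(pid, interval=1.0):
    """Take two samples `interval` seconds apart and build an error string."""
    before = sample(pid)
    time.sleep(interval)
    after = sample(pid)
    return "Failed to contact domain: " + classify(before, after)
```

Something like diagnose(qemu_pid) would only be called after the monitor
timeout has already expired, so the extra second of sampling is spent on the
error path and not on every command.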