Re: [RFC 0/2] Fix detection of slow guest shutdown

Alex Williamson <alex.williamson@xxxxxxxxxx> · Fri, 3 Aug 2018 10:39:55 -0600

On Fri,  3 Aug 2018 08:29:39 +0200
Christian Ehrhardt <christian.ehrhardt@xxxxxxxxxxxxx> wrote:

> Hi,
> I was recently looking into a case which essentially looked like this:
>   1. virsh shutdown guest
>   2. after <1 second the qemu process was gone from /proc/
>   3. but libvirt spun in virProcessKillPainfully because the process
>      was still reachable via signals
>   4. virProcessKillPainfully eventually fails after 15 seconds and the
>      guest stays in "in shutdown" state forever
> 
> This is not one of the common cases I've found for
> virProcessKillPainfully to break:
> - bad I/O e.g. NFS gets qemu stuck
> - CPU overload stalls things to death
> - qemu not being reaped (by init)
> All of the above would have the process still available in /proc/<pid>
> as Zombie or in uninterruptible sleep, but that is not true in my case.
> 
> It turned out that the case depended on the number of hostdev resources
> passed to the guest. Debugging showed that with 8 GPUs, and more likely with
> 16, it took ~18 seconds from SIGTERM until the process was "no longer
> reachable with signal 0". I haven't run many more tests and stayed with the
> 16 GPU case, but I'm rather sure more devices might make it take even longer.

If it's dependent on device assignment, then it's probably either
related to unmapping DMA or resetting devices.  The former should scale
with the size of the VM, not the number of devices attached.  The
latter could increase with each device.  Typically with physical GPUs
we don't have a function level reset mechanism, so we need to do a
secondary bus reset on the upstream bridge to reset the device; this
requires a 1s delay to let the bus settle after reset.  So if we're
gated by these sorts of resets, your scaling doesn't sound
unreasonable, though I'm not sure how these factor into the process
state you're seeing.  I'd also be surprised if you have a system that
can host 16 physical GPUs, so maybe this is a vGPU example?  Any mdev
device should provide a reset callback for roughly the equivalent of a
function level reset.  Implementation of such a reset would be vendor
specific.  Thanks,

Alex
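
As a back-of-the-envelope check on the scaling argument above, the sketch
below multiplies the device count by the 1s settle delay. It assumes that
every assigned device needs its own secondary bus reset and that the resets
run strictly one after another; neither is confirmed in this thread, and the
function name estimated_reset_seconds is made up for illustration.

```python
def estimated_reset_seconds(n_devices: int, settle_delay_s: float = 1.0) -> float:
    """Rough lower bound on time spent in bus-reset settle delays.

    Assumes one secondary bus reset per device, performed serially.
    """
    if n_devices < 0:
        raise ValueError("n_devices must be non-negative")
    return n_devices * settle_delay_s


for n in (8, 16):
    print(f"{n} devices: >= {estimated_reset_seconds(n):.0f} s in reset delays")
```

Under those assumptions 16 devices account for at least 16 s, which is close
to the ~18 seconds reported and already past the 15 second limit.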
