On Fri, 3 Aug 2018 08:29:39 +0200
Christian Ehrhardt <christian.ehrhardt@xxxxxxxxxxxxx> wrote:

> Hi,
> I was recently looking into a case which essentially looked like this:
> 1. virsh shutdown guest
> 2. after <1 second the qemu process was gone from /proc/
> 3. but libvirt spun in virProcessKillPainfully because the process
> was still reachable via signals
> 4. virProcessKillPainfully eventually fails after 15 seconds and the
> guest stays in "in shutdown" state forever
>
> This is not one of the common cases I've found for
> virProcessKillPainfully to break:
> - bad I/O e.g. NFS gets qemu stuck
> - CPU overload stalls things to death
> - qemu not being reaped (by init)
> All of the above would have the process still available in /proc/<pid>
> as Zombie or in uninterruptible sleep, but that is not true in my case.
>
> It turned out that the case was dependent on the amount of hostdev resources
> passed to the guest. Debugging showed that with 8 and more likely 16 GPUs
> passed it took ~18 seconds from SIGTERM to "no more be reachable with signal 0".
> I haven't conducted much more tests but stayed on the 16 GPU case, but
> I'm rather sure more devices might make it take even longer.

If it's dependent on device assignment, then it's probably either
related to unmapping DMA or resetting devices. The former should scale
with the size of the VM, not the number of devices attached. The latter
could increase with each device. Typically with physical GPUs we don't
have a function level reset mechanism so we need to do a secondary bus
reset on the upstream bridge to reset the device, this requires a 1s
delay to let the bus settle after reset. So if we're gated by these
sorts of resets, your scaling doesn't sound unreasonable, though I'm not
sure how these factor into the process state you're seeing.

I'd also be surprised if you have a system that can host 16 physical
GPUs, so maybe this is a vGPU example?
Any mdev device should provide a reset callback for roughly the
equivalent of a function level reset. Implementation of such a reset
would be vendor specific.

Thanks,
Alex
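
To make the "still reachable via signals" condition from the quoted report concrete, here is a minimal Python sketch of the general pattern: send SIGTERM, then probe with signal 0 until the kernel reports that the pid no longer exists or a timeout expires. This is a simplified illustration and not libvirt's actual virProcessKillPainfully code. The function name `kill_and_wait`, its parameters, and the polling interval are hypothetical; only the 15 second limit comes from the report above.

```python
import os
import signal
import time


def kill_and_wait(pid, timeout=15.0, interval=0.2,
                  kill=os.kill, sleep=time.sleep, clock=time.monotonic):
    """Send SIGTERM to pid, then poll with signal 0 until it is gone.

    Returns the number of seconds waited, or None if the pid was still
    reachable when the timeout expired. The kill/sleep/clock arguments
    are injectable so the logic can be exercised without a real process.
    """
    kill(pid, signal.SIGTERM)
    start = clock()
    while True:
        try:
            # Signal 0 delivers nothing; it only checks that the pid
            # exists and that we are allowed to signal it.
            kill(pid, 0)
        except ProcessLookupError:
            # ESRCH: the kernel no longer knows this pid.
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

With a 15 second limit, a teardown that takes ~18 seconds (as measured with 16 GPUs) would make this function return None even though the process does eventually exit, which matches the symptom described in the report.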