Re: [RFC 0/2] Fix detection of slow guest shutdown

Daniel P. Berrangé <berrange@xxxxxxxxxx> · Mon, 6 Aug 2018 09:46:58 +0100

On Mon, Aug 06, 2018 at 07:20:10AM +0200, Christian Ehrhardt wrote:
> In that case I wonder what the libvirt community thinks of the proposed
> general "Pid is gone means we can assume it is dead" approach?

The key thing with the shutdown process is that we use the disappearance of
the PID as the flag to indicate that it is safe to release any resources that
the PID was using, e.g. the hostdevs are now available for another guest to use.

I'd be concerned that if we use /proc/$PID going away as the flag, then
we would be releasing the hostdevs for reuse before the kernel has cleaned
them up. In the best case this would result in a 2nd guest failing to start
because the device was still in use; in the worst case we could crash
the entire host (though I'd be hopeful vfio prevents that).

> An alternative would be to understand on the Kernel side why the PID is
> gone "too early" and fix that so it stays until fully cleaned up.
> But even then on the Libvirt side we would need the extended timeout values.

Yeah, looks like extended timeouts are unavoidable. The only real optimization
would be to pass an explicit timeout to the kill method, increasing it by 2
seconds for each hostdev that is assigned. That way we'll scale the timeout
up as we need, so we don't have to predict the worst case number of assigned
devices.
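
As a rough sketch of that scaling (not actual libvirt code: the function
name and the 15 second base wait are hypothetical, only the 2 seconds per
hostdev comes from the suggestion above), the caller would compute the
timeout from the number of assigned hostdevs and hand it to the kill method:

```c
#include <stddef.h>

/* Assumed base wait with no hostdevs assigned; illustrative value only. */
#define BASE_KILL_TIMEOUT_SECS 15U
/* Extra wait per assigned hostdev, as suggested above. */
#define EXTRA_SECS_PER_HOSTDEV 2U

/*
 * Hypothetical helper: how many seconds to wait for the PID to
 * disappear, given the number of hostdevs assigned to the guest.
 */
static unsigned int
kill_timeout_secs(size_t nhostdevs)
{
    return BASE_KILL_TIMEOUT_SECS +
           EXTRA_SECS_PER_HOSTDEV * (unsigned int)nhostdevs;
}
```

With these assumed values a guest with no hostdevs keeps the base wait,
while one with 4 hostdevs would get 8 extra seconds.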

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
