On Mon, Oct 28, 2013 at 09:17:16AM -0400, John Ferlan wrote:
> On 10/28/2013 07:52 AM, Daniel P. Berrange wrote:
> > From: "Daniel P. Berrange" <berrange@xxxxxxxxxx>
> >
> > The following sequence
> >
> >   1. Define a persistent QMEU guest
>
> s/QMEU/QEMU
>
> >   2. Start the QEMU guest
> >   3. Stop libvirtd
> >   4. Kill the QEMU process
> >   5. Start libvirtd
> >   6. List persistent guets
>
> s/guets/guests
>
> >
> > At the last step, the previously running persistent guest
> > will be missing. This is because of a race condition in the
> > QEMU driver startup code. It does
> >
> >   1. Load all VM state files
> >   2. Spawn thread to reconnect to each VM
> >   3. Load all VM config files
> >
> > Only at the end of step 3 does the 'virDomainObjPtr' get
> > marked as "persistent". There is therefore a window where
> > the thread reconnecting to the VM will remove the persistent
> > VM from the list.
> >
> > The easy fix is to simply switch the order of steps 2 & 3.
> >
> > Signed-off-by: Daniel P. Berrange <berrange@xxxxxxxxxx>
> > ---
> >  src/qemu/qemu_driver.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
>
> The fix seems reasonable, although I immediately wondered why "at some
> time" it was considered OK to reconnect before the persistent flag was
> set for inactive guests. The condition/fault initially described
> includes a host reboot in the processing. In that case (I would assume)
> the restart of the guest would occur if autostart was set. The external
> action of kill()'ing a guest outside of libvirtd's control results in
> some unknown/unpredictable state for the guest. Is there something in
> that initial load that could detect this condition better? I tried
> following the steps without the patch, but on my host the guest was
> listed after the restart - so yes a timing condition - but what causes
> that timing condition?
> Would setting dom->persistent before the virObjectUnlock(dom) in
> virDomainObjListLoadAllConfigs() change the results?

No, I tried that and it didn't help. None the less, that should be fixed
separately.

> Beyond that, keeping the virConnectOpen() and qemuProcessReconnectAll()
> "together" after the loading of the inactive persistent configs seems
> to keep the code flow more normal. Whether that comes before or after
> the Snapshot/ManagedSave load is, I suppose, just an "implementation
> detail".
>
> Also, other drivers follow the same order: load running, reconnect,
> then load inactive/persistent configs. Should those have similar
> patches as well?

Historically it was safe because reconnecting to QEMU was serialized. At
some point we switched to using a separate thread for reconnecting to
QEMU, which introduced the race. The other drivers are still serialized,
IIUC.

Daniel
--
|: http://berrange.com  -o-  http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org :|
|: http://autobuild.org  -o-  http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org  -o-  http://live.gnome.org/gtk-vnc :|

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list