[RFC][PATCH] Fix a deadlock in bi-directional p2p concurrent migration.

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



If you try to execute two concurrent migrations p2p
from A->B and B->A, the two libvirtd's will deadlock
trying to perform the migrations.  The reason for this is
that in p2p migration, the libvirtd's are responsible for
making the RPC Prepare, Migrate, and Finish calls.  However,
they are currently holding the driver lock while doing so,
which basically guarantees deadlock in this scenario.

This patch fixes the situation by adding
qemuDomainObjEnterRemoteWithDriver and
qemuDomainObjExitRemoteWithDriver helper methods.  The Enter
take an additional object reference, then drops both the
domain object lock and the driver lock.  The Exit takes
both the driver and domain object lock, then drops the
reference.  Adding calls to these Enter and Exit helpers
around remote calls in the various migration methods
seems to fix the problem for me in testing.

I *believe* that this makes the situation safe, though
I'm not 100% certain.  The additional domain object reference
ensures that the domain object won't disappear while this
operation is happening.  The BeginJob that is called inside
of qemudDomainMigratePerform ensures that we can't execute a
second migrate (or shutdown, or save, etc) job while the
migration is active.  That being said, my understanding of the
locking is still a bit shaky, so I'd like some additional
confirmation that this is indeed safe.

Signed-off-by: Chris Lalancette <clalance@xxxxxxxxxx>
---
 src/qemu/qemu_driver.c |   39 ++++++++++++++++++++++++++++++++-------
 1 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 5f2607c..e81af84 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -531,6 +531,21 @@ static void qemuDomainObjExitMonitorWithDriver(struct qemud_driver *driver, virD
     }
 }
 
+static void qemuDomainObjEnterRemoteWithDriver(struct qemud_driver *driver,
+                                               virDomainObjPtr obj)
+{
+    virDomainObjRef(obj);
+    virDomainObjUnlock(obj);
+    qemuDriverUnlock(driver);
+}
+
+static void qemuDomainObjExitRemoteWithDriver(struct qemud_driver *driver,
+                                              virDomainObjPtr obj)
+{
+    qemuDriverLock(driver);
+    virDomainObjLock(obj);
+    virDomainObjUnref(obj);
+}
 
 static int qemuCgroupControllerActive(struct qemud_driver *driver,
                                       int controller)
@@ -10753,9 +10768,11 @@ static int doTunnelMigrate(virDomainPtr dom,
         /* virStreamNew only fails on OOM, and it reports the error itself */
         goto cleanup;
 
+    qemuDomainObjEnterRemoteWithDriver(driver, vm);
     internalret = dconn->driver->domainMigratePrepareTunnel(dconn, st,
                                                             flags, dname,
                                                             resource, dom_xml);
+    qemuDomainObjExitRemoteWithDriver(driver, vm);
 
     if (internalret < 0)
         /* domainMigratePrepareTunnel sets the error for us */
@@ -10833,8 +10850,10 @@ cancel:
 
 finish:
     dname = dname ? dname : dom->name;
+    qemuDomainObjEnterRemoteWithDriver(driver, vm);
     ddomain = dconn->driver->domainMigrateFinish2
         (dconn, dname, NULL, 0, uri, flags, retval);
+    qemuDomainObjExitRemoteWithDriver(driver, vm);
 
 cleanup:
     if (client_sock != -1)
@@ -10873,16 +10892,20 @@ static int doNonTunnelMigrate(virDomainPtr dom,
     virDomainPtr ddomain = NULL;
     int retval = -1;
     char *uri_out = NULL;
+    int rc;
 
+    qemuDomainObjEnterRemoteWithDriver(driver, vm);
     /* NB we don't pass 'uri' into this, since that's the libvirtd
      * URI in this context - so we let dest pick it */
-    if (dconn->driver->domainMigratePrepare2(dconn,
-                                             NULL, /* cookie */
-                                             0, /* cookielen */
-                                             NULL, /* uri */
-                                             &uri_out,
-                                             flags, dname,
-                                             resource, dom_xml) < 0)
+    rc = dconn->driver->domainMigratePrepare2(dconn,
+                                              NULL, /* cookie */
+                                              0, /* cookielen */
+                                              NULL, /* uri */
+                                              &uri_out,
+                                              flags, dname,
+                                              resource, dom_xml);
+    qemuDomainObjExitRemoteWithDriver(driver, vm);
+    if (rc < 0)
         /* domainMigratePrepare2 sets the error for us */
         goto cleanup;
 
@@ -10899,8 +10922,10 @@ static int doNonTunnelMigrate(virDomainPtr dom,
 
 finish:
     dname = dname ? dname : dom->name;
+    qemuDomainObjEnterRemoteWithDriver(driver, vm);
     ddomain = dconn->driver->domainMigrateFinish2
         (dconn, dname, NULL, 0, uri_out, flags, retval);
+    qemuDomainObjExitRemoteWithDriver(driver, vm);
 
     if (ddomain)
         virUnrefDomain(ddomain);
-- 
1.7.1.1

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list


[Index of Archives]     [Virt Tools]     [Libvirt Users]     [Lib OS Info]     [Fedora Users]     [Fedora Desktop]     [Fedora SELinux]     [Big List of Linux Books]     [Yosemite News]     [KDE Users]     [Fedora Tools]