Re: [PATCH] migration: Don't assert() if MIGRATE_DATA comes before attaching the agent

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Christophe,

I don't understand what exactly goes wrong and how this
patch fixes the race.


On 04/01/2014 04:58 PM, Christophe Fergeau wrote:
During seamless migration, after switching host, if a client was connected
during the migration, it will have data to send back to the new
qemu/spice-server instance. This is handled through MIGRATE_DATA messages.
SPICE char devices use such MIGRATE_DATA messages to restore their state.

However, the MIGRATE_DATA message can arrive any time after the new qemu
instance has started, this can happen before or after the SPICE char
devices have been created. In order to handle this, if the migrate data
arrives early, it's stored in reds->agent_state.mig_data, and
attach_to_red_agent() will restore the agent state as appropriate.

Adding more information -- I think the race here is between qemu-kvm
migrating the virtio-device state (with main/migration thread) vs spice migrating
"agent" channel state.

The virtio-device state load function, calls (if agent service is running in the guest)
  virtio_serial_load ->
   qemu_char_fe_open ->
     spice_chr_guest_open ->
       vmc_register_interface ->
         qemu_spice_add_interface ->
           spice_server_add_interface ->
             spice_server_char_device_add_interface ->
               attach_to_red_agent

Unfortunately this does not work as expected as expected. If
attach_to_red_agent() is called before the MIGRATE_DATA message reaches the
server, all goes well, but if MIGRATE_DATA reaches the server before
attach_to_red_agent() gets called, then some assert() get triggered in
spice_char_device_state_restore():

((null):32507): Spice-ERROR **: char_device.c:937:spice_char_device_state_restore: assertion `dev->num_clients == 1 && dev->wait_for_migrate_data' failed
Thread 3 (Thread 0x7f406b543700 (LWP 32543)):
Thread 2 (Thread 0x7f40697ff700 (LWP 32586)):
Thread 1 (Thread 0x7f4079b45a40 (LWP 32507)):

If we split this into 2 asserts, one for num_clients and the other to wait_for_migrate_data, we'll know which one's at fault.

What happens is that dev->wait_for_migrate_data is unconditionally cleared when
completing handling of the MIGRATE_DATA message, so it's FALSE when
spice_char_device_state_restore() is called.

dev->wait_for_migrate_data is cleared upon
client_remove/device_reset/state_restore and in state_restore it's
only after it passed the assert mentioned above.

  Moreover, dev->num_clients is
also 0 as this is only increased by spice_char_device_start() which
spice_server_char_device_add_interface() calls after
attach_to_red_agent()/spice_char_device_state_restore() have been called.

dev->num_clients is set to 1 in spice_char_device_client_add() called
from  spicevmc_connect. So as long as the client did not disconnect
it should be good.
With seamless migration that happens even before (qemu-kvm)
migration starts.

This commit changes the logic in spice_server_char_device_add_interface(),
and when there is migrate data pending in reds->agent_state.mig_data, we
only attempt to restore it after successfully initializing/starting the
needed char device (which will ensure that dev->num_clients is not 0).

I think that if the client is connected we should have dev->num_clients set to 1.


It also changes attach_to_red_agent() to handle the case when
reds->agent_state.mig_data is non-NULL to be the same as the case when
red_channel_waits_for_migrate_data() is TRUE. This ensures that
spice_char_device_client_add() gets called and that 'wait_for_data' is set
in the added device.

I think each handles a different scenario in that race:
reds->agent_state.mig_data -- migration data has already arrived.
spice_char_waits_for_migarte_data -- we are waiting for migration data
to arrive.


This fixes https://bugzilla.redhat.com/show_bug.cgi?id=1035184
---
Changes since v1:
- Move block calling reds_agent_state_restore() before calling
   spice_char_device_start() as this should be safer. This also
   groups all SUBTYPE_xxx special-casing together. I've tested that
   things are still working as expected after this move.


  server/reds.c | 25 ++++++++++++++-----------
  1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/server/reds.c b/server/reds.c
index 0390602..bd4fea1 100644
--- a/server/reds.c
+++ b/server/reds.c
@@ -2863,16 +2863,7 @@ static SpiceCharDeviceState *attach_to_red_agent(SpiceCharDeviceInstance *sin)
      state->read_filter.discard_all = FALSE;
      reds->agent_state.plug_generation++;
- if (reds->agent_state.mig_data) {
-        spice_assert(reds->agent_state.plug_generation == 1);
-        reds_agent_state_restore(reds->agent_state.mig_data);
-        free(reds->agent_state.mig_data);
-        reds->agent_state.mig_data = NULL;
-    } else if (!red_channel_waits_for_migrate_data(&reds->main_channel->base)) {
-        /* we will assoicate the client with the char device, upon reds_on_main_agent_start,
-         * in response to MSGC_AGENT_START */
-        main_channel_push_agent_connected(reds->main_channel);
-    } else {
+    if (red_channel_waits_for_migrate_data(&reds->main_channel->base) || reds->agent_state.mig_data) {
         spice_debug("waiting for migration data");
          if (!spice_char_device_client_exists(reds->agent_state.base, reds_get_client())) {
              int client_added;
@@ -2889,9 +2880,13 @@ static SpiceCharDeviceState *attach_to_red_agent(SpiceCharDeviceInstance *sin)
                  spice_warning("failed to add client to agent");
                  reds_disconnect();
              }
-
          }
+    } else {
+        /* we will associate the client with the char device, upon reds_on_main_agent_start,
+         * in response to MSGC_AGENT_START */
+        main_channel_push_agent_connected(reds->main_channel);
      }
+
      return state->base;
  }
@@ -2987,6 +2982,13 @@ static int spice_server_char_device_add_interface(SpiceServer *s,
          } else {
              dev_state = spicevmc_device_connect(char_device, SPICE_CHANNEL_PORT);
          }
+    } else if (strcmp(char_device->subtype, SUBTYPE_VDAGENT) == 0) {
+        if (reds->agent_state.mig_data) {
+               spice_assert(reds->agent_state.plug_generation == 1);
+               reds_agent_state_restore(reds->agent_state.mig_data);
+               free(reds->agent_state.mig_data);
+               reds->agent_state.mig_data = NULL;
+        }
      }

Since this if statement started with
   if (strcmp(char_device->subtype, SUBTYPE_VDAGENT) == 0)
and was followed by else if statements the lines added here will not be reached.


Thanks,
    Uri.
_______________________________________________
Spice-devel mailing list
Spice-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/spice-devel




[Index of Archives]     [Linux ARM Kernel]     [Linux ARM]     [Linux Omap]     [Fedora ARM]     [IETF Annouce]     [Security]     [Bugtraq]     [Linux]     [Linux OMAP]     [Linux MIPS]     [ECOS]     [Asterisk Internet PBX]     [Linux API]     [Monitors]