Hi, On Wed, Apr 30, 2014 at 08:11:38PM +0300, Uri Lublin wrote: > Hi Christophe, > > I don't understand what exactly goes wrong and how this > patch fixes the race. > > > On 04/01/2014 04:58 PM, Christophe Fergeau wrote: > >During seamless migration, after switching host, if a client was connected > >during the migration, it will have data to send back to the new > >qemu/spice-server instance. This is handled through MIGRATE_DATA messages. > >SPICE char devices use such MIGRATE_DATA messages to restore their state. > > > >However, the MIGRATE_DATA message can arrive any time after the new qemu > >instance has started, this can happen before or after the SPICE char > >devices have been created. In order to handle this, if the migrate data > >arrives early, it's stored in reds->agent_state.mig_data, and > >attach_to_red_agent() will restore the agent state as appropriate. > > Adding more information -- I think the race here is between qemu-kvm > migrating the virtio-device state (with main/migration thread) vs spice > migrating > "agent" channel state. What do you call 'spice migrating "agent"" channel state" ? Is it the "MIGRATE DATA" spice message I I mention or something else? > > The virtio-device state load function, calls (if agent service is running in > the guest) > virtio_serial_load -> > qemu_char_fe_open -> > spice_chr_guest_open -> > vmc_register_interface -> > qemu_spice_add_interface -> > spice_server_add_interface -> > spice_server_char_device_add_interface -> > attach_to_red_agent > > >Unfortunately this does not work as expected as expected. If > >attach_to_red_agent() is called before the MIGRATE_DATA message reaches the > >server, all goes well, but if MIGRATE_DATA reaches the server before > >attach_to_red_agent() gets called, then some assert() get triggered in > >spice_char_device_state_restore(): > > > >((null):32507): Spice-ERROR **: char_device.c:937:spice_char_device_state_restore: assertion `dev->num_clients == 1 && dev->wait_for_migrate_data' failed > >Thread 3 (Thread 0x7f406b543700 (LWP 32543)): > >Thread 2 (Thread 0x7f40697ff700 (LWP 32586)): > >Thread 1 (Thread 0x7f4079b45a40 (LWP 32507)): > > If we split this into 2 asserts, one for num_clients and the other to > wait_for_migrate_data, we'll know which one's at fault. Both were at fault. (dev->num_client is 0 and dev->wait_for_migrate_data is FALSE). > > >What happens is that dev->wait_for_migrate_data is unconditionally cleared when > >completing handling of the MIGRATE_DATA message, so it's FALSE when > >spice_char_device_state_restore() is called. > > dev->wait_for_migrate_data is cleared upon > client_remove/device_reset/state_restore and in state_restore it's > only after it passed the assert mentioned above. The race happens when we receive a MIGRATE_DATA message from the client before the codepath you describe above is hit. reds_handle_migrate_data() will get called, and this will clear dev->wait_for_migrate_data. I forgot which codepaths were triggered though when the bug occurs, but this ends up clearing wait_for_migrate_data. I can look closer at this codepath if you want after reproducing. > > > Moreover, dev->num_clients is > >also 0 as this is only increased by spice_char_device_start() which > >spice_server_char_device_add_interface() calls after > >attach_to_red_agent()/spice_char_device_state_restore() have been called. > > dev->num_clients is set to 1 in spice_char_device_client_add() called > from spicevmc_connect. So as long as the client did not disconnect > it should be good. > With seamless migration that happens even before (qemu-kvm) > migration starts. Once again I will have to check exactly what's going on as dev->num_clients is definitely 0 when attach_to_red_agent() gets called. > > >This commit changes the logic in spice_server_char_device_add_interface(), > >and when there is migrate data pending in reds->agent_state.mig_data, we > >only attempt to restore it after successfully initializing/starting the > >needed char device (which will ensure that dev->num_clients is not 0). > > I think that if the client is connected we should have dev->num_clients set > to 1. Ah, if this is the real bug, I'll have to look again at all that ;) When I hit the bug, there is no client attached when attach_to_red_agent is called. Christophe _______________________________________________ Spice-devel mailing list Spice-devel@xxxxxxxxxxxxxxxxxxxxx http://lists.freedesktop.org/mailman/listinfo/spice-devel