On Wed, May 06, 2015 at 10:27:50AM -0600, Eric Blake wrote: > On 05/06/2015 02:46 AM, Michal Privoznik wrote: > > https://bugzilla.redhat.com/show_bug.cgi?id=890648 > > > > So, imagine you've issued an API that involves guest agent. For > > instance, you want to query guest's IP addresses. So the API acquires > > QUERY_JOB, locks the guest agent and issues the agemt command. > > s/agemt/agent/ > > > However, for some reason, guest agent replies to initial ping > > correctly, but then crashes tragically while executing real command > > (in this case guest-network-get-interfaces). Since initial ping went > > well, libvirt thinks guest agent is accessible and awaits reply to the > > real command. But it will never come. What will is a monitor event. > > The monitor event is telling you that your side of the socket is still > valid, but the guest's side of the agenet is no longer connected. > > > Our handler (processSerialChangedEvent) will try to acquire > > MODIFY_JOB, which will fail obviously because the other thread that's > > executing the API already holds a job. So the event handler exits > > early, and the QUERY_JOB is never released nor ended. > > > > I don't understand why, but qemu should in fact instead of sending a > > event on the monitor close the socket. > > Reads awkwardly; I think you were trying to say: > > qemu should in fact be closing the socket instead of sending an event > > But that doesn't make sense technically. The fact that the agent is no > longer open in the guest is no reason to close the monitor socket, and > closing the agent socket is inappropriate in case the guest restarts the > guest agent and reopens the port. Thus, listening to events on the > monitor socket about whether the agent socket is connected or dangling > is sufficient for us to know if the agent will be responsive or not. > > > But leave that for a different > > discussion. Since qemu is not doing that, we must work around it and > > close and unregister the agent's FD from the even loop ourselves. This > > will wake up all callers stuck in waiting for agent's reply. > > Why do we have to kill the agent's fd? If the guest restarts the agent, > how will we reconnect to it if we have killed our agent fd? > > > Hopefully, it will make them finish the job too and even handler can > > acquire its job later. > > I agree that we want to detect agent death events as a reason to mark > the agent unresponsive, in such a way that any other job pending on the > agent then gracefully aborts, but I'm not sure that this is the right > solution. Agreed, I think it makes more sense to just mark agentError = true when the agent is not running. Or have an bool agentRunning flag to let us report clearer error message. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :| -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list