On 09/29/2015 10:06 AM, Shivaprasad bhat wrote:

[...]

>> Perhaps I should clarify my query-migrate has no timeout comment... It
>> seems based on what I've read so far, the 'query-migrate' command
>> started successfully, because if it hadn't we would have received a
>> failure (as shown below). Thus libvirt has sent the command via the
>> monitor and is waiting for a response (e.g. the virCondWait after the
>> qemuMonitorSend in the trace below). The response isn't coming because
>> either "A" qemu didn't send it back or "B" libvirt missed it - that
>> should be determinable.
>>
>> There's a way to turn on debugging so the monitor dialog can be seen -
>> via changes to /etc/libvirt/libvirtd.conf. I use :
>>
>> log_level = 1
>> log_filters="3:remote 4:event 3:json 3:rpc"
>> log_outputs="1:file:/var/log/libvirt/libvirtd.log"
>>
>> But you may need to remove the "3:json" in order to see the dialog since
>> that's where it "feels like" the issue might be. Then start libvirtd in
>> the debugger again. Once it's hung - you should be able to scan (eg,
>> edit) the libvirtd.log file and search for the "query-migrate" command
>> being sent and then follow the copious output looking for the presence
>> of a returned command. If there is none, then something in qemu isn't
>> returning the failure correctly and it would need to be fixed there I
>> would think as opposed to throwing down the big hammer of closing the fd.
>
> Had a chance to run with your log settings. The query-migrate doesn't
> seem to have a corresponding "return" in the logs. So as you say, there
> may be a qemu bug that is not returning a response when the fd is still
> open (as libvirt didn't close it) but no read actually happening there.
> I felt qemu can't sense the failure as the fd is open, so posted this
> patch. Though, the qemu should return with the current state of
> migration as it sees instead of not returning at all. Hope we are on
> the same page.
>
> Thanks,
> Shivaprasad
>

Meant to respond yesterday but got wrapped up in other things.

OK - so at least now it makes a bit more sense why purely adding a
stream_abort didn't work - we're not getting a reply from the monitor.

So there are perhaps 3 ways to "resolve" this issue (that come to my mind):

1. As you've done with the close() in the error path of
   qemuMigrationIOFunc when the virStream{Send|Finish} fails. Although
   this does feel like a workaround, the tunnel is a libvirt-created
   thing that qemu isn't aware of, so it feels reasonable. It does make
   me wonder how qemu could be hung up, though. What would something
   like "virsh qemu-monitor-command $dom '{"execute":"query-migrate"}'"
   return when the source is hung? Or does it hang too?

2. Adding some sort of "timeout" logic in qemuMonitorSend (e.g.
   virCondWaitUntil instead of virCondWait) to handle the case where a
   command doesn't get a response. Not sure this is right either, since
   it's not clear to me that there is a "time" within which "all"
   commands are guaranteed to complete, especially async ones.

3. Dig into qemu to figure out why it's not returning anything for the
   query-migrate request. That's currently a bit beyond what I've done,
   but I believe it would require attaching to the running qemu process
   to see whether some thread is "stuck" somewhere, "waiting" on
   something that won't return because the stream closed.

Hopefully Jiri (or perhaps Daniel) could provide some other insights.

John
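
P.S. To make option 2 a bit more concrete, here is a rough Python sketch
of the "wait with a deadline" idea. It is not libvirt code: the Monitor
class, MonitorTimeout, deliver_reply() and the timeout values are all
made up for illustration, and a real change would be in C around
virCondWaitUntil in qemuMonitorSend. It only shows the shape of the
logic: wait on the condition until either a reply arrives or the
deadline passes, and report a timeout instead of blocking forever.

```python
import threading
import time


class MonitorTimeout(Exception):
    """Hypothetical error: no reply arrived before the deadline."""


class Monitor:
    """Toy stand-in for a monitor connection (illustration only)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._reply = None

    def deliver_reply(self, reply):
        # Would be called by the I/O thread when a reply is read.
        with self._cond:
            self._reply = reply
            self._cond.notify_all()

    def send(self, command, timeout):
        # Real code would write `command` to the monitor socket here.
        deadline = time.monotonic() + timeout
        with self._cond:
            self._reply = None
            # Loop to cope with spurious wakeups; recompute the time
            # left on every pass so the overall deadline is respected.
            while self._reply is None:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    raise MonitorTimeout(command)
                self._cond.wait(remaining)
            return self._reply
```

With something like this, a caller that never gets a reply sees a
MonitorTimeout after the chosen interval rather than hanging. The open
question from option 2 remains: what interval is safe for every command.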