Libvirt recently introduced a change to the way it does 'save to file' with QEMU. Historically, QEMU has had a 32MB/s I/O limit on migration by default. When saving to file, we didn't want any artificial limit, but rather to max out the underlying storage. So when doing save to file, we set a very large bandwidth limit (INT64_MAX / (1024 * 1024)) so it is effectively unlimited.

After doing this, we discovered that the QEMU monitor was becoming entirely blocked. It did not even return from the 'migrate' command until migration was complete, despite the 'detach' flag being set. This was a bug in libvirt, because we passed a plain file descriptor, which does not support EAGAIN. Thank you POSIX.

Libvirt has another mode where it uses an I/O helper command to get O_DIRECT, and in this mode we pass a pipe() FD to QEMU. After ensuring that this pipe FD really does have O_NONBLOCK set, we still saw some odd behaviour. I'm not sure whether what I describe can necessarily be called a QEMU bug, but I wanted to raise it for discussion anyway....

The sequence of steps is:

 - libvirt sets the QEMU migration bandwidth to "unlimited"
 - libvirt opens a pipe() and sets O_NONBLOCK on the write end
 - libvirt spawns libvirt-iohelper, giving it the target file on disk and the read end of the pipe
 - libvirt runs the 'getfd migfile' monitor command to give QEMU the write end of the pipe
 - libvirt runs 'migrate fd:migfile -d' to start the migration
 - In parallel:
     - QEMU writes to the pipe (which is non-blocking)
     - libvirt-iohelper reads from the pipe & writes to disk with O_DIRECT

The initial 'migrate' command detaches into the background OK, and libvirt can enter its loop doing 'query-migrate' frequently to monitor progress. Initially this works fine, but at some points during the migration QEMU gets "stuck" for a very long time and does not respond to the monitor (or indeed the mainloop at all). These blackouts are anywhere from 10 to 20 seconds long.

Using a combination of systemtap, gdb and strace I managed to determine the following:

 - Most of the qemu_savevm_state_iterate() calls complete in 10-20 ms
 - Reasonably often, a qemu_savevm_state_iterate() call takes 300-400 ms
 - Fairly rarely, a qemu_savevm_state_iterate() call takes 10-20 *seconds*
 - I can see EAGAIN from the FD QEMU is migrating to - hence most of the iterations are quite short
 - In the 10-20 second long calls, no EAGAIN is seen for the entire period
 - The host OS in general is fairly "laggy", presumably due to the high rate of direct I/O being performed by the I/O helper, and bad scheduler tuning

IIUC, there are two things which will cause a qemu_savevm_state_iterate() call to return:

 - Getting EAGAIN on the migration FD
 - Hitting the max bandwidth limit

We have set effectively unlimited bandwidth, so everything relies on the EAGAIN behaviour. If the OS is doing a good job of scheduling processes & I/O, this seems to work reasonably well. If the OS is highly loaded and becoming less responsive at scheduling apps, QEMU gets itself into a bad way.

What I think is happening is that the OS is giving too much time to the I/O helper process that is reading the other end of the pipe given to QEMU and then doing the O_DIRECT writes to disk. Thus, in the shorter-than-normal windows of time when QEMU itself is scheduled by the OS, the pipe is fairly empty, so QEMU does not see EAGAIN for a very long period of wallclock time. So we get into a case where QEMU sees 10-20 second gaps between iterations of the main loop.
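To make the failure mode concrete, here is a deliberately simplified sketch of what I believe the iteration logic amounts to - this is not QEMU's actual code, and the function name and parameters are made up. The only two exit paths are EAGAIN on the migration FD and the bandwidth cap, so with the cap set to effectively infinity, a reader that keeps the pipe drained can keep a single iteration running for tens of seconds of wallclock time:

/* Deliberately simplified sketch - NOT QEMU's actual code.  The only
 * ways this "iteration" returns to the main loop are seeing EAGAIN on
 * the (non-blocking) migration FD, or hitting the bandwidth cap. */
#include <errno.h>
#include <stdint.h>
#include <unistd.h>

static int save_iterate_sketch(int migfd, const char *pages, size_t page_size,
                               size_t npages, uint64_t max_bytes)
{
    uint64_t sent = 0;
    size_t i;

    for (i = 0; i < npages; i++) {
        ssize_t r = write(migfd, pages + i * page_size, page_size);

        if (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            return 0;   /* pipe full - yield back to the main loop */
        }
        if (r < 0) {
            return -1;  /* genuine I/O error */
        }

        sent += (uint64_t)r;
        if (sent >= max_bytes) {
            return 0;   /* bandwidth cap reached - yield */
        }
        /* Otherwise keep writing.  With max_bytes effectively infinite
         * and the reader draining the pipe fast enough that write()
         * never sees EAGAIN, this loop monopolises the main loop for
         * the whole duration of the iteration. */
    }
    return 1;           /* everything sent - iteration complete */
}

(Partial writes and error handling are elided for brevity.)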
Having a non-infinite max-bandwidth for the migration would likely mitigate this to some extent, but I still think it would be possible to get QEMU into these pathological conditions when the host is under high load.

Is this a scenario we need to worry about for QEMU? On the one hand, it seems like a rare edge case in OS behaviour overall. On the other hand, the times when a host is highly loaded and non-responsive are exactly the times when a mgmt app might want to save a guest to disk, or migrate it elsewhere - which means we need QEMU to behave as well as possible in these adverse conditions.

Thus, should we consider having an absolute bound on the execution time of qemu_savevm_state_iterate(), independent of EAGAIN & bandwidth limits, to ensure the main loop doesn't get starved (a rough sketch of what I mean is appended below my signature)? Or perhaps moving migration to a separate thread, out of the mainloop, is what we need to strive for?

Regards,
Daniel

-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
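P.S. For concreteness, here is a rough sketch of the kind of per-iteration wallclock bound I have in mind. This is purely illustrative - the names, structure and the 50ms figure are all invented, not QEMU's real internals or a proposed patch - but it shows the idea: yield back to the main loop not only on EAGAIN or the bandwidth cap, but also once a fixed slice of wallclock time has been used up:

/* Illustrative only - invented names, not QEMU internals. */
#include <errno.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical per-iteration time slice: 50ms of wallclock time. */
#define MAX_ITERATE_NS (50 * 1000 * 1000LL)

static int64_t now_ns(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static int save_iterate_bounded(int migfd, const char *pages, size_t page_size,
                                size_t npages, uint64_t max_bytes)
{
    int64_t deadline = now_ns() + MAX_ITERATE_NS;
    uint64_t sent = 0;
    size_t i;

    for (i = 0; i < npages; i++) {
        ssize_t r = write(migfd, pages + i * page_size, page_size);

        if (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            return 0;   /* pipe full - yield */
        }
        if (r < 0) {
            return -1;  /* genuine I/O error */
        }

        sent += (uint64_t)r;
        if (sent >= max_bytes || now_ns() >= deadline) {
            return 0;   /* bandwidth cap or time slice exhausted - yield */
        }
    }
    return 1;           /* everything sent */
}

That still wouldn't make migration truly asynchronous - only moving it out of the mainloop into its own thread would do that - but it would at least put a hard cap on how long the monitor can be starved.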