On 07/09/2012 02:00 PM, Anthony Liguori wrote: >> with the fd:name approach, the sequence is: >> >> libvirt calls getfd:name1 over normal monitor >> qemu responds >> libvirt calls getfd:name2 over normal monitor >> qemu responds >> libvirt calls transaction around blockdev-snapshot-sync over normal >> monitor, using fd:name1 and fd:name2 >> qemu responds This general layout is true whether we rewrite all commands to understand fd:nnn (proposal 1) or whether we add new magic parsing (/dev/fd/nnn of proposal 3, or even /dev/fdset/nnn of proposal 5), all as called out in these messages: https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg00227.html https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg01098.html >> >> but with -open-hook-fd, the approach would be: >> >> libvirt calls transaction >> qemu calls open(file1) over hook >> libvirt responds >> qemu calls open(file2) over hook >> libvirt responds >> qemu responds to the original transaction whereas this approach is quite different in semantics, but may indeed be easier for qemu to implement, at the expense of some more complexity on the part of libvirt. At the high level, I think both approaches have one thing in common: by refactoring all qemu code to go through qemu_open(), we can then implement our desired complexity (whether fd:nn, /dev/fd/nnn, /dev/fdset/nnn, or some other magic name parsing; or whether it is an rpc call over a second socket in parallel to the monitor socket) in just one location. Likewise, both approaches have to deal with libvirtd restarts (magic name parsing by changing an 'inuse' flag when the monitor detects EOF; rpc passing by failing a qemu_open() when the rpc socket detects EOF). >> >> The 'transaction' operation is thus blocked by the time it takes to do >> two intermediate opens over a second channel, which kind of defeats the >> purpose of making the transaction take effect with minimal guest >> downtime. > > How are you defining "guest down time"? > > It's important to note that code running in QEMU does not equate to > guest visible down time unless QEMU does an explicit vm_stop() which is > not happening here. > > Instead, a VCPU may become blocked *if* it attempts to acquire qemu_mute > while QEMU is holding it. > > If your concern is qemu_mutex being held while waiting for libvirt, it > would be fairly easy to implement a qemu_open_async() that dropped > allowed dropping back to the main loop and then calling a callback when > the open completes. > > It would be pretty trivial to convert qmp_transaction to use such a > command. In other words, remembering that transactions are divided into phases: phase 1 - prepare: obtain all needed fds (whether by pre-opening them via 'pass-fd' or other new 'getfd' relative, or whether by RPC calls); no guest downtime, and with cleanup that avoids any leaks on any failures phase 2 - commit: flush all devices and actually make the changes in qemu state to use the fds obtained in phase 1 and where the guest downtime (if any) is more likely due to flushing changes in phase 2 > > But this is all speculative. There's no reason to believe that an RPC > would have a noticable guest visible latency unless you assume there's > lot contention. I would strongly suspect that the bdrv_flush() is going > to be a much greater source of lock contention than the RPC would be. > An RPC is only bound by scheduler latency whereas synchronous disk I/O > is bound spinning a platter. > >> And libvirt code becomes a lot trickier to deal with the fact >> that two channels are in use, and that the channel that issued the >> 'transaction' command must block while the other channel for handling >> hooks must be responsive. > > All libvirt needs to do is listen on a socket and delegate access > according to a white list. Whatever is providing fd's needs to have no > knowledge of anythign other than what the guest is allowed to access > which shouldn't depend on an executing command. That's not quite accurate. What the guest is allowed to access should indeed change depending on the executing command. That is, if I start a guest with: base <- delta then I only want to permet O_RDONLY access to base but O_RDWR access to delta. If I then call 'blockdev-snapshot-sync', I want to change to the situation: base <- delta <- snap and give O_RDWR permissions to snap; it would also be nice if qemu attempts to reopen delta with O_RDONLY permissions (although from a trust perspective, libvirt must assume that delta is still O_RDWR because qemu may have been compromised and lie about the tightening of permissions); at any rate, depending on SELinux capabilities of the file, libvirt may be able to enforce no further writes to 'delta' by toggling a SELinux label (obviously, this should only be done after 'blockdev-snapshot-sync' completes). On the other hand, the user could decide to do a 'block-commit', to squash things into: base where base is now O_RDWR. But libvirt doesn't want to grant write-access to 'base' up-front, so the whitelist allowing O_RDWR access to base is indeed dependent on the user running a monitor command that would cause qemu to need to change to a more permissive access in the first place. Maybe the only reason that I'm still leaning towards a 'pass-fd' solution instead of a hook fd solution is that libvirt would have less code to write to get it working. But it was originally Dan's complaint that an rpc solution has too much risk for deadlock or leaks; if we are confident that we can avoid deadlock, and that the idea of passing in fds in response to an rpc involves less bookkeeping and speculation on libvirt's part about which monitor commands will require opening an fd, then maybe it really is a better technical solution, even if it costs more libvirt code to implement. -- Eric Blake eblake@xxxxxxxxxx +1-919-301-3266 Libvirt virtualization library http://libvirt.org
Attachment:
signature.asc
Description: OpenPGP digital signature
-- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list