The problem(s)
==============

The libvirtd architecture has evolved over time, initially as an expedient solution to the problem of managing virtual networks and QEMU processes, and over time it came to control all the other resources too. It is only avoided in the case of the stateless hypervisor drivers which talk to remote RPC systems (VMware ESX, Hyper-V, etc). We later introduced the concept of loadable modules, and separate daemons for locking and logging, because of the key requirement that the latter services be re-exec()able while VMs are running.

Despite the existence of virtlogd & virtlockd, the libvirtd daemon is clearly using the monolithic service model. This has a direct impact on both the reliability and security of libvirtd.

QEMU has the nice characteristic that since it is just a regular process, if one QEMU goes bad, the other QEMUs continue to operate normally. Libvirtd then throws away this advantage by introducing an architecture where if one QEMU goes bad, it can easily impact all other QEMU processes. This can either be due to libvirtd crashing, preventing management of all resources, or due to a rogue QEMU giving libvirtd so much work to do that other jobs get starved. When we first hit this we introduced multithreading inside libvirtd, which did help, but made life more complicated. We then saw bottlenecks on the QEMU driver level locks and had to switch to a lockless driver, with just the per-VM locks. We then also had to introduce the job concept, and later the async job concept, to allow APIs to complete while the monitor is being used. There are still concurrency problems in this area; for example, QMP event processing in the main thread can block other API calls and keepalives for arbitrary amounts of time.

It is worse though, because a problem in other areas of libvirtd related to storage, networking, node devices, and so on can also impact the ability to manage QEMU, and vice-versa. This is inherent in the monolithic design of libvirtd, where a single daemon does everything. There are hundreds of thousands of lines of complex code, and a single bug can impact everything inside libvirtd.

The monolithic model is bad for security too. Given the broad set of features supported by libvirtd, it is impossible to write any meaningful SELinux policy to lock down its capabilities, unless you're willing to simply block large feature sets. What is worse is that many of these features require root privileges, so libvirtd as a whole needs to run as root, and has no security confinement. Libvirtd meanwhile has to directly interact with non-trusted components such as the QEMU monitor console, so its security is paramount to preventing a malicious QEMU from escaping its confinement. To the best of my knowledge no one has tried to break out of QEMU by attacking libvirtd via QMP, but that's probably just because they've not told us.

The final problem with libvirtd is the split between system and session mode. We've long told people that session mode is for desktop virt and system mode is for server virt, but this simple explanation of roles fails in the real world. It has been a source of pain for libguestfs, for example, which wants to be able to simply run QEMU with the same rights as the application which invokes libguestfs. The system vs session distinction means it often hits problems where the app using libguestfs can read the disk file, but QEMU launched by libvirtd on libguestfs' behalf cannot read it.

Then there is the fact that with session mode, network connectivity is a disaster. We hacked around this by using a setuid helper, which lets the admin grant a user the ability to access a specific bridge device. The mgmt app, though, is locked out of all the virtual network management APIs with the session instance. The conceptual model here is really wrong. Just because you want to have the QEMU processes running under the unprivileged user doesn't imply that you want the network management APIs under the same user account. In retrospect, simply duplicating the privileged libvirtd functionality in a non-privileged libvirtd was a clear mistake. Some areas of functionality inherently require a privileged environment and should only ever have run inside the root libvirtd.
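(As a concrete illustration of the setuid helper workaround mentioned above, and assuming the stock QEMU bridge helper is what is in use: the admin grants access through a small ACL file, conventionally /etc/qemu/bridge.conf, which is read by the setuid qemu-bridge-helper binary. The path and bridge name below are illustrative and vary by distro.)

  # /etc/qemu/bridge.conf
  # Grant otherwise unprivileged QEMU processes the right to attach
  # a TAP device to this one specific bridge, via the setuid
  # qemu-bridge-helper binary.
  allow virbr0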
The solution(s)
===============

As noted above, we made some baby-steps towards a modular daemon architecture when we introduced virtlockd and virtlogd. It is now time to fully commit to a modular design and explode libvirtd into a swarm of daemons, each responsible for a clearly demarcated task. Such a decomposition would naturally fall across the internal driver boundaries, giving a virtnwfilterd, virtnetworkd, virtstoraged, virtnodedevd, etc.

We have to maintain compatibility with our existing client API implementation though. The libvirtd daemon would still have to accept connections from the client and route the RPC requests directly onto the modular daemons. We could also enhance the client API to directly know how to connect to the modular daemons, bypassing libvirtd. If we restricted the modular daemons to only concern themselves with local UNIX domain socket usage, we could then provide libvirtd as the bridge to remote TCP access, and for backcompat with legacy client library implementations.

  [app] -> [libvirt.so] -> [libvirtd]

becomes

  [app] -> [libvirt.so] -> [virthypervisord]
                        +> [virtnetworkd]
                        +> [virtstoraged]
                        ...etc

With this more modular design, we now have the flexibility to make non-root libvirt usage more usable in the real world. For example, desktop virt can now use a non-root virthypervisord to manage QEMU processes under the local user, but connect to the privileged virtnetworkd to see the network connectivity. The non-root virthypervisord would also talk to virtnetworkd to acquire a TAP device for the guest during startup, with the FD being passed back across the UNIX socket, as sketched below.

This gives us finer grained access control options, where we can selectively require the root password depending on the featureset the guest is requesting. For example, non-root libvirt could require the root password in order to acquire access to a vGPU device from the privileged virtnodedevd.
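The TAP device hand-off above is just standard FD passing over a UNIX domain socket using SCM_RIGHTS ancillary data. A minimal sketch of the mechanism in plain C follows; this is illustrative only, the function names are hypothetical, and the real exchange would of course be carried over libvirt's existing RPC protocol rather than a bare sendmsg():

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>
  #include <sys/un.h>

  /* virtnetworkd side: send one already-open FD (e.g. the TAP device)
   * across a connected UNIX domain socket as SCM_RIGHTS ancillary data. */
  int send_fd(int sock, int fd)
  {
      char byte = 0;
      struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
      char buf[CMSG_SPACE(sizeof(int))];
      struct msghdr msg = {
          .msg_iov = &iov, .msg_iovlen = 1,
          .msg_control = buf, .msg_controllen = sizeof(buf),
      };
      struct cmsghdr *cmsg;

      memset(buf, 0, sizeof(buf));
      cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

      return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
  }

  /* virthypervisord side: receive the FD; the kernel installs a new
   * descriptor in this process referring to the same TAP device. */
  int recv_fd(int sock)
  {
      char byte;
      struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
      char buf[CMSG_SPACE(sizeof(int))];
      struct msghdr msg = {
          .msg_iov = &iov, .msg_iovlen = 1,
          .msg_control = buf, .msg_controllen = sizeof(buf),
      };
      struct cmsghdr *cmsg;
      int fd = -1;

      if (recvmsg(sock, &msg, 0) < 0)
          return -1;

      for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
          if (cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS) {
              memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
              break;
          }
      }
      return fd;
  }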
The modular design also potentially unlocks the functionality of libvirt so that it can be used in isolation. For example, there are scenarios where a management application may wish to use the storage pools API to manage a pool of disk images, but doesn't need anything related to the hypervisor. Currently you're forced to have a hypervisor driver present in libvirtd to get a connection, even if you'll never use it.

Even with a virthypervisord separated out from libvirtd, it is still effectively a monolithic design from the POV of the hypervisor components. So a problem in interacting with any single QEMU process still has the potential to negatively impact our ability to manage other QEMU processes. And of course a code bug that causes a crash takes out the ability to manage everything.

The previous mail describes a change to introduce a 'libvirt-qemu' shim to manage startup for an individual QEMU process. Once this shim process exists, the obvious question to ask is whether it can take responsibility for ongoing management of the QEMU process, essentially owning the monitor connection.

A very large portion of the virDomain related APIs are naturally scoped to only operate on a single QEMU process. Essentially they invoke monitor APIs and get responses, acting as a transformation layer between the libvirt API/XML format and the QMP format. Their implementation does, however, often touch global state when dealing with acquisition of shared resources such as PCI devices, network devices, etc. The allocation of such shared state should be the responsibility of the individual daemons though (virtnodedevd, virtnetworkd, etc).

With all this in mind, it would be possible to move the bulk of individual QEMU management into the 'libvirt-qemu' shim. The virthypervisord would essentially act as an aggregation service and registry. It would handle the APIs that deal with bulk querying of resources, and ensure uniqueness of domain UUIDs and names, etc. Any functional operations on individual guests would simply be passed onto the respective 'libvirt-qemu' shim.

  [app] -> [libvirt.so] -> [virthypervisord] -> [libvirt-shim] -> [qemu]
                                             +> [libvirt-shim] -> [qemu]
                                             +> [libvirt-shim] -> [qemu]
                                             +> [libvirt-shim] -> [qemu]

One might suggest that this would just inherit all the same problems the current libvirtd has, just with the QMP monitor interaction replaced by RPC calls. The key difference here though is that when libvirtd deals with QEMU it is forced to call into the synchronous libvirt.so public API to execute individual API calls. This forced libvirtd to take the approach of creating many worker threads to execute blocking APIs. By contrast, when the virthypervisord daemon calls into the 'libvirt-shim' to perform a command, it would directly use the low level RPC APIs we have. This would enable it to implement a fully asynchronous approach and not require a big pool of worker threads that block. While it would not magically solve all scalability problems, it would give a less complex internal code flow with less juggling of threads. More importantly, a bug in any of the QEMU driver logic relating to QMP would only affect that single 'libvirt-qemu' process, which improves overall system reliability and potentially offers a more secure system.

Regards,
Daniel

-- 
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|