What follows is a document outlining some thoughts I've been having on
extending sVirt to allow confinement of applications which talk to
libvirtd on the host, primarily focusing on use of SELinux, but also
allowing a simple non-SELinux RBAC mechanism.

Securing KVM virtualization hosts with MAC
==========================================

This document looks at the task of securing KVM virtualization hosts
using mandatory access control technologies, with a focus on SELinux.
At the time of writing there have been two phases of development, and
this document makes proposals for a third phase.

Phase 1: circa 2006
-------------------

Goal: Protect the host from a compromised virtual machine.

The first phase of development had the modest goal of protecting the
host from attack by a compromised virtual machine. To achieve this,
the KVM processes are configured to run under a confined security
context ('virt_t' in the SELinux reference policy), which blocks
access to any host resources not labelled ('virt_image_t') for use
by virtual machines.

The primary limitation of this initial implementation is that while
the virtualization host is secured, there is no protection between
virtual machines. This can be considered a regression in isolation
compared to that offered by non-virtualized hosts.

The second limitation is that the virtualization admin has to take
care to ensure the host resources intended for use by the virtual
machines are correctly labelled. This is a manual setup task unless
the images are kept in a preset location (/var/lib/libvirt/images
in the SELinux reference policy).

Phase 2: March 2009
-------------------

Goal: Protect virtual machines from each other

The second phase of development has the goal of providing isolation
between virtual machines that is comparable to that achieved between
physical machines. This piece of work is commonly referred to as
"svirt".
To achieve this, the KVM processes are each configured to run under
a dedicated security context, which blocks access to any resources
not explicitly assigned to that virtual machine. In the SELinux
implementation, the base context "svirt_t" has a unique MCS category
("c240,c955") appended to form a unique security context
"system_u:system_r:svirt_t:s0:c240,c955". For each host resource to
be assigned to the virtual machine, the base context "svirt_image_t"
is combined with the same MCS category to form a unique resource
security context "system_u:object_r:svirt_image_t:s0:c240,c955".

The assignment of virtual machine security contexts and labelling of
resources can be done statically by the administrator / management
application, or dynamically by the libvirtd daemon. The latter removes
much of the administrator burden.

The second phase has addressed the major guest security limitation
of the first phase, and eased the burden placed on host administrators.
Attention can now focus on the security of the host management
software stack.

Client applications communicate with the libvirtd daemon using a
simple sockets-based RPC protocol. Thus operations initiated by client
applications which run under one security context are in fact invoked
under the libvirtd daemon's security context. Since the libvirtd
daemon is a highly privileged, almost unconfined process, this provides
a means for applications to elevate their privileges.

A second problem with the current model is seen when looking at guest
migration between hosts. During migration, there are two QEMU processes
running for the same virtual machine, one process on each host. The
dynamic assignment of MCS values to form unique security contexts is
done on a per-host basis, so there is no guarantee that the VM on
host A will be using (or be able to use) the same security context
on the target host of migration.
This is not necessarily a problem if the guest is using block devices,
since block device inode labels are only visible to a single host.
With a shared filesystem that supports SELinux labelling, like GFS2,
both QEMU processes must run in the same security context to allow
them both to access the associated files.

Phase 3: June 2011
------------------

Goal: Protect virtual machines from host applications

The third phase of development has the primary goal of honouring the
confinement of client applications talking to libvirtd, when performing
operations on virtual machines and other managed objects (storage pools,
host devices, virtual networks, secrets, etc).

Every application connecting to libvirt has an associated security
context. Every object managed by libvirtd will have an associated
security context. When an operation is invoked via a libvirt API, the
client application security context will be checked against the target
object context before proceeding. Thus applications will not be able
to make use of a libvirtd connection to perform operations that are
otherwise blocked.

The secondary goal is to add further flexibility and safety to the
way MCS categories are assigned, and files are relabelled. Instead of
maintaining a database of assigned labels local to each host, there
must be some shared storage where label usage can be recorded. At its
simplest this can be an NFS share, with one file per MCS category and
locking with fcntl(). An alternative would be to acquire leases using
a lock manager such as sanlock.

In addition, the guest configuration will be enhanced such that a guest
can be assigned a statically chosen security context, but still make
use of dynamic relabelling of resources.

Finally the existing boolean mode of 'static' vs 'dynamic' label
generation will be turned into a tri-state, introducing a 'hybrid'
mode where the client supplies a custom base context, and the MCS
part is still auto-generated.
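
As a rough illustration of the shared-storage idea, here is a minimal
Python sketch of claiming an MCS category pair by locking one file per
pair in a shared directory with fcntl(), then combining the pair with a
base context as the 'hybrid' mode would. The names McsAllocator and
build_contexts, the file naming and the random selection strategy are
all hypothetical and not existing libvirt code. It also assumes the
shared filesystem provides working fcntl() locks across hosts.

```python
import fcntl
import os
import random


class McsAllocator:
    """Claims MCS category pairs via one lock file per pair (hypothetical)."""

    def __init__(self, lock_dir, max_category=1023):
        self.lock_dir = lock_dir          # assumed to be on shared storage
        self.max_category = max_category
        self._held = {}                   # category pair -> open file descriptor

    def try_claim(self, pair):
        """Return True if this process now holds the lock for `pair`."""
        if pair in self._held:
            # fcntl() locks are owned per process, so a second lock attempt
            # from this process would succeed; track our own claims instead.
            return False
        path = os.path.join(self.lock_dir, pair)
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.close(fd)                  # held by another process or host
            return False
        self._held[pair] = fd
        return True

    def allocate(self, attempts=100):
        """Pick random category pairs until one can be claimed."""
        for _ in range(attempts):
            a, b = sorted(random.sample(range(self.max_category + 1), 2))
            pair = f"c{a},c{b}"
            if self.try_claim(pair):
                return pair
        raise RuntimeError("no free MCS category pair found")

    def release(self, pair):
        """Drop the lock so another guest or host can reuse the pair."""
        fd = self._held.pop(pair)
        fcntl.lockf(fd, fcntl.LOCK_UN)
        os.close(fd)


def build_contexts(base, image_base, pair):
    """Append the MCS part to a process base context and an image base context."""
    return f"{base}:s0:{pair}", f"{image_base}:s0:{pair}"
```

With this sketch, build_contexts("system_u:system_r:svirt_t",
"system_u:object_r:svirt_image_t", "c240,c955") yields the two example
contexts shown in the Phase 2 description. A lease from a lock manager
such as sanlock could stand in for try_claim() without changing the rest.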
Usage scenarios
---------------

To aid in development, a few relevant core use cases or usage scenarios
have been identified:

1. A virtual machine monitoring application

   For this example, consider the simple monitoring application
   'virt-top'. This application displays a list of all virtual machines
   on the host and their associated resource utilization (CPU, disk,
   network). This application has no need to be able to stop/start/define
   virtual machines, nor do any operation related to host devices,
   storage, or networking. Traditionally this application is written
   to use a read only libvirt connection.

   With enhanced access control from libvirtd, the policy would define
   a new security context 'virt_top_t' for the 'virt-top' application.
   This policy would allow 'list', 'read', 'readstats' on the 'domain'
   object type.

2. A multi-guest, multi-user MLS enabled host

   For this example, consider a virtualization host with MLS policy
   that is running multiple virtual machines, for a variety of different
   users. A user with the security level "restricted" must not be
   allowed to control virtual machines with a security level of
   "confidential". Conversely a user with security level "secret" must
   not be allowed to create virtual machines with a security level of
   "unclassified".

   With enhanced access control from libvirtd, getpeercon() would provide
   the security context of the client application (user). The client
   context would be used to perform an AVC check when any API operation
   is invoked, thus ensuring that the client's MLS label is honoured in
   access control checks. The effect would be that when a 'restricted'
   user asked for a list of virtual machines, only virtual machines at
   level 'restricted' or below would be returned. Or when a "secret"
   user asked to start a guest with a security level of 'unclassified',
   the operation would be denied.

3.
Identity transitions from trusted agents

   For this example, consider a trusted agent such as libvirt-qpid,
   or libvirt-snmp, which translates the libvirt API from its native
   model into an alternate access model. In such an example, the agent
   talking to libvirtd will have authenticated itself. The peer identity
   that libvirtd sees, however, is that of the agent, not the ultimate
   (end-user) client. In such a case it will be desirable to allow a
   trusted agent to transition to a different identity when performing
   operations.

   An end user running under context
   "unconfined_u:unconfined_r:virt_top_t:s0-s0:c0.c1023" may talk to
   the libvirt-qpid agent which runs under the context
   "system_u:system_r:virt_qpid_t:s0-s0:c0.c1023". The libvirt-qpid
   agent connects to libvirtd, which sees 'virt_qpid_t' as the client
   type. The policy is written to allow transitions from 'virt_qpid_t'
   to the 'virt_top_t' type, so when the virt-top client connects to
   libvirt-qpid, the agent changes its identity to 'virt_top_t'. From
   that point onwards, all AVC checks honour the privileges of the
   ultimate end user application, rather than the libvirt-qpid
   intermediary. The same mechanism also ensures that the client
   application MLS level is transferred via the libvirt-qpid agent
   to libvirtd.

Anticipated Development tasks
-----------------------------

1. Extend the domain XML to add a third attribute, relabel="yes|no",
   to the <seclabel> element, to control whether libvirtd will
   automatically label resources assigned to a guest. If the existing
   'mode' attribute is 'dynamic', then relabelling will default to
   enabled, while if it is 'static', then relabelling will default to
   disabled. Also change 'mode' to allow a new 'hybrid' value.

2. Determine how to maintain/identify security labels for other
   managed objects, including virStoragePoolPtr, virStorageVolPtr,
   virSecretPtr, virNetworkPtr, virInterfacePtr, virNodeDevicePtr,
   and host level APIs without any explicit managed object.

3.
Extend XML for non-domain objects to implant security labels as
   identified in step 2.

4. Create an internal virIdentity struct to store the identity of
   the client. This will include at least the x509 distinguished name,
   the SASL username, the SELinux context (getpeercon()) and UNIX
   username/group (SCM_CREDENTIALS).

5. Create a new public API to allow a client application to supply
   a new identity, allowing them to pass a new x509 distinguished name,
   SASL username, SELinux context and UNIX username/group.

6. Extend the libvirtd daemon such that the current identity is stored
   in a thread local whenever invoking a public API operation.

7. Extend the QEMU driver such that a suitable identity is set when
   performing autonomous background operations such as domain
   auto-start and core dump, in a non-API thread.

8. Create a set of internal access control helper APIs in
   $libvirt/src/accesscontrol/. There will be one API for each managed
   object, taking an object pointer and an operation identifier
   (from an enum).

9. Create a simple impl of the access control APIs which defines
   roles for groups of user identities, and grants privileges to each
   role based on the operation names. This allows for simple testing
   of internal infrastructure, and an RBAC mechanism for users who
   lack SELinux in their OS.

10. Implant access control checks into the main codepaths of every
    driver method implementation in the QEMU driver.

11. Change the SELinux reference policy to define the new security
    types and access vectors for the libvirt objects & associated
    API calls.

12. Create a SELinux impl of the access control APIs which invokes
    avc_has_perm() using the client's SELinux context. This is intended
    to be the primary RBAC mechanism for Fedora/RHEL virtualization
    hosts.

13. Write policy to confine targeted applications like virt-top,
    virt-mem.

14. Extend libvirt-snmp, libvirt-cim, libvirt-qpid to pass through
    the client identity to libvirtd.

Technical Notes / Issues
------------------------

1.
Adding new SELinux security classes / access vectors

   The SELinux security classes are defined in
   /usr/include/selinux/flask.h and access vectors in
   /usr/include/selinux/av_permissions.h

   Both of these files are automatically generated by a script in the
   SELinux reference policy code '$serefpolicy/policy/flask/flask.py'.
   The master data files are in the same directory, 'access_vectors'
   and 'security_classes'. Once generated, the headers need to be
   manually copied into the libselinux package sources.

   APIs are added to libvirt on a very frequent basis. What is the
   process for applying access control to them if the SELinux policy
   does not yet have a suitable access vector / security class defined?
   Do we need a generic 'admin' access vector we can use as a catch-all,
   until more specific vectors can be defined for the new APIs? It is
   desirable to avoid having to lock-step upgrade libvirt with SELinux
   policy for all additions to the libvirt public API.

2. Security contexts for libvirt managed objects

   virDomainPtr: Already embedded in XML, unless using dynamic
   labelling, in which case the context is assigned at startup.

   virNetworkPtr: No existing security context, nor any object on disk
   that could be used. Follow the example of domains and embed
   <seclabel> in the XML. Assign a unique MCS category per network and
   ensure that daemons launched per network (dnsmasq, radvd) inherit
   the MCS category.

   virSecretPtr: No existing security context. Secrets may be
   associated with disk paths for VMs. Could copy the security context
   of the guests and apply it to the secret, or have a dedicated type
   svirt_secret_t and just copy the MCS category. Hard to make it work
   for guests with dynamic MCS assignment.

   virStoragePoolPtr: No existing security context. Some pool types
   have objects existing on the host filesystem, eg SCSI HBAs have a
   directory in sysfs, filesystem dirs have a directory somewhere, LVM
   has a directory for the volume group in /dev. Other pool types have
   no object on disk anywhere convenient, eg Sheepdog.
   Other pool types only have an object on disk when the pool is
   active (eg iSCSI, NFS), so there is nothing to use for API checks
   when the pool is inactive. Likely have to ignore whatever associated
   resource is on disk and just store a security context in the XML
   config as with virDomainPtr/virNetworkPtr.

   virStorageVolPtr: Currently reports the SELinux security label
   associated with the file on disk. Not all pool types necessarily
   have volumes with a corresponding file on disk (eg Sheepdog).

   virNodeDevicePtr: No existing security context. Most data comes
   from udev or HAL databases, though ultimately much is available in
   sysfs. When detaching PCI devices from host drivers, files in sysfs
   are used. When creating/deleting NPIV adapters, sysfs is used. Thus
   could use sysfs file labels for AVC checks?

   virConnectPtr: All host level APIs for which there is no other
   object aside from the nebulous concept of the 'host'. APIs are all
   readonly, eg query host capabilities, query free memory, CPU stats,
   etc. What if we gain APIs to make write calls?

   virInterfacePtr: No existing security context. Currently using
   netcf to get data from /etc/sysconfig/network-scripts/ifcfg-XXX
   files, but can't assume those file names since that is Fedora/RHEL
   specific. Might not even use netcf if it talks directly to
   NetworkManager. Does netcf need to expose a security label based
   on the ifcfg-XXX file?

3. Security labelling config modes

   When creating a guest the following XML snippets can be used.

   a. Default type, dynamic MCS, automatic relabelling

        <seclabel type='selinux' mode='dynamic' relabel='yes'/>

   b. Custom type, dynamic MCS, automatic relabelling

        <seclabel type='selinux' mode='hybrid' relabel='yes'>
          <label>system_u:system_r:mysvirt_t</label>
          <imagelabel>system_u:object_r:mysvirt_image_t</imagelabel>
        </seclabel>

   c. Default type, dynamic MCS, no relabelling

        <seclabel type='selinux' mode='dynamic' relabel='no'/>

      Does this mode make any sense, since the admin doesn't know the
      MCS category upfront?
      Possibly only useful if the guest only has readonly disks.

   d. Custom type, dynamic MCS, no relabelling

        <seclabel type='selinux' mode='hybrid' relabel='no'>
          <label>system_u:system_r:mysvirt_t</label>
        </seclabel>

      Same question about whether it makes sense.

   e. Custom type, static MCS, auto relabelling

        <seclabel type='selinux' mode='static' relabel='yes'>
          <label>system_u:system_r:mysvirt_t:s0:c123,c456</label>
          <imagelabel>system_u:object_r:mysvirt_image_t:s0:c123,c456</imagelabel>
        </seclabel>

   f. Custom type, static MCS, no relabelling

        <seclabel type='selinux' mode='static' relabel='no'>
          <label>system_u:system_r:mysvirt_t:s0:c123,c456</label>
        </seclabel>

4. Time at which to apply checks / source context

   It would be desirable to restrict the ability to use automatic file
   relabelling within the policy. If a client application defines a
   guest with the 'relabel=yes' attribute set, at what time should this
   usage be validated?

   Validate at the time the guest is defined? This ensures the app
   defining the guest is suitably privileged, but the file labels might
   be changed by the time the guest starts.

   Validate at the time the guest is started? This minimises the window
   between the access check being performed and libvirtd actually
   performing the relabel operation. The app starting the guest might
   be different from the one defining the guest though.

   Check at both define + start time?

   What source security context should we use when performing autostart
   of virtual machines? Normally when starting a VM, the check would be
   performed using the context of the client invoking the start API,
   but there is no such client when autostart occurs. Should we instead
   perform a 'start' operation check whenever the 'autostart' flag is
   turned on by a client? Or check the autostart operation against some
   generic source context?
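
To make the "check at both define + start time" option more concrete,
here is a small Python sketch built on the kind of simple role-based
implementation described in development task 9. It is only one possible
answer to the questions above, not a decision. The Identity class is a
stand-in for the proposed virIdentity struct, and RolePolicy,
GuestManager and the 'define', 'start' and 'relabel' operation names
are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    """Hypothetical stand-in for the proposed virIdentity struct."""
    unix_user: str


class AccessDenied(Exception):
    pass


class RolePolicy:
    """Maps users to roles, and roles to permitted (object, operation) pairs."""

    def __init__(self, members, grants):
        self.members = members  # role name -> set of UNIX user names
        self.grants = grants    # role name -> set of (object type, operation)

    def check(self, identity, obj_type, operation):
        """Raise AccessDenied unless some role of the user grants the operation."""
        for role, users in self.members.items():
            granted = self.grants.get(role, set())
            if identity.unix_user in users and (obj_type, operation) in granted:
                return
        raise AccessDenied(f"{identity.unix_user}: {obj_type}.{operation}")


class GuestManager:
    """Toy model of define/start with the relabel check applied at both times."""

    def __init__(self, policy):
        self.policy = policy
        self.guests = {}  # guest name -> whether relabel='yes' was requested

    def define(self, identity, name, relabel):
        self.policy.check(identity, "domain", "define")
        if relabel:
            # First check: the defining client must be allowed to ask for relabelling.
            self.policy.check(identity, "domain", "relabel")
        self.guests[name] = relabel

    def start(self, identity, name):
        self.policy.check(identity, "domain", "start")
        if self.guests[name]:
            # Second check, made just before the relabel would be performed,
            # against the client starting the guest (who may differ from the definer).
            self.policy.check(identity, "domain", "relabel")
        return "running"
```

In this model a user who may only 'start' guests can start a guest
defined with relabel='no', but is refused for one defined with
relabel='yes'. The autostart question is left open here: it would need
either a generic source Identity or a 'start' check at the time the
autostart flag is set.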
-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|