On Thu, Feb 06, 2020 at 01:05:37PM +0000, Daniel P. Berrangé wrote:

The core content reads very well. A couple of minor nit-picks inline.

[...]

> diff --git a/docs/kbase/qemu-passthrough-security.rst b/docs/kbase/qemu-passthrough-security.rst
> new file mode 100644
> index 0000000000..7fb1f6fbdd
> --- /dev/null
> +++ b/docs/kbase/qemu-passthrough-security.rst
> @@ -0,0 +1,157 @@

[...]

> +XML document additions
> +======================
> +
> +To deal with the problem, libvirt introduced support for command line

Nit: s/command line/command-line/g (there are a few occurrences)

> +passthrough of QEMU arguments. This is achieved by supporting a custom
> +XML namespace, under which some QEMU driver specific elements are defined.
> +
> +The canonical place to declare the namespace is on the top level ``<domain>``
> +element. At the very end of the document, arbitrary command line arguments
> +can now be added, using the namespace prefix ``qemu:``
> +
> +::

If you can stomach the syntax change, you can put the :: at the end of
the sentence.

> +
> +   <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
> +     <name>QEMUGuest1</name>
> +     <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
> +     ...
> +     <qemu:commandline>
> +       <qemu:arg value='-newarg'/>
> +       <qemu:arg value='parameter'/>

I'd guess you intentionally took a generic example, rather than a
specific QEMU command-line parameter, to illustrate the XML, in case the
example command-line is deprecated, etc.

> +       <qemu:env name='ID' value='wibble'/>
> +       <qemu:env name='BAR'/>
> +     </qemu:commandline>
> +   </domain>

Is it worth calling out that the 'env' fragments are environment
variables? As it isn't obvious to those who don't dwell on libvirt/QEMU
daily.

> +Note that when an argument takes a value eg ``-newarg parameter``, the argument
> +and the value must be passed as separate ``<qemu:arg>`` entries.
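As an aside on the "separate ``<qemu:arg>`` entries" rule: a minimal
sketch, using only Python's standard ``xml.etree.ElementTree`` and not any
libvirt API, of generating the passthrough block with one entry per
command-line token. The helper name ``add_passthrough`` is hypothetical,
and treating a ``None`` value as "omit the ``value`` attribute" is an
assumption that simply mirrors the ``BAR`` entry in the example above.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the quoted example; the "qemu" prefix is
# registered so the serialized output uses qemu: rather than ns0:.
QEMU_NS = "http://libvirt.org/schemas/domain/qemu/1.0"
ET.register_namespace("qemu", QEMU_NS)


def add_passthrough(domain, argv, env=None):
    """Append <qemu:commandline> to `domain` (hypothetical helper).

    Every token in `argv` becomes its own <qemu:arg>, so an option and
    its value are never merged into a single entry.
    """
    cmdline = ET.SubElement(domain, f"{{{QEMU_NS}}}commandline")
    for token in argv:
        ET.SubElement(cmdline, f"{{{QEMU_NS}}}arg", {"value": token})
    for name, value in (env or {}).items():
        attrs = {"name": name}
        if value is not None:  # assumption: None means "no value attribute"
            attrs["value"] = value
        ET.SubElement(cmdline, f"{{{QEMU_NS}}}env", attrs)
    return cmdline


domain = ET.Element("domain", {"type": "kvm"})
ET.SubElement(domain, "name").text = "QEMUGuest1"
add_passthrough(domain, ["-newarg", "parameter"], {"ID": "wibble", "BAR": None})

xml_text = ET.tostring(domain, encoding="unicode")
print(xml_text)
```

Passing the arguments as a list of tokens (rather than one string to be
split) keeps the one-token-per-entry rule explicit and avoids guessing at
shell-style quoting.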
>
> +
> +Instead of declaring the XML namespace on the top level ``<domain>`` it is also
> +possible to declare it at time of use, which is more convenient for humans
> +writing the XML documents manually. So the following example is functionally
> +identical:
> +
> +::

Here too, you can put the :: at the end of the sentence, saving one
colon :D

> +
> +   <domain type='kvm'>
> +     <name>QEMUGuest1</name>
> +     <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
> +     ...
> +     <commandline xmlns="http://libvirt.org/schemas/domain/qemu/1.0">
> +       <arg value='-newarg'/>
> +       <arg value='parameter'/>
> +       <env name='ID' value='wibble'/>
> +       <env name='BAR'/>
> +     </commandline>
> +   </domain>
> +
> +Note that when querying the XML from libvirt, it will have been translated into
> +the canonical syntax once more with the namespace on the top level element.

Here you might want to use the rST "note" admonition:

    .. note:: When querying the XML from libvirt, it will have been
              translated into canonical syntax once more with the
              namespace on the top level element.

> +
> +Security confinement / sandboxing
> +=================================
> +
> +When libvirt launches a QEMU process it makes use of a number of security
> +technologies to confine QEMU and thus protect the host from malicious VM
> +breakouts.
> +
> +When configuring security protection, however, libvirt generally needs to know
> +exactly which host resources the VM is permitted to access. It gets this
> +information from the domain XML document. This only works for elements in the
> +regular schema, the arguments used with command line passthrough are completely
> +opaque to libvirt.
> +
> +As a result, if command line passthrough is used to expose a file on the host
> +to QEMU, the security protections will activate and either kill QEMU or deny it
> +access.
> +
> +There are two strategies for dealing with this problem, either figure out what
> +steps are needed to grant QEMU access to the device, or disable the security
> +protections. The former is harder, but more secure, while the latter is simple.
> +
> +Granting access per VM
> +----------------------
> +
> +* SELinux - the file on the host needs an SELinux label that will grant access
> +  to QEMU's ``svirt_t`` policy.
> +
> +  - Read only access - use the ``virt_content_t`` label

Nit: s/"Read only"/Read-only/

> +  - Shared, write access - use the ``svirt_image_t:s0`` label (ie no MCS
> +    category appended)
> +  - Exclusive, write access - use the ``svirt_image_t:s0:MCS`` label for the VM.
> +    The MCS is auto-generatd at boot time, so this may require re-configuring
> +    the VM to have a fixed MCS label
> +
> +* DAC - the file on the host needs to be readable/writable to the ``qemu``

Nit: let's please expand acronyms on first use: "Discretionary Access
Control (DAC)"; although DAC and ACL (below) might be common enough for
"Linux dwellers" that we don't have to be pedantic about it. But MCS
(Multi-Category Security) is familiar only to those who are
SELinux-aware.

So, your choice, as I don't want to make you expand every acronym; but
only the obscure ones. :-)

> +  user or ``qemu`` group. This can be done by changing the file ownership to
> +  ``qemu``, or relaxing the permissions to allow world read, or adding file
> +  ACLs to allow access to ``qemu``.
> +
> +* Namespaces - a private ``mount`` namespace is used for QEMU by default
> +  which populates a new ``/dev`` with only the device nodes needed by QEMU.
> +  There is no way to augment the set of device nodes ahead of time.
> +
> +* Seccomp - libvirt launches QEMU with its built-in seccomp policy enabled with
> +  ``obsolete=deny``, ``elevateprivileges=deny``, ``spawn=deny`` and
> +  ``resourcecontrol=deny`` settings active. There is no way to change this
> +  policy on a per VM basis

Missing full stop at the end here ...

> +
> +* Cgroups - a custom cgroup is created per VM and this will either use the
> +  ``devices`` controller or an ``BPF`` rule to whitelist a set of device nodes.
> +  There is no way to change this policy on a per VM basis.
> +
> +Disabling security protection per VM
> +------------------------------------
> +
> +Some of the security protections can be disabled per-VM:
> +
> +* SELinux - in the domain XML the ``<seclabel>`` model can be changed to
> +  ``none`` instead of ``selinux``, which will make the VM run unconfined.
> +
> +* DAC - in the domain XML an ``<seclabel>`` element with the ``dac`` model can
> +  be added, configured with a user / group account of ``root`` to make QEMU run
> +  with full privileges

... here,

> +* Namespaces - there is no way to disable this per VM
> +
> +* Seccomp - there is no way to disable this per VM
> +
> +* Cgroups - there is no way to disable this per VM
> +
> +Disabling security protection host-wide
> +---------------------------------------
> +
> +As a last resort it is possible to disable security protection host wide which
> +will affect all virtual machines. These settings are all made in
> +``/etc/libvirt/qemu.conf``

... and here.

> +
> +* SELinux - set ``security_default_confied = 0`` to make QEMU run unconfined by
> +  default, while still allowing explicit opt-in to SELinux for VMs.
> +
> +* DAC - set ``user = root`` and ``group = root`` to make QEMU run as the root
> +  account
> +
> +* SELinux, DAC - set ``security_driver = []`` to entirely disable both the
> +  SELinux and DAC security drivers.
> +
> +* Namespaces - set ``namespaces = []`` to disable use of the ``mount``
> +  namespaces, causing QEMU to see the normal fully popualated ``dev``
> +
> +* Seccomp - set ``seccomp_sandbox = 0`` to disable use of the Seccomp sandboxing
> +  in QEMU
> +
> +* Cgroups - set ``cgroup_device_acl`` to include the desired device node, or
> +  ``cgroup_controllers = [...]`` to exclude the ``devices`` controller.

I'll let you pick what you want to address, as this doc is an
improvement as-is, FWIW:

    Reviewed-by: Kashyap Chamarthy <kchamart@xxxxxxxxxx>

-- 
/kashyap