This is a rewrite of the Wiki page:
https://www.linux-kvm.org/page/Nested_Guests

Signed-off-by: Kashyap Chamarthy <kchamart@xxxxxxxxxx>
---
Question: is the live migration of L1-with-L2-running-in-it fixed for
*all* architectures, including s390x?
---
 .../virt/kvm/running-nested-guests.rst | 171 ++++++++++++++++++
 1 file changed, 171 insertions(+)
 create mode 100644 Documentation/virt/kvm/running-nested-guests.rst

diff --git a/Documentation/virt/kvm/running-nested-guests.rst b/Documentation/virt/kvm/running-nested-guests.rst
new file mode 100644
index 0000000000000000000000000000000000000000..e94ab665c71a36b7718aebae902af16b792f6dd3
--- /dev/null
+++ b/Documentation/virt/kvm/running-nested-guests.rst
@@ -0,0 +1,171 @@
Running nested guests with KVM
==============================

A nested guest is a KVM guest that in turn runs on a KVM guest::

      .----------------.  .----------------.
      |                |  |                |
      |       L2       |  |       L2       |
      | (Nested Guest) |  | (Nested Guest) |
      |                |  |                |
      |----------------'--'----------------|
      |                                    |
      |        L1 (Guest Hypervisor)       |
      |           KVM (/dev/kvm)           |
      |                                    |
.------------------------------------------------------.
|                 L0 (Host Hypervisor)                  |
|                    KVM (/dev/kvm)                     |
|------------------------------------------------------|
|                  x86 Hardware (VMX)                   |
'------------------------------------------------------'


Terminology:

 - L0 – level-0; the bare metal host, running KVM

 - L1 – level-1 guest; a VM running on L0; also called the "guest
   hypervisor", as it itself is capable of running KVM

 - L2 – level-2 guest; a VM running on L1; this is the "nested guest"


Use Cases
---------

An additional layer of virtualization can sometimes be useful. You
might have access to a large virtual machine in a cloud environment
that you want to compartmentalize into multiple workloads, or you
might be running a lab environment in a training session.

There are several scenarios where nested KVM can be useful:

 - As a developer, you want to test your software on different
   operating systems. Instead of renting multiple VMs from a cloud
   provider, nested KVM lets you rent one large enough "guest
   hypervisor" (level-1 guest). This in turn allows you to create
   multiple nested guests (level-2 guests), running different OSes,
   on which you can develop and test your software.

 - Live migration of "guest hypervisors" and their nested guests, for
   load balancing, disaster recovery, etc.

 - Using VMs for isolation (as in Kata Containers, and before it Clear
   Containers, https://lwn.net/Articles/644675/) when you're running
   on a cloud provider whose instances are themselves virtual
   machines.


Procedure to enable nesting on the bare metal host
--------------------------------------------------

The KVM kernel modules do not enable nesting by default (though your
distribution may override this default). To enable nesting, set the
``nested`` module parameter to ``Y`` or ``1``. You may set this
parameter persistently in a file in ``/etc/modprobe.d`` on the L0
host:

1. On the bare metal host (L0), list the kernel modules, and ensure
   that the KVM modules are loaded::

    $ lsmod | grep -i kvm
    kvm_intel             133627  0
    kvm                   435079  1 kvm_intel

2. Show information for the ``kvm_intel`` module::

    $ modinfo kvm_intel | grep -i nested
    parm:           nested:bool

3. To make the nested KVM configuration persistent across reboots,
   place the below entry in a config file under ``/etc/modprobe.d``::

    $ cat /etc/modprobe.d/kvm_intel.conf
    options kvm-intel nested=y

4. Unload and re-load the KVM Intel module::

    $ sudo rmmod kvm-intel
    $ sudo modprobe kvm-intel

5. Verify that the ``nested`` parameter for KVM is enabled::

    $ cat /sys/module/kvm_intel/parameters/nested
    Y

For AMD hosts, the process is the same as above, except that the
module name is ``kvm-amd``.

Once your bare metal host (L0) is configured for nesting, you should
be able to start an L1 guest with ``qemu-kvm -cpu host`` (which passes
through the host CPU's capabilities as-is to the guest), or, for
better live migration compatibility, with a named CPU model supported
by QEMU, e.g. ``-cpu Haswell-noTSX-IBRS,vmx=on``; the guest will then
be capable of running an L2 guest with accelerated KVM.
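
For illustration, below is a minimal sketch of such an L1 invocation,
followed by a quick check from inside the resulting guest. This is not
from the original Wiki page; the disk image name (``l1-guest.img``),
memory size, vCPU count, and network setup are placeholder values to be
adapted to your environment::

    # On L0: boot a hypothetical L1 guest image, using a named CPU
    # model with nesting ("vmx") turned on:
    $ qemu-system-x86_64 -enable-kvm \
        -cpu Haswell-noTSX-IBRS,vmx=on \
        -smp 4 -m 8192 \
        -drive file=l1-guest.img,format=qcow2,if=virtio \
        -netdev user,id=net0 -device virtio-net-pci,netdev=net0

    # Inside the running L1 guest: confirm that VMX was passed through
    # (one match per vCPU) and that /dev/kvm exists, i.e. that the
    # guest can itself act as a KVM hypervisor:
    $ grep -cw vmx /proc/cpuinfo
    4
    $ ls /dev/kvm
    /dev/kvm

If you manage guests with libvirt instead, the same effect is achieved
by requiring the ``vmx`` CPU feature in the guest's XML definition; the
plain QEMU invocation is shown here only to keep the example
self-contained.
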

Additional nested-related kernel parameters
-------------------------------------------

If your hardware is sufficiently advanced (an Intel Haswell processor
or above, which has newer hardware virt extensions), you might want to
enable additional features on your bare metal host (L0), such as
"Shadow VMCS (Virtual Machine Control Structure)" and APIC
virtualization. Parameters for Intel hosts::

    $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
    Y

    $ cat /sys/module/kvm_intel/parameters/enable_apicv
    N

    $ cat /sys/module/kvm_intel/parameters/ept
    Y

Again, to make these settings persistent across reboots, append them
to ``/etc/modprobe.d/kvm_intel.conf``::

    options kvm-intel nested=y
    options kvm-intel enable_shadow_vmcs=y
    options kvm-intel enable_apicv=y
    options kvm-intel ept=y


Live migration with nested KVM
------------------------------

The below live migration scenarios should work as of Linux kernel 5.3
and QEMU 4.2.0. In all the below cases, ``/dev/kvm`` is exposed in the
L1 guest, i.e. the L2 guest is a "KVM-accelerated guest", not a "plain
emulated guest" (as done by QEMU's TCG):

- Migrating a nested guest (L2) to another L1 guest on the *same* bare
  metal host.

- Migrating a nested guest (L2) to another L1 guest on a *different*
  bare metal host.

- Migrating an L1 guest, with an *offline* nested guest in it, to
  another bare metal host.

- Migrating an L1 guest, with a *live* nested guest in it, to another
  bare metal host.


Limitations on Linux kernel versions older than 5.3
---------------------------------------------------

On Linux kernel versions older than 5.3, once an L1 guest has started
an L2 guest, the L1 guest is no longer capable of being migrated,
saved, or loaded (refer to the QEMU documentation on "save"/"load")
until the L2 guest shuts down. [FIXME: Is this limitation fixed for
*all* architectures, including s390x?]

Attempting to migrate, or save and load, an L1 guest while an L2 guest
is running will result in undefined behavior: you might see a ``kernel
BUG!`` entry in ``dmesg``, a kernel 'oops', or an outright kernel
panic. Such a migrated or loaded L1 guest can no longer be considered
stable or secure, and must be restarted.

Migrating an L1 guest merely configured to support nesting, while not
actually running L2 guests, is expected to function normally.
Live-migrating an L2 guest from one L1 guest to another is also
expected to succeed.
--
2.21.0