On Fri, Feb 07, 2020 at 04:01:57PM +0000, Dr. David Alan Gilbert wrote:
> * Kashyap Chamarthy (kchamart@xxxxxxxxxx) wrote:

[...]

> > +Running nested guests with KVM
> > +==============================
> > +
> > +A nested guest is a KVM guest that in turn runs on a KVM guest::
>
> Note nesting may be a little more general; e.g. L1 might be another
> OS/hypervisor that wants to run its own L2; and similarly
> KVM might be the L1 under someone else's hypervisor.

True, I narrowly focused on KVM-on-KVM.  Will take this approach: I'll
mention the generic nature of nesting, but focus on KVM-on-KVM in this
document.

> I think this doc is mostly about the case of KVM being the L0
> and wanting to run an L1 that's capable of running an L2.
>
> > +        .----------------.  .----------------.
> > +        |                |  |                |
> > +        | L2             |  | L2             |
> > +        | (Nested Guest) |  | (Nested Guest) |
> > +        |                |  |                |
> > +        |----------------'--'----------------|
> > +        |                                    |
> > +        |       L1 (Guest Hypervisor)        |
> > +        |          KVM (/dev/kvm)            |
> > +        |                                    |
> > +  .------------------------------------------------------.
> > +  |                 L0 (Host Hypervisor)                 |
> > +  |                    KVM (/dev/kvm)                    |
> > +  |------------------------------------------------------|
> > +  |                  x86 Hardware (VMX)                  |
> > +  '------------------------------------------------------'
>
> This is now x86 specific but the doc is in a general directory;
> I'm not sure what other architecture nesting rules are.

Yeah, x86 is the beast I knew, so I stuck to it.  But since this is an
upstream doc, I should bear in mind to clearly mention s390x and other
architectures.

> Worth having VMX/SVM at least.

Will add.

[...]

> > +
> > +Use Cases
> > +---------
> > +
> > +An additional layer of virtualization sometimes can . You
> > +might have access to a large virtual machine in a cloud environment
> > +that you want to compartmentalize into multiple workloads.  You
> > +might be running a lab environment in a training session.
>
> Lose this paragraph, and just use the list below?

That was precisely my intention, but I didn't commit the local version
before sending.  Will fix in v2.

> > +There are several scenarios where nested KVM can be Useful:
> > +
> > + - As a developer, you want to test your software on different OSes.
> > +   Instead of renting multiple VMs from a Cloud Provider, using
> > +   nested KVM lets you rent a large enough "guest hypervisor"
> > +   (level-1 guest).  This in turn allows you to create multiple
> > +   nested guests (level-2 guests), running different OSes, on which
> > +   you can develop and test your software.
> > +
> > + - Live migration of "guest hypervisors" and their nested guests,
> > +   for load balancing, disaster recovery, etc.
> > +
> > + - Using VMs for isolation (as in Kata Containers, and before it
> > +   Clear Containers https://lwn.net/Articles/644675/) if you're
> > +   running on a cloud provider that is already using virtual
> > +   machines

The last use-case was pointed out by Paolo elsewhere.  (I should make
this more generic.)

> Some others that might be worth listing:
>   - VM image creation tools (e.g. virt-install etc) often run their
>     own VM, and users expect these to work inside a VM.
>   - Some other OS's use virtualization internally for other
>     features/protection.

Yeah.  Will add; thanks!

> > +Procedure to enable nesting on the bare metal host
> > +--------------------------------------------------
> > +
> > +The KVM kernel modules do not enable nesting by default (though your
> > +distribution may override this default).
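
Note to self for v2: in this section I also want to show the check and
the manual toggle explicitly, roughly along these lines (an untested
sketch from my notes; `kvm_intel` is for Intel hosts, substitute
`kvm_amd` on AMD):

    # Check whether nesting is enabled (kvm_intel reports Y/N;
    # kvm_amd's equivalent parameter reports 1/0):
    $ cat /sys/module/kvm_intel/parameters/nested
    Y

    # If it is off, reload the module with nesting turned on
    # (this only works while no guests are running):
    $ sudo modprobe -r kvm_intel
    $ sudo modprobe kvm_intel nested=1
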
> It's the other way; see 1e58e5e; intel has made it default; AMD has
> it set as default for longer.

Ah, this was another bit I realized later, but forgot to fix before
sending to the list.  (I recall seeing this when it came out about a
year ago:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1e58e5e)

Will fix.  Thanks for the eagle eyes :-)

> > +Additional nested-related kernel parameters
> > +-------------------------------------------
> > +
> > +If your hardware is sufficiently advanced (Intel Haswell processor
> > +or above which has newer hardware virt extensions), you might want
> > +to enable additional features: "Shadow VMCS (Virtual Machine Control
> > +Structure)", APIC Virtualization on your bare metal host (L0).
> > +Parameters for Intel hosts::
> > +
> > +    $ cat /sys/module/kvm_intel/parameters/enable_shadow_vmcs
> > +    Y
> > +
> > +    $ cat /sys/module/kvm_intel/parameters/enable_apicv
> > +    N
> > +
> > +    $ cat /sys/module/kvm_intel/parameters/ept
> > +    Y
>
> Don't those happen automatically (mostly?)

EPT, yes.  I forget if `enable_shadow_vmcs` and `enable_apicv` are.
I'll investigate and update.

[...]

> > +Limitations on Linux kernel versions older than 5.3
> > +---------------------------------------------------
> > +
> > +On Linux kernel versions older than 5.3, once an L1 guest has
> > +started an L2 guest, the L1 guest would no longer capable of being
> > +migrated, saved, or loaded (refer to QEMU documentation on
> > +"save"/"load") until the L2 guest shuts down.  [FIXME: Is this
> > +limitation fixed for *all* architectures, including s390x?]
> > +
> > +Attempting to migrate or save & load an L1 guest while an L2 guest
> > +is running will result in undefined behavior.  You might see a
> > +``kernel BUG!`` entry in ``dmesg``, a kernel 'oops', or an outright
> > +kernel panic.  Such a migrated or loaded L1 guest can no longer be
> > +considered stable or secure, and must be restarted.
> > +
> > +Migrating an L1 guest merely configured to support nesting, while
> > +not actually running L2 guests, is expected to function normally.
> > +Live-migrating an L2 guest from one L1 guest to another is also
> > +expected to succeed.
>
> Can you add an entry along the lines of 'reporting bugs with nesting'
> that explains you should clearly state what the host CPU is,
> and the exact OS and hypervisor config in L0, L1 and L2?

Yes, good point.  I'll add a short version based on my notes from here
(which you've reviewed in the past):
https://kashyapc.fedorapeople.org/Notes/_build/html/docs/Info-to-collect-when-debugging-nested-KVM.html#what-information-to-collect

Thanks for the review.

--
/kashyap