Hi David, thanks for the added input! I'm taking the liberty to snip a
few paragraphs to trim this email down a bit.

On Thu, Feb 8, 2018 at 1:07 PM, David Hildenbrand <david@xxxxxxxxxx> wrote:
>> Just to give an example,
>> https://www.redhat.com/en/blog/inception-how-usable-are-nested-kvm-guests
>> from just last September talks explicitly about how "guests can be
>> snapshot/resumed, migrated to other hypervisors and much more" in the
>> opening paragraph, and then talks at length about nested guests —
>> without ever pointing out that those very features aren't expected to
>> work for them. :)
>
> Well, it is still a kernel parameter ("nested") that is disabled by
> default, so things should be expected to be shaky. :) While running
> nested guests usually works fine, migrating a nested hypervisor is the
> problem.
>
> Especially see e.g.
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/nested_virt
>
> "However, note that nested virtualization is not supported or
> recommended in production user environments, and is primarily intended
> for development and testing."

Sure, I do understand that Red Hat (or any other vendor) is taking no
support responsibility for this. At this point I'd just like to
contribute to a better understanding of what's expected to definitely
_not_ work, so that people don't bloody their noses on that. :)

>> So to clarify things, could you enumerate the currently known
>> limitations when enabling nesting? I'd be happy to summarize those and
>> add them to the linux-kvm.org FAQ so others are less likely to hit
>> their head on this issue. In particular:
>
> The general problem is that migration of an L1 will not work when it is
> running L2, i.e. when L1 is using VMX ("nVMX").
>
> Migrating an L2 should work as before.
>
> The problem is, in order for L1 to make use of VMX to run L2, we have to
> run L2 in L0 while simulating VMX for L1 (nested VMX, a.k.a. nVMX). This
> requires additional state information about L1 (the "nVMX" state), which
> is not properly migrated when migrating L1. Therefore, the CPU state of
> L1 might be screwed up after migration, resulting in L1 crashes.
>
> In addition, certain VMX features might be missing on the target, which
> also still has to be handled via the CPU model in the future.

Thanks a bunch for the added detail. I got a primer from Kashyap on IRC
today on how savevm/loadvm is very similar to migration, but I'm still
struggling to wrap my head around it. What you say makes perfect sense
to me in that _migration_ might blow up in subtle ways, but can you try
to explain to me why the same considerations would apply to
savevm/loadvm?

> L0 should hopefully not crash; I hope that you are not seeing that.

No, I am not; we're good there. :)

>> - Is https://fedoraproject.org/wiki/How_to_enable_nested_virtualization_in_KVM
>> still accurate in that -cpu host (libvirt "host-passthrough") is the
>> strongly recommended configuration for the L2 guest?
>>
>> - If so, are there any recommendations for how to configure the L1
>> guest with regard to CPU model?
>
> You have to indicate the VMX feature to your L1 ("nested hypervisor");
> that is usually done automatically by using the "host-passthrough" or
> "host-model" value. If you're using a custom CPU model, you have to
> enable it explicitly.

Roger. Without that we can't do nesting at all.
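For anyone finding this thread later, my understanding is that the
custom-model variant would look roughly like this in the L1 domain XML
(just a sketch; the model name is only an example, the point is the
explicitly required vmx feature):

<cpu mode='custom' match='exact'>
  <model fallback='allow'>Haswell-noTSX</model>
  <!-- expose VMX so the L1 guest can itself act as a hypervisor -->
  <feature policy='require' name='vmx'/>
</cpu>

With "host-passthrough" or "host-model", the vmx flag should come along
automatically, provided the "nested" module parameter is enabled in L0.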
>> - Is live migration with nested guests _always_ expected to break on
>> all architectures, and if not, which are safe?
>
> x86 VMX: running nested guests works, migrating nested hypervisors does
> not work
>
> x86 SVM: running nested guests works, migrating nested hypervisors does
> not work (somebody correct me if I'm wrong)
>
> s390x: running nested guests works, migrating nested hypervisors works
>
> power: running nested guests works only via KVM-PR ("trap and emulate").
> Migrating nested hypervisors therefore works, but we are not using
> hardware virtualization for L1->L2. (my latest status)
>
> arm: running nested guests is in the works (my latest status); migration
> is therefore also not possible.

Great summary, thanks!

>> - Idem, for savevm/loadvm?
>
> savevm/loadvm is not expected to work correctly on an L1 if it is
> running L2 guests. It should work on L2, however.

Again, I'm somewhat struggling to understand this vs. live migration —
but it's entirely possible that I'm sorely lacking in my knowledge of
kernel and CPU internals.

>> - With regard to the problem that Kashyap and I (and Dennis, the
>> kernel.org bugzilla reporter) are describing, is this expected to work
>> any better on AMD CPUs? (All reports are on Intel)
>
> No, remember that migration support for the nested SVM state is also
> still missing.

Understood, thanks.

>> - Do you expect nested virtualization functionality to be adversely
>> affected by KPTI and/or other Meltdown/Spectre mitigation patches?
>
> Not an expert on this. I think it should be affected in a similar way
> to ordinary guests. :)

Fair enough. :)

>> Kashyap, can you think of any other limitations that would benefit
>> from improved documentation?
>
> We should certainly document what I have summarized here properly in a
> central place!

I tried getting registered on the linux-kvm.org wiki to do exactly that,
and ran into an SMTP/DNS configuration issue with the verification
email. Kashyap said he was going to poke the site admin about that.

Now, here's a bit more information on my continued testing. As I
mentioned on IRC, one of the things that struck me as odd was that when
I ran into the issue previously described, the L1 guest would enter a
reboot loop if configured with kernel.panic_on_oops=1. In other words, I
would savevm the L1 guest (with a running L2), then loadvm it, and then
the L1 would stack-trace, reboot, and keep doing that indefinitely. I
found that weird because on the second reboot I would expect the system
to come up cleanly.

I've now changed my L2 guest's CPU configuration so that libvirt (in L1)
starts the L2 guest with the following settings:

<cpu>
  <model fallback='forbid'>Haswell-noTSX</model>
  <vendor>Intel</vendor>
  <feature policy='disable' name='vme'/>
  <feature policy='disable' name='ss'/>
  <feature policy='disable' name='f16c'/>
  <feature policy='disable' name='rdrand'/>
  <feature policy='disable' name='hypervisor'/>
  <feature policy='disable' name='arat'/>
  <feature policy='disable' name='tsc_adjust'/>
  <feature policy='disable' name='xsaveopt'/>
  <feature policy='disable' name='abm'/>
  <feature policy='disable' name='aes'/>
  <feature policy='disable' name='invpcid'/>
</cpu>

Basically, I am disabling every single feature that my L1's "virsh
capabilities" reports. This does not make my L1 come up happily from
loadvm, but it does seem to initiate a clean reboot after loadvm, and
after that clean reboot it lives happily. If this is as good as it gets
(for now), then I can totally live with that.
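For completeness, the savevm/loadvm sequence I keep referring to is
nothing more exotic than an internal snapshot of the running L1, taken
via the QEMU monitor (the tag name below is arbitrary):

(qemu) savevm l1-with-running-l2
... L1 and its L2 guest keep running for a while ...
(qemu) loadvm l1-with-running-l2

As far as I understand, virsh snapshot-create-as and snapshot-revert on
the L1 domain amount to the same thing for a running, qcow2-backed
guest.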
That workaround certainly beats running the L2 guest with plain QEMU
(without KVM acceleration). But I would still love to understand the
issue a little bit better.

Cheers,
Florian