On Wed, Sep 09, 2020 at 01:04:45PM +0300, stsp wrote:
> Hi Guys!
>
> I have a kvm-based hypervisor, and also I have problems with how KVM handles
> cr4.VMXE flag.
>
> Problem 1 can be shown as follows.
>
> The below snippet WORKS as expected:
> ---
>     sregs.cr4 |= X86_CR4_VMXE;

What is the starting value of sregs?  Not that it should matter, but it'd be
helpful to reproduce and understand the issue.

>     ret = ioctl(vcpufd, KVM_SET_SREGS, &sregs);
>     if (ret == -1) {
>       perror("KVM: KVM_SET_SREGS");
>       leavedos(99);
>     }
> ---
>
> The below one doesn't:
> ---
>     ret = ioctl(vcpufd, KVM_SET_SREGS, &sregs);
>     if (ret == -1) {
>       perror("KVM: KVM_SET_SREGS");
>       leavedos(99);
>     }
>     sregs.cr4 |= X86_CR4_VMXE;
>     ret = ioctl(vcpufd, KVM_SET_SREGS, &sregs);
>     if (ret == -1) {
>       perror("KVM: KVM_SET_SREGS");
>       leavedos(99);
>     }
> ---
>
> Basically that example demonstrates that I can set VMXE flag only by the very
> first call to KVM_SET_SREGS. Any subsequent calls do not allow me to modify
> VMXE flag, even though no error is returned, and other flags are modified, if
> needed, as expected, but not this one. Is there any reason why VMXE flag is
> "locked" to its very first setting?

IIUC, in the above snippet, you observe that "ret == 0" but rereading sregs
shows the old CR4 value?

The direct cause of the weirdness is a KVM bug in KVM_SET_SREGS where it doesn't
check the return of vendor specific handling (VMX vs. SVM) of setting CR4.  In
this specific case, odds are good you're running afoul of the check that
disallows VMXE=1 if nested virtualization is not supported for the VM.

As to why only the second variant fails, KVM_SET_SREGS triggers a CPUID update
for the guest, which will reevaluate whether or not the guest supports nested
virtualization.  In theory, that could trigger the behavior you're seeing,
though I would expect guest CPUID to be accurate before the first KVM_SET_SREGS.

So short answer, I have no idea :-)

> Problem 2:
> If I set both VME and VMXE flags (by the very first invocation of
> KVM_SET_SREGS, yes), then VME flag does not actually work. My hypervisor then
> runs in non-VME mode. Is it KVM that clears the VME flag when VMXE is set,
> or is it really not a workable combination of flags?

What do you mean by "My hypervisor runs in non-VME mode"?  I assume you mean
the guest is in non-VME mode?  Or do you really mean CR4.VME in the host?

If you're referring to the guest, what CPU generation are you running?  From the
above descriptions, it sounds like you're on Nehalem (first gen Core) or earlier
(e.g. Core2).  I ask because the answer gets quite complicated if you're running
on hardware without support for unrestricted guest.

If you're talking about host CR4, when do you observe CR4.VME being cleared?
KVM is supposed to preserve the current CR4 value for the host.

> Problem 3.
> Some older Intel CPUs appear to require the VMXE flag even in non-root VMX.
> This is vaguely documented in an Intel specs:
> ---
> The first processors to support VMX operation require that the
> following bits be 1 in VMX operation: CR0.PE, CR0.NE, CR0.PG, and CR4.VMXE.
> ---
>
> They are not explicit about a non-root mode, but my experiments show they
> meant exactly that. On such CPUs, KVM otherwise returns KVM_EXIT_FAIL_ENTRY,
> "invalid guest state".

Do you have emulate_invalid_guest_state disabled?

> Question: did they really mean non-root, and if so - shouldn't KVM itself
> work around such quirks?

Yes, it really does include non-root.  On CPUs without unrestricted guest, the
world switch to non-root is less complete, for lack of a better term.  Non-root
without unrestricted guest requires the CPU to be in protected mode with paging
enabled as the CPU isn't capable of properly virtualizing things if paging is
disabled.

CR4.VMXE=1 is always required, even on modern CPUs.

However, those are the _hardware_ values of CR4, not the guest value of CR4, i.e.
the value visible to the guest is different from the actual value in hardware.
And KVM_SET_SREGS operates on the _guest_ value; KVM always has the final say on
the hardware value.

Note, the hardware value when running in the guest (non-root) will be very
different from the hardware value when running in the host (root), e.g. MCE is
the only CR4 bit that is explicitly propagated to the hardware value for the
guest; all other bits (including VME) are recomputed based on CPU capabilities,
KVM module params, and guest state.

> I wouldn't mind enabling VMXE myself, if not for the Problem 2 above, that
> just disables VME then. Can KVM be somehow "fixed" to not require all these
> dancing, or is there a better ways of fixing that problem?

Can you rewind and describe your original problem?  It sounds like you're trying
to do something very specific on old hardware, encountered an error, and then
built up a pile of workarounds that led you into more issues that aren't
directly related to the original problem.
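
For illustration only (this is not code from the thread): since KVM_SET_SREGS
can return 0 while a CR4 bit is silently dropped, one way to confirm what
Problem 1 describes is to read the registers back with KVM_GET_SREGS and compare
the guest CR4 against what was requested.  A minimal sketch of that comparison
follows.  The helper names cr4_rejected_bits() and cr4_bit_name() are
hypothetical, and the CR4 masks are defined locally from the architectural bit
positions so the snippet builds without kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural CR4 bit positions, defined locally so that this sketch
 * builds without kernel headers. */
#define CR4_VME  (UINT64_C(1) << 0)   /* virtual-8086 mode extensions */
#define CR4_MCE  (UINT64_C(1) << 6)   /* machine-check enable */
#define CR4_VMXE (UINT64_C(1) << 13)  /* VMX enable */

/* Hypothetical helper: bits that were requested as 1 but read back as 0,
 * i.e. bits that did not stick in the guest-visible CR4. */
uint64_t cr4_rejected_bits(uint64_t requested, uint64_t readback)
{
    return requested & ~readback;
}

/* Hypothetical helper: printable name for a single CR4 bit.  Only the
 * three bits discussed in this thread are named. */
const char *cr4_bit_name(uint64_t bit)
{
    assert(bit != 0 && (bit & (bit - 1)) == 0);  /* exactly one bit set */

    if (bit == CR4_VME)
        return "VME";
    if (bit == CR4_MCE)
        return "MCE";
    if (bit == CR4_VMXE)
        return "VMXE";
    return "other";
}
```

In the failing variant, that would mean calling ioctl(vcpufd, KVM_GET_SREGS,
&check) after the second KVM_SET_SREGS (with "check" being a second, hypothetical
struct kvm_sregs) and testing cr4_rejected_bits(sregs.cr4, check.cr4) against
CR4_VMXE.  Note this only inspects the _guest_ value; it says nothing about the
hardware CR4 that KVM actually loads.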