On Tue, Dec 05, 2023, Michael Roth wrote: > In general, activating long mode involves setting the EFER_LME bit in > the EFER register and then enabling the X86_CR0_PG bit in the CR0 > register. At this point, the EFER_LMA bit will be set automatically by > hardware. > > In the case of SVM/SEV guests where writes to CR0 are intercepted, it's > necessary for the host to set EFER_LMA on behalf of the guest since > hardware does not see the actual CR0 write. > > In the case of SEV-ES guests where writes to CR0 are trapped instead of > intercepted, the hardware *does* see/record the write to CR0 before > exiting and passing the value on to the host, so as part of enabling > SEV-ES support commit f1c6366e3043 ("KVM: SVM: Add required changes to > support intercepts under SEV-ES") dropped special handling of the > EFER_LMA bit with the understanding that it would be set automatically. > > However, since the guest never explicitly sets the EFER_LMA bit, the > host never becomes aware that it has been set. This becomes problematic > when userspace tries to get/set the EFER values via > KVM_GET_SREGS/KVM_SET_SREGS, since the EFER contents tracked by the host > will be missing the EFER_LMA bit, and when userspace attempts to pass > the EFER value back via KVM_SET_SREGS it will fail a sanity check that > asserts that EFER_LMA should always be set when X86_CR0_PG and EFER_LME > are set. > > Fix this by always inferring the value of EFER_LMA based on X86_CR0_PG > and EFER_LME, regardless of whether or not SEV-ES is enabled. Blech. This is a hack to fix even worse hacks. KVM ignores CR0/CR4/EFER values that are set via KVM_SET_SREGS, i.e. KVM is rejecting an EFER value that it will never consume, which is ridiculous. And the fact that you're not trying to have KVM actually set state further strengthens my assertion that tracking CR0/CR4/EFER in KVM is pointless necessary for SEV-ES+ guests[1]. This also fixes the _source_, which makes it far less useful, e.g. live migrating to fixed version of KVM won't work. So my very strong preference is to first skip the kvm_is_valid_sregs() check, and then in a follow-up series disable the CR0/CR4/EFER traps and probably use maximal (according to the guest's capabilities) "defaults" for CR0/CR4/EFER (and I suppose XCR0 and XSS too), i.e. assume the vCPU is in 64-bit mode with everything enabled. My understanding is that SVM_VMGEXIT_AP_CREATION is going to force KVM to assume maximal state anyways since KVM will have no way of verifying what state is actually shoved into the VMSA, i.e. emulating INIT is wildly broken[2]. Side topic, Peter suspected that KVM _does_ need to let userspace set CR8 since that's not captured in the VMSA[3]. [1] https://lore.kernel.org/all/YJla8vpwqCxqgS8C@xxxxxxxxxx [2] https://lore.kernel.org/all/20231016132819.1002933-38-michael.roth@xxxxxxx [3] https://lore.kernel.org/all/CAMkAt6oL9tfF5rvP0htbQNDPr50Zk41Q4KP-dM0N+SJ7xmsWvw@xxxxxxxxxxxxxx diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 2c924075f6f1..6fb2b913009e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -11620,7 +11620,8 @@ static int __set_sregs_common(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs, int idx; struct desc_ptr dt; - if (!kvm_is_valid_sregs(vcpu, sregs)) + if (!vcpu->arch.guest_state_protected && + !kvm_is_valid_sregs(vcpu, sregs)) return -EINVAL; apic_base_msr.data = sregs->apic_base;