On Thu, Aug 26, 2021, Maxim Levitsky wrote:
> SMM return code switches CPU to real mode, and
> then the nested_vmx_enter_non_root_mode first switches to vmcs02,
> and then restores CR0 in the KVM register cache.
>
> Unfortunately when it restores the CR0, this enables the protection mode
> which leads us to "restore" the segment registers from
> "real mode segment cache", which is not up to date vs L2 and trips
> 'vmx_guest_state_valid check' later, when the
> unrestricted guest mode is not enabled.

I suspect this is slightly inaccurate.  When loading vmcs02, vmx_switch_vmcs()
will do vmx_register_cache_reset(), which also causes the segment cache to be
reset.  enter_pmode() will still load stale values, but they'll come from
vmcs02, not KVM's segment register cache.

> This happens to work otherwise, because after we enter the nested guest,
> we restore its register state again from SMRAM with correct values
> and that includes the segment values.
>
> As a workaround to this if we enter protected mode first,
> then setting CR0 won't cause this damage.
>
> Signed-off-by: Maxim Levitsky <mlevitsk@xxxxxxxxxx>
> ---
>  arch/x86/kvm/vmx/vmx.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 0c2c0d5ae873..805c415494cf 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7507,6 +7507,13 @@ static int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
>  	}
>
>  	if (vmx->nested.smm.guest_mode) {
> +
> +		/*
> +		 * Enter protected mode to avoid clobbering L2's segment
> +		 * registers during nested guest entry
> +		 */
> +		vmx_set_cr0(vcpu, vcpu->arch.cr0 | X86_CR0_PE);

I'd really, really, reaaaally like to avoid stuffing state.  All of the
instances I've come across where KVM has stuffed state for something like this
were just papering over one symptom of an underlying bug.

For example, won't this now cause the same bad behavior if L2 is in Real Mode?

Is the problem purely that emulation_required is stale?  If so, how is it stale?
Every segment write as part of RSM emulation should reevaluate
emulation_required via vmx_set_segment().

Oooooh, or are you talking about the explicit vmx_guest_state_valid() in
prepare_vmcs02()?  If that's the case, then we likely should skip that check
entirely.  The only part I'm not 100% clear on is whether or not it can/should
be skipped for vmx_set_nested_state().

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index bc6327950657..20bd84554c1f 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2547,7 +2547,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 	 * which means L1 attempted VMEntry to L2 with invalid state.
 	 * Fail the VMEntry.
 	 */
-	if (CC(!vmx_guest_state_valid(vcpu))) {
+	if (from_vmentry && CC(!vmx_guest_state_valid(vcpu))) {
 		*entry_failure_code = ENTRY_FAIL_DEFAULT;
 		return -EINVAL;
 	}

If we want to retain the check for the common vmx_set_nested_state() path,
i.e. when the vCPU is truly being restored to guest mode, then we can simply
exempt the smm.guest_mode case (which also exempts that case when it's set via
vmx_set_nested_state()).  The argument would be that RSM is going to restore
L2 state, so whatever happens to be in vmcs12/vmcs02 is stale.

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index bc6327950657..ac30ba6a8592 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2547,7 +2547,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 	 * which means L1 attempted VMEntry to L2 with invalid state.
 	 * Fail the VMEntry.
 	 */
-	if (CC(!vmx_guest_state_valid(vcpu))) {
+	if (!vmx->nested.smm.guest_mode && CC(!vmx_guest_state_valid(vcpu))) {
 		*entry_failure_code = ENTRY_FAIL_DEFAULT;
 		return -EINVAL;
 	}
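
To make the difference between the two variants concrete, here is a toy,
userspace-only model of just the gating condition.  None of this is KVM code:
struct toy_vcpu and the three check_*() helpers are hypothetical names, and
state_valid merely stands in for whatever vmx_guest_state_valid() would return.
It also assumes that both the RSM path and vmx_set_nested_state() reach the
check with from_vmentry == false.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two bits of state the check depends on. */
struct toy_vcpu {
	bool state_valid;	/* models the result of vmx_guest_state_valid() */
	bool smm_guest_mode;	/* models vmx->nested.smm.guest_mode */
};

/* Current behavior: the check is unconditional. */
int check_orig(const struct toy_vcpu *v, bool from_vmentry)
{
	(void)from_vmentry;
	return v->state_valid ? 0 : -EINVAL;
}

/* First variant: only check on a real VMEntry from L1. */
int check_from_vmentry(const struct toy_vcpu *v, bool from_vmentry)
{
	if (from_vmentry && !v->state_valid)
		return -EINVAL;
	return 0;
}

/* Second variant: check unless RSM is about to restore L2 state. */
int check_not_smm(const struct toy_vcpu *v, bool from_vmentry)
{
	(void)from_vmentry;
	if (!v->smm_guest_mode && !v->state_valid)
		return -EINVAL;
	return 0;
}
```

In this model, an RSM with stale (invalid) state and smm_guest_mode set fails
only under check_orig().  A vmx_set_nested_state()-style restore with invalid
state and smm_guest_mode clear is let through by check_from_vmentry() but still
rejected by check_not_smm(), which is the trade-off between the two diffs.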