On Wed, Dec 06, 2023 at 07:20:14PM +0200, Maxim Levitsky wrote: > On Tue, 2023-12-05 at 16:28 -0600, Michael Roth wrote: > > Commit 7191f24c7fcf ("accel/kvm/kvm-all: Handle register access errors") > > added error checking for KVM_SET_SREGS/KVM_SET_SREGS2. In doing so, it > > exposed a long-running bug in current KVM support for SEV-ES where the > > kernel assumes that MSR_EFER_LMA will be set explicitly by the guest > > kernel, in which case EFER write traps would result in KVM eventually > > seeing MSR_EFER_LMA get set and recording it in such a way that it would > > be subsequently visible when accessing it via KVM_GET_SREGS/etc. > > > > However, guests kernels currently rely on MSR_EFER_LMA getting set > > automatically when MSR_EFER_LME is set and paging is enabled via > > CR0_PG_MASK. As a result, the EFER write traps don't actually expose the > > MSR_EFER_LMA even though it is set internally, and when QEMU > > subsequently tries to pass this EFER value back to KVM via > > KVM_SET_SREGS* it will fail various sanity checks and return -EINVAL, > > which is now considered fatal due to the aforementioned QEMU commit. > > > > This can be addressed by inferring the MSR_EFER_LMA bit being set when > > paging is enabled and MSR_EFER_LME is set, and synthesizing it to ensure > > the expected bits are all present in subsequent handling on the host > > side. > > > > Ultimately, this handling will be implemented in the host kernel, but to > > avoid breaking QEMU's SEV-ES support when using older host kernels, the > > same handling can be done in QEMU just after fetching the register > > values via KVM_GET_SREGS*. Implement that here. > > > > Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx> > > Cc: Marcelo Tosatti <mtosatti@xxxxxxxxxx> > > Cc: Tom Lendacky <thomas.lendacky@xxxxxxx> > > Cc: Akihiko Odaki <akihiko.odaki@xxxxxxxxxx> > > Cc: kvm@xxxxxxxxxxxxxxx > > Fixes: 7191f24c7fcf ("accel/kvm/kvm-all: Handle register access errors") > > Signed-off-by: Michael Roth <michael.roth@xxxxxxx> > > --- > > v2: > > - Add handling for KVM_GET_SREGS, not just KVM_GET_SREGS2 > > > > target/i386/kvm/kvm.c | 14 ++++++++++++++ > > 1 file changed, 14 insertions(+) > > > > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c > > index 11b8177eff..8721c1bf8f 100644 > > --- a/target/i386/kvm/kvm.c > > +++ b/target/i386/kvm/kvm.c > > @@ -3610,6 +3610,7 @@ static int kvm_get_sregs(X86CPU *cpu) > > { > > CPUX86State *env = &cpu->env; > > struct kvm_sregs sregs; > > + target_ulong cr0_old; > > int ret; > > > > ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_SREGS, &sregs); > > @@ -3637,12 +3638,18 @@ static int kvm_get_sregs(X86CPU *cpu) > > env->gdt.limit = sregs.gdt.limit; > > env->gdt.base = sregs.gdt.base; > > > > + cr0_old = env->cr[0]; > > env->cr[0] = sregs.cr0; > > env->cr[2] = sregs.cr2; > > env->cr[3] = sregs.cr3; > > env->cr[4] = sregs.cr4; > > > > env->efer = sregs.efer; > > + if (sev_es_enabled() && env->efer & MSR_EFER_LME) { > > + if (!(cr0_old & CR0_PG_MASK) && env->cr[0] & CR0_PG_MASK) { > > + env->efer |= MSR_EFER_LMA; > > + } > > + } > > I think that we should not check that CR0_PG has changed, and just blindly assume > that if EFER.LME is set and CR0.PG is set, then EFER.LMA must be set as defined in x86 spec. > > Otherwise, suppose qemu calls kvm_get_sregs twice: First time it will work, > but second time CR0.PG will match one that is stored in the env, and thus the workaround > will not be executed, and instead we will revert back to wrong EFER value > reported by the kernel. > > How about something like that: > > > if (sev_es_enabled() && env->efer & MSR_EFER_LME && env->cr[0] & CR0_PG_MASK) { > /* > * Workaround KVM bug, because of which KVM might not be aware of the > * fact that EFER.LMA was toggled by the hardware > */ > env->efer |= MSR_EFER_LMA; > } Hi Maxim, I'd already sent a v3 based on a similar suggestion from Paolo: https://lists.gnu.org/archive/html/qemu-devel/2023-12/msg00751.html Does that one look okay to you? Thanks, Mike > > > Best regards, > Maxim Levitsky > > > > > /* changes to apic base and cr8/tpr are read back via kvm_arch_post_run */ > > x86_update_hflags(env); > > @@ -3654,6 +3661,7 @@ static int kvm_get_sregs2(X86CPU *cpu) > > { > > CPUX86State *env = &cpu->env; > > struct kvm_sregs2 sregs; > > + target_ulong cr0_old; > > int i, ret; > > > > ret = kvm_vcpu_ioctl(CPU(cpu), KVM_GET_SREGS2, &sregs); > > @@ -3676,12 +3684,18 @@ static int kvm_get_sregs2(X86CPU *cpu) > > env->gdt.limit = sregs.gdt.limit; > > env->gdt.base = sregs.gdt.base; > > > > + cr0_old = env->cr[0]; > > env->cr[0] = sregs.cr0; > > env->cr[2] = sregs.cr2; > > env->cr[3] = sregs.cr3; > > env->cr[4] = sregs.cr4; > > > > env->efer = sregs.efer; > > + if (sev_es_enabled() && env->efer & MSR_EFER_LME) { > > + if (!(cr0_old & CR0_PG_MASK) && env->cr[0] & CR0_PG_MASK) { > > + env->efer |= MSR_EFER_LMA; > > + } > > + } > > > > env->pdptrs_valid = sregs.flags & KVM_SREGS2_FLAGS_PDPTRS_VALID; > > > >