Re: [PATCH] kvm: nVMX: flush TLB when decoded insn != VM-exit reason

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Aug 13, 2020 at 12:03 PM Sean Christopherson
<sean.j.christopherson@xxxxxxxxx> wrote:
>
> On Tue, Jun 16, 2020 at 10:43:05PM +0000, Oliver Upton wrote:
> > It is possible for the instruction emulator to decode a different
> > instruction from what was implied by the VM-exit information provided by
> > hardware in vmcs02. Such is the case when the TLB entry for the guest's
> > IP is out of sync with the appropriate page-table mapping if page
> > installation isn't followed with a TLB flush.
> >
> > Currently, KVM refuses to emulate in these scenarios, instead injecting
> > a #UD into L2. While this does address the security risk of
> > CVE-2020-2732, it could result in spurious #UDs to the L2 guest. Fix
> > this by instead flushing the TLB then resuming L2, allowing hardware to
> > generate the appropriate VM-exit to be reflected into L1.
> >
> > Exceptional handling is also required for RSM and RDTSCP instructions.
> > RDTSCP could be emulated on hardware which doesn't support it,
> > therefore hardware will not generate a RDTSCP VM-exit on L2 resume. The
> > dual-monitor treatment of SMM is not supported in nVMX, which implies
> > that L0 should never handle a RSM instruction. Resuming the guest will
> > only result in another #UD. Avoid getting stuck in a loop with these
> > instructions by injecting a #UD for RSM and the appropriate VM-exit for
> > RDTSCP.
> >
> > Fixes: 07721feee46b ("KVM: nVMX: Don't emulate instructions in guest mode")
> > Cc: Paolo Bonzini <pbonzini@xxxxxxxxxx>
> > Cc: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> > Signed-off-by: Oliver Upton <oupton@xxxxxxxxxx>
> > Reviewed-by: Jim Mattson <jmattson@xxxxxxxxxx>
> > Reviewed-by: Peter Shier <pshier@xxxxxxxxxx>
> > ---
> >  arch/x86/kvm/emulate.c     |  2 ++
> >  arch/x86/kvm/kvm_emulate.h |  1 +
> >  arch/x86/kvm/vmx/vmx.c     | 68 ++++++++++++++++++++++++++++----------
> >  arch/x86/kvm/x86.c         |  2 +-
> >  4 files changed, 55 insertions(+), 18 deletions(-)
> >
> > diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> > index d0e2825ae617..6e56e7a29ba1 100644
> > --- a/arch/x86/kvm/emulate.c
> > +++ b/arch/x86/kvm/emulate.c
> > @@ -5812,6 +5812,8 @@ int x86_emulate_insn(struct x86_emulate_ctxt *ctxt)
> >       }
> >       if (rc == X86EMUL_INTERCEPTED)
> >               return EMULATION_INTERCEPTED;
> > +     if (rc == X86EMUL_RETRY_INSTR)
> > +             return EMULATION_RETRY_INSTR;
> >
> >       if (rc == X86EMUL_CONTINUE)
> >               writeback_registers(ctxt);
> > diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
> > index 43c93ffa76ed..5bfab8d65cd1 100644
> > --- a/arch/x86/kvm/kvm_emulate.h
> > +++ b/arch/x86/kvm/kvm_emulate.h
> > @@ -496,6 +496,7 @@ bool x86_page_table_writing_insn(struct x86_emulate_ctxt *ctxt);
> >  #define EMULATION_OK 0
> >  #define EMULATION_RESTART 1
> >  #define EMULATION_INTERCEPTED 2
> > +#define EMULATION_RETRY_INSTR 3
> >  void init_decode_cache(struct x86_emulate_ctxt *ctxt);
> >  int x86_emulate_insn(struct x86_emulate_ctxt *ctxt);
> >  int emulator_task_switch(struct x86_emulate_ctxt *ctxt,
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 08e26a9518c2..ebfafd7837ba 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7329,12 +7329,11 @@ static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
> >       to_vmx(vcpu)->req_immediate_exit = true;
> >  }
> >
> > -static int vmx_check_intercept_io(struct kvm_vcpu *vcpu,
> > -                               struct x86_instruction_info *info)
> > +static bool vmx_check_intercept_io(struct kvm_vcpu *vcpu,
> > +                                struct x86_instruction_info *info)
> >  {
> >       struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> >       unsigned short port;
> > -     bool intercept;
> >       int size;
> >
> >       if (info->intercept == x86_intercept_in ||
> > @@ -7354,13 +7353,10 @@ static int vmx_check_intercept_io(struct kvm_vcpu *vcpu,
> >        * Otherwise, IO instruction VM-exits are controlled by the IO bitmaps.
> >        */
> >       if (!nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS))
> > -             intercept = nested_cpu_has(vmcs12,
> > -                                        CPU_BASED_UNCOND_IO_EXITING);
> > -     else
> > -             intercept = nested_vmx_check_io_bitmaps(vcpu, port, size);
> > +             return nested_cpu_has(vmcs12,
> > +                                   CPU_BASED_UNCOND_IO_EXITING);
> >
> > -     /* FIXME: produce nested vmexit and return X86EMUL_INTERCEPTED.  */
> > -     return intercept ? X86EMUL_UNHANDLEABLE : X86EMUL_CONTINUE;
> > +     return nested_vmx_check_io_bitmaps(vcpu, port, size);
>
> It might be a slightly bigger patch, but I think it'll be cleaner code in the
> end if this section is reordered to:
>
>         /*
>          * If the 'use IO bitmaps' VM-execution control is 1, IO instruction
>          * VM-exits are controlled by the IO bitmaps, otherwise they depend
>          * on the 'unconditional IO exiting' VM-execution control.
>          */
>         if (nested_cpu_has(vmcs12, CPU_BASED_USE_IO_BITMAPS))
>                 return nested_vmx_check_io_bitmaps(vcpu, port, size);
>
>         return nested_cpu_has(vmcs12, CPU_BASED_UNCOND_IO_EXITING);
>
> >  }
> >
> >  static int vmx_check_intercept(struct kvm_vcpu *vcpu,
> > @@ -7369,6 +7365,7 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
> >                              struct x86_exception *exception)
> >  {
> >       struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > +     bool intercepted;
> >
> >       switch (info->intercept) {
> >       /*
> > @@ -7381,13 +7378,27 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
> >                       exception->error_code_valid = false;
> >                       return X86EMUL_PROPAGATE_FAULT;
> >               }
> > +
> > +             intercepted = nested_cpu_has(vmcs12, CPU_BASED_RDTSC_EXITING);
> > +
> > +             /*
> > +              * RDTSCP could be emulated on a CPU which doesn't support it.
> > +              * As such, flushing the TLB and resuming L2 will result in
> > +              * another #UD rather than a VM-exit to reflect into L1.
> > +              * Instead, synthesize the VM-exit here.
> > +              */
> > +             if (intercepted) {
> > +                     nested_vmx_vmexit(vcpu, EXIT_REASON_RDTSCP, 0, 0);
> > +                     return X86EMUL_INTERCEPTED;
> > +             }
>
> Maybe this instead?
>
>                 if (!nested_cpu_has2(vmcs12, SECONDARY_EXEC_RDTSCP)) {
>                         exception->vector = UD_VECTOR;
>                         exception->error_code_valid = false;
>                         return X86EMUL_PROPAGATE_FAULT;
>                 } else if (nested_cpu_has(vmcs12, CPU_BASED_RDTSC_EXITING)) {
>                         /*
>                          * RDTSCP could be emulated on a CPU which doesn't
>                          * support it.  As such, flushing the TLB and resuming
>                          * L2 will result in another #UD rather than a VM-exit
>                          * to reflect into L1.  Instead, synthesize the VM-exit
>                          * here.
>                          */
>                         nested_vmx_vmexit(vcpu, EXIT_REASON_RDTSCP, 0, 0);
>                         return X86EMUL_INTERCEPTED;
>                 }
>                 intercepted = false;
>
>
> >               break;
> >
> >       case x86_intercept_in:
> >       case x86_intercept_ins:
> >       case x86_intercept_out:
> >       case x86_intercept_outs:
> > -             return vmx_check_intercept_io(vcpu, info);
> > +             intercepted = vmx_check_intercept_io(vcpu, info);
> > +             break;
> >
> >       case x86_intercept_lgdt:
> >       case x86_intercept_lidt:
> > @@ -7397,18 +7408,41 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
> >       case x86_intercept_sidt:
> >       case x86_intercept_sldt:
> >       case x86_intercept_str:
> > -             if (!nested_cpu_has2(vmcs12, SECONDARY_EXEC_DESC))
> > -                     return X86EMUL_CONTINUE;
> > -
> > -             /* FIXME: produce nested vmexit and return X86EMUL_INTERCEPTED.  */
> > +             intercepted = nested_cpu_has2(vmcs12, SECONDARY_EXEC_DESC);
> >               break;
> >
> > -     /* TODO: check more intercepts... */
> > +     /*
> > +      * The dual-monitor treatment of SMM is not supported in nVMX. As such,
> > +      * L0 will never handle the RSM instruction nor should it retry
> > +      * instruction execution. Instead, a #UD should be injected into the
> > +      * guest for the execution of RSM outside of SMM.
> > +      */
> > +     case x86_intercept_rsm:
> > +             exception->vector = UD_VECTOR;
> > +             exception->error_code_valid = false;
> > +             return X86EMUL_PROPAGATE_FAULT;
>
> Why does RSM need special treatment?  Won't it just naturally #UD if we fall
> through to the flush-and-retry path?
>
> >       default:
> > -             break;
> > +             intercepted = true;
> >       }
> >
> > -     return X86EMUL_UNHANDLEABLE;
> > +     if (!intercepted)
> > +             return X86EMUL_CONTINUE;
> > +
> > +     /*
> > +      * The only uses of the emulator in VMX for instructions which may be
> > +      * intercepted are port IO instructions, descriptor-table accesses, and
> > +      * the RDTSCP instruction. As such, if the emulator has decoded an
>
> I wouldn't list out the individual cases, it's pretty obvious what can be
> emulated by looking at the above code, and inevitably something will be added
> that requires updating this comment.
>
> > +      * instruction that is different from the VM-exit provided by hardware
> > +      * it is likely that the TLB entry and page-table mapping for the
>
> Probaby better to avoid talking about "page-table mapping", because it's not
> clear which page tables are being referenced.
>
> > +      * guest's RIP are out of sync.
>
> Maybe something like:
>
>         /*
>          * There are very few instructions that KVM will emulate for L2 and can
>          * also be intercepted by l1.  If the emulator decoded an instruction
>          * that is different from the VM-exit provided by hardware, the TLB
>          * entry for guest's RIP is likely stale.  Rather than synthesizing a
>          * VM-exit into L1 for every possible instruction, just flush the TLB,
>          * resume L2, and let hardware generate the appropriate VM-exit.
>          */
>
> > +      *
> > +      * Rather than synthesizing a VM-exit into L1 for every possible
> > +      * instruction just flush the TLB, resume L2, and let hardware generate
> > +      * the appropriate VM-exit.
> > +      */
> > +     vmx_flush_tlb_gva(vcpu, kvm_rip_read(vcpu));
>
> This is wrong, it should flush kvm_get_linear_rip(vcpu).
>

I do not believe that the aim of this patch will work anymore, since:

1dbf5d68af6f ("KVM: VMX: Add guest physical address check in EPT
violation and misconfig")

Since it is possible to get into the emulator on any instruction that
induces an EPT violation, we'd wind up looping when we believe the
instruction needs to exit to L1 (TLB flush, resume guest, hit the same
EPT violation. Rinse, wash, repeat).

> > +     return X86EMUL_RETRY_INSTR;
> >  }
> >
> >  #ifdef CONFIG_X86_64
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 00c88c2f34e4..2ab47485100f 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -6967,7 +6967,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> >
> >       r = x86_emulate_insn(ctxt);
> >
> > -     if (r == EMULATION_INTERCEPTED)
> > +     if (r == EMULATION_INTERCEPTED || r == EMULATION_RETRY_INSTR)
> >               return 1;
> >
> >       if (r == EMULATION_FAILED) {
> > --
> > 2.27.0.290.gba653c62da-goog
> >



[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]

  Powered by Linux