On Wed, Oct 2, 2019 at 10:16 AM Krish Sadhukhan
<krish.sadhukhan@xxxxxxxxxx> wrote:
>
>
> On 10/1/19 3:02 PM, Jim Mattson wrote:
> > On Tue, Oct 1, 2019 at 2:23 PM Krish Sadhukhan
> > <krish.sadhukhan@xxxxxxxxxx> wrote:
> >>
> >> On 10/01/2019 10:09 AM, Jim Mattson wrote:
> >>> On Mon, Sep 30, 2019 at 5:12 PM Krish Sadhukhan
> >>> <krish.sadhukhan@xxxxxxxxxx> wrote:
> >>>> According to section “VM Entries” in Intel SDM vol 3C, VM-entry checks are
> >>>> performed in a certain order. Checks on MSRs that are loaded on VM-entry
> >>>> from the VM-entry MSR-load area should be done after verifying VMCS
> >>>> controls, host-state area and guest-state area. As KVM relies on CPU
> >>>> hardware to perform some of these checks, we need to defer a VM-exit due
> >>>> to an invalid VM-entry MSR-load area until after CPU hardware completes
> >>>> the earlier checks and is ready to do VMLAUNCH/VMRESUME.
> >>>>
> >>>> In order to defer errors arising from an invalid VM-entry MSR-load area in
> >>>> vmcs12, we set up a single invalid entry, which is illegal according to
> >>>> section "Loading MSRs" in Intel SDM vol 3C, in the VM-entry MSR-load area
> >>>> of vmcs02. This will cause the CPU hardware to VM-exit with "VM-entry failure
> >>>> due to MSR loading" after it completes checks on VMCS controls, host-state
> >>>> area and guest-state area. We reflect a synthesized Exit Qualification to
> >>>> our guest.
> >>> This change addresses the priority inversion, but it still potentially
> >>> leaves guest MSRs incorrectly loaded with values from the VMCS12
> >>> VM-entry MSR-load area when a higher priority error condition would
> >>> have precluded any processing of the VM-entry MSR-load area.
> >> Maybe we should not load any guest MSR until we have checked the
> >> entire vmcs12 MSR-load area in nested_vmx_load_msr()?
> > That's not sufficient. We shouldn't load any guest MSR until all of
> > the checks on host state, controls, and guest state have passed. Since
> > we defer the guest state checks to hardware, that makes the proper
> > ordering a bit problematic. Instead, you can try to arrange to undo
> > the guest MSR loads if a higher priority error is discovered. That
> > won't be trivial.
>
>
> We discussed undoing guest MSR loads in an earlier thread. You
> said that some MSR updates couldn't be rolled back. If we can't roll back
> some MSR updates and if that leads to an undefined guest configuration,
> maybe we should return to L0 and mark it as a hardware failure, like
> what we currently do for some VM-entry failures in vmx_handle_exit():

The MSR loads that can't be rolled back are mainly the ones with
side effects, like IA32_PRED_CMD. I think these are mostly benign
(except, perhaps, from a performance perspective). I think it's worth
investigating rollback as a partial solution to this problem.
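If we go that route, here's a rough sketch of the sort of thing I have
in mind (hypothetical helper names, untested, assuming the
kvm_get_msr()/kvm_set_msr() variants that take an index and a value;
the per-entry validation that nested_vmx_load_msr() does is omitted
for brevity):

struct msr_rollback_entry {
        u32 index;
        u64 old_value;
};

static int nested_vmx_load_msr_with_rollback(struct kvm_vcpu *vcpu, u64 gpa,
                                             u32 count,
                                             struct msr_rollback_entry *undo)
{
        struct vmx_msr_entry e;
        u32 i;

        for (i = 0; i < count; i++) {
                /* Read the next vmcs12 MSR-load entry from guest memory. */
                if (kvm_vcpu_read_guest(vcpu, gpa + i * sizeof(e),
                                        &e, sizeof(e)))
                        return i + 1;   /* exit qualification is 1-based */
                /* Snapshot the current value before overwriting it. */
                undo[i].index = e.index;
                if (kvm_get_msr(vcpu, e.index, &undo[i].old_value))
                        return i + 1;
                if (kvm_set_msr(vcpu, e.index, e.value))
                        return i + 1;
        }
        return 0;
}

/* Undo the first n loads if a higher priority error is found later. */
static void nested_vmx_rollback_msrs(struct kvm_vcpu *vcpu,
                                     struct msr_rollback_entry *undo, u32 n)
{
        while (n--)
                kvm_set_msr(vcpu, undo[n].index, undo[n].old_value);
}

Restoring the old value obviously can't undo the architectural side
effects of something like a write to IA32_PRED_CMD, but it would at
least leave the MSR values themselves consistent.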
>
>            if (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) {
>                    dump_vmcs();
>                    vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
>                    vcpu->run->fail_entry.hardware_entry_failure_reason
>                            = exit_reason;
>                    return 0;
>            }
>
>
> >>>> Suggested-by: Jim Mattson <jmattson@xxxxxxxxxx>
> >>>> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@xxxxxxxxxx>
> >>>> Reviewed-by: Mihai Carabas <mihai.carabas@xxxxxxxxxx>
> >>>> Reviewed-by: Liran Alon <liran.alon@xxxxxxxxxx>
> >>>> ---
> >>>>  arch/x86/kvm/vmx/nested.c | 34 +++++++++++++++++++++++++++++++---
> >>>>  arch/x86/kvm/vmx/nested.h | 14 ++++++++++++--
> >>>>  arch/x86/kvm/vmx/vmcs.h   |  6 ++++++
> >>>>  3 files changed, 49 insertions(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> >>>> index ced9fba32598..b74491c04090 100644
> >>>> --- a/arch/x86/kvm/vmx/nested.c
> >>>> +++ b/arch/x86/kvm/vmx/nested.c
> >>>> @@ -3054,12 +3054,40 @@ int nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu, bool from_vmentry)
> >>>>                 goto vmentry_fail_vmexit_guest_mode;
> >>>>
> >>>>         if (from_vmentry) {
> >>>> -               exit_reason = EXIT_REASON_MSR_LOAD_FAIL;
> >>>>                 exit_qual = nested_vmx_load_msr(vcpu,
> >>>>                                                 vmcs12->vm_entry_msr_load_addr,
> >>>>                                                 vmcs12->vm_entry_msr_load_count);
> >>>> -               if (exit_qual)
> >>>> -                       goto vmentry_fail_vmexit_guest_mode;
> >>>> +               if (exit_qual) {
> >>>> +                       /*
> >>>> +                        * According to section “VM Entries” in Intel SDM
> >>>> +                        * vol 3C, VM-entry checks are performed in a certain
> >>>> +                        * order. Checks on MSRs that are loaded on VM-entry
> >>>> +                        * from the VM-entry MSR-load area should be done
> >>>> +                        * after verifying VMCS controls, host-state area and
> >>>> +                        * guest-state area. As KVM relies on CPU hardware to
> >>>> +                        * perform some of these checks, we need to defer a
> >>>> +                        * VM-exit due to an invalid VM-entry MSR-load area
> >>>> +                        * until after CPU hardware completes the earlier
> >>>> +                        * checks and is ready to do VMLAUNCH/VMRESUME.
> >>>> +                        *
> >>>> +                        * In order to defer errors arising from an invalid
> >>>> +                        * VM-entry MSR-load area in vmcs12, we set up a
> >>>> +                        * single invalid entry, which is illegal according
> >>>> +                        * to section "Loading MSRs" in Intel SDM vol 3C, in
> >>>> +                        * the VM-entry MSR-load area of vmcs02. This will
> >>>> +                        * cause the CPU hardware to VM-exit with "VM-entry
> >>>> +                        * failure due to MSR loading" after it completes
> >>>> +                        * checks on VMCS controls, host-state area and
> >>>> +                        * guest-state area.
> >>>> +                        */
> >>>> +                       vmx->loaded_vmcs->invalid_msr_load_area.index =
> >>>> +                               MSR_FS_BASE;
> >>> Can this field be statically populated during initialization?
> >> Yes.
> >>
> >>>> +                       vmx->loaded_vmcs->invalid_msr_load_area.value =
> >>>> +                               exit_qual;
> >>> This seems awkward. Why not save 16 bytes per loaded_vmcs by
> >>> allocating one invalid_msr_load_area system-wide and just add a 4-byte
> >>> field to struct nested_vmx to store this value?
> >> OK.
> >>
> >>>> +                       vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 1);
> >>>> +                       vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR,
> >>>> +                               __pa(&(vmx->loaded_vmcs->invalid_msr_load_area)));
> >>>> +               }
> >>> Do you need to set vmx->nested.dirty_vmcs12 to ensure that
> >>> prepare_vmcs02_constant_state() will be called at the next emulated
> >>> VM-entry, to undo this change to VM_ENTRY_MSR_LOAD_ADDR?
> >> Yes.
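To make the suggestion concrete, something like the following is what
I have in mind (illustrative only: invalid_msr_load_area becomes a
single system-wide static, and msr_load_exit_qual is a hypothetical
u32 in struct nested_vmx):

/*
 * A single system-wide MSR-load entry that the CPU is guaranteed to
 * reject: per section "Loading MSRs" in Intel SDM vol 3C, an entry
 * whose index is IA32_FS_BASE always fails. Pointing vmcs02's VM-entry
 * MSR-load area at it forces "VM-entry failure due to MSR loading",
 * but only after all higher priority checks have passed. The SDM
 * requires the MSR-load address to be 16-byte aligned.
 */
static const struct vmx_msr_entry invalid_msr_load_area __aligned(16) = {
        .index = MSR_FS_BASE,
};

static void nested_vmx_defer_msr_load_failure(struct vcpu_vmx *vmx,
                                              u32 exit_qual)
{
        /* Hypothetical 4-byte field in struct nested_vmx. */
        vmx->nested.msr_load_exit_qual = exit_qual;

        vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, 1);
        vmcs_write64(VM_ENTRY_MSR_LOAD_ADDR,
                     __pa_symbol(&invalid_msr_load_area));

        /*
         * Make sure the next emulated VM-entry rewrites
         * VM_ENTRY_MSR_LOAD_ADDR and VM_ENTRY_MSR_LOAD_COUNT,
         * undoing this change.
         */
        vmx->nested.dirty_vmcs12 = true;
}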
> >>
> >>>>         } else {
> >>>>                 /*
> >>>>                  * The MMU is not initialized to point at the right entities yet and
> >>>> diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
> >>>> index 187d39bf0bf1..f3a384235b68 100644
> >>>> --- a/arch/x86/kvm/vmx/nested.h
> >>>> +++ b/arch/x86/kvm/vmx/nested.h
> >>>> @@ -64,7 +64,9 @@ static inline bool nested_ept_ad_enabled(struct kvm_vcpu *vcpu)
> >>>>  static inline int nested_vmx_reflect_vmexit(struct kvm_vcpu *vcpu,
> >>>>                                             u32 exit_reason)
> >>>>  {
> >>>> +       u32 exit_qual;
> >>>>         u32 exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);
> >>>> +       struct vcpu_vmx *vmx = to_vmx(vcpu);
> >>>>
> >>>>         /*
> >>>>          * At this point, the exit interruption info in exit_intr_info
> >>>> @@ -81,8 +83,16 @@ static inline int nested_vmx_reflect_vmexit(struct kvm_vcpu *vcpu,
> >>>>                         vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
> >>>>         }
> >>>>
> >>>> -       nested_vmx_vmexit(vcpu, exit_reason, exit_intr_info,
> >>>> -                         vmcs_readl(EXIT_QUALIFICATION));
> >>>> +       exit_qual = vmcs_readl(EXIT_QUALIFICATION);
> >>>> +
> >>>> +       if (vmx->loaded_vmcs->invalid_msr_load_area.index == MSR_FS_BASE &&
> >>>> +           (exit_reason == (VMX_EXIT_REASONS_FAILED_VMENTRY |
> >>>> +                            EXIT_REASON_MSR_LOAD_FAIL))) {
> >>> Is the second conjunct sufficient? I.e., isn't there a bug in KVM if
> >>> the second conjunct is true and the first is not?
> >> If the first conjunct isn't true and the second one is, it means it's a
> >> hardware-detected MSR-load error. Right? So won't that be handled the
> >> same way it is handled currently in KVM?
> > I believe that KVM will currently forward such an error to L1, which
> > is wrong, since L1 has no control over vmcs02's VM-entry MSR-load
> > area.
>
>
> You are right. I hadn't noticed it until you mentioned it.
>
>
> > The assumption is that this will never happen, because L0 is too
> > careful.
> >>>> +               exit_qual = vmx->loaded_vmcs->invalid_msr_load_area.value;
> >>>> +       }
> >>>> +
> >>>> +       nested_vmx_vmexit(vcpu, exit_reason, exit_intr_info, exit_qual);
> >>>> +
> >>>>         return 1;
> >>>>  }
> >>>>
> >>>> diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
> >>>> index 481ad879197b..e272788bd4b8 100644
> >>>> --- a/arch/x86/kvm/vmx/vmcs.h
> >>>> +++ b/arch/x86/kvm/vmx/vmcs.h
> >>>> @@ -70,6 +70,12 @@ struct loaded_vmcs {
> >>>>         struct list_head loaded_vmcss_on_cpu_link;
> >>>>         struct vmcs_host_state host_state;
> >>>>         struct vmcs_controls_shadow controls_shadow;
> >>>> +       /*
> >>>> +        * This field is used to set up an invalid VM-entry MSR-load area
> >>>> +        * for vmcs02 if an error is detected while processing the entries
> >>>> +        * in the VM-entry MSR-load area of vmcs12.
> >>>> +        */
> >>>> +       struct vmx_msr_entry invalid_msr_load_area;
> >>>>  };
> >>> I'd suggest allocating just one invalid_msr_load_area system-wide, as
> >>> mentioned above.
> >>>
> >>>>  static inline bool is_exception_n(u32 intr_info, u8 vector)
> >>>> --
> >>>> 2.20.1
> >>>>
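Building on the hypothetical msr_load_exit_qual field above, the
reflect path could then separate the two cases explicitly (again just
a sketch, not tested):

static int nested_vmx_handle_msr_load_fail(struct kvm_vcpu *vcpu,
                                           u32 exit_reason,
                                           u32 exit_intr_info)
{
        struct vcpu_vmx *vmx = to_vmx(vcpu);
        u32 exit_qual = vmx->nested.msr_load_exit_qual; /* hypothetical */

        if (exit_qual) {
                /* Failure synthesized by L0 on behalf of vmcs12: reflect. */
                vmx->nested.msr_load_exit_qual = 0;
                nested_vmx_vmexit(vcpu, exit_reason, exit_intr_info,
                                  exit_qual);
                return 1;
        }

        /*
         * L1 has no control over vmcs02's VM-entry MSR-load area, so a
         * genuine hardware failure here is an L0 bug. Report it to
         * userspace instead of forwarding it to L1.
         */
        vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
        vcpu->run->fail_entry.hardware_entry_failure_reason = exit_reason;
        return 0;
}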