On 08/01/2018 21:00, Liran Alon wrote:
> 
> 
> On 08/01/18 20:08, Paolo Bonzini wrote:
>> From: Tom Lendacky <thomas.lendacky@xxxxxxx>
>>
>> Set IBPB (Indirect Branch Prediction Barrier) when the current CPU is
>> going to run a VCPU different from what was previously run.  Nested
>> virtualization uses the same VMCB for the second level guest, but the
>> L1 hypervisor should be using IBRS to protect itself.
>>
>> Signed-off-by: Tom Lendacky <thomas.lendacky@xxxxxxx>
>> Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx>
>> ---
>>  arch/x86/kvm/svm.c | 31 +++++++++++++++++++++++++++++++
>>  1 file changed, 31 insertions(+)
>>
>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> index 779981a00d01..dd744d896cec 100644
>> --- a/arch/x86/kvm/svm.c
>> +++ b/arch/x86/kvm/svm.c
>> @@ -289,6 +289,7 @@ struct amd_svm_iommu_ir {
>>  module_param(vgif, int, 0444);
>>
>>  static bool __read_mostly have_spec_ctrl;
>> +static bool __read_mostly have_ibpb_support;
>>
>>  static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
>>  static void svm_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa);
>> @@ -540,6 +541,7 @@ struct svm_cpu_data {
>>      struct kvm_ldttss_desc *tss_desc;
>>
>>      struct page *save_area;
>> +    struct vmcb *current_vmcb;
>>  };
>>
>>  static DEFINE_PER_CPU(struct svm_cpu_data *, svm_data);
>> @@ -1151,6 +1153,11 @@ static __init int svm_hardware_setup(void)
>>          pr_info("kvm: SPEC_CTRL available\n");
>>      else
>>          pr_info("kvm: SPEC_CTRL not available\n");
>> +    have_ibpb_support = have_spec_ctrl || cpu_has_ibpb_support();
>> +    if (have_ibpb_support)
>> +        pr_info("kvm: IBPB_SUPPORT available\n");
>> +    else
>> +        pr_info("kvm: IBPB_SUPPORT not available\n");
>>
>>      return 0;
>>
>> @@ -1725,11 +1732,19 @@ static void svm_free_vcpu(struct kvm_vcpu *vcpu)
>>      __free_pages(virt_to_page(svm->nested.msrpm), MSRPM_ALLOC_ORDER);
>>      kvm_vcpu_uninit(vcpu);
>>      kmem_cache_free(kvm_vcpu_cache, svm);
>> +
>> +    /*
>> +     * The VMCB could be recycled, causing a false negative in
>> +     * svm_vcpu_load; block speculative execution.
>> +     */
>> +    if (have_ibpb_support)
>> +        wrmsrl(MSR_IA32_PRED_CMD, FEATURE_SET_IBPB);
>>  }
>>
>>  static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>  {
>>      struct vcpu_svm *svm = to_svm(vcpu);
>> +    struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
>>      int i;
>>
>>      if (unlikely(cpu != vcpu->cpu)) {
>> @@ -1758,6 +1773,12 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>      if (static_cpu_has(X86_FEATURE_RDTSCP))
>>          wrmsrl(MSR_TSC_AUX, svm->tsc_aux);
>>
>> +    if (sd->current_vmcb != svm->vmcb) {
>> +        sd->current_vmcb = svm->vmcb;
>> +        if (have_ibpb_support)
>> +            wrmsrl(MSR_IA32_PRED_CMD, FEATURE_SET_IBPB);
>> +    }
>> +
>>      avic_vcpu_load(vcpu, cpu);
>>  }
>>
>> @@ -2798,6 +2819,11 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
>>      if (!nested_vmcb)
>>          return 1;
>>
>> +    /*
>> +     * No need for IBPB here, the L1 hypervisor should be running with
>> +     * IBRS=1 and inserts one already when switching L2 VMs.
>> +     */
>> +
> 
> I am not fully convinced yet this is OK from a security perspective.
> From the CPU's point of view, both L1 & L2 run in the same prediction
> mode (as they are both running in SVM guest mode).  Therefore, IBRS
> will not hide the BHB & BTB of L1 from L2.

Indeed, IBRS will not hide the indirect branch predictor state of L2
user mode from L1 user mode.  On current generation processors, as I
understand it, IBRS simply disables the indirect branch predictor when
set to 1.  Therefore, as long as the L1 hypervisor sets IBRS=1, the
indirect branch predictor state left by L2 does not affect execution of
the L1 hypervisor.

Future processors might have a mode that says "just set IBRS=1 and I'll
do the right thing".  If that's done by indexing the BTB using the full
virtual address, the CPL _and_ the ASID/PCID/VPID, then IBPB is not
needed here because the L2 VM uses a different ASID.  If that's done by
only augmenting the BTB index with the CPL, we'd need an IBPB.  But in
this new world we've been thrown into, that would be a processor bug...

Note that AMD currently doesn't implement IBRS at all.  In order to
mitigate CVE-2017-5715, the hypervisor should issue an IBPB on every
vmexit (in both L0 and L1).  However, the cost of that is very high.
While we may want a paranoia mode that does it, it is outside the scope
of this series (which is more of a "let's unblock QEMU patches" thing
than a complete fix).
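(For concreteness, such a paranoia mode would be little more than an
unconditional barrier on the exit path.  A hypothetical sketch follows;
"ibpb_on_vmexit" is a made-up module parameter, not something in this
series, and the helper just reuses the have_ibpb_support flag from the
patch above:

    /* Hypothetical "paranoia mode" knob, off by default. */
    static bool __read_mostly ibpb_on_vmexit;
    module_param(ibpb_on_vmexit, bool, 0444);

    /* Call on every #VMEXIT, before any host indirect branch runs. */
    static inline void svm_paranoid_ibpb(void)
    {
        if (ibpb_on_vmexit && have_ibpb_support)
            wrmsrl(MSR_IA32_PRED_CMD, FEATURE_SET_IBPB);
    }

Given the cost of IBPB, it would have to default to off.)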
> 6. One can implicitly assume that the L1 hypervisor issued an IBPB
> when loading a VMCB and therefore it doesn't have to do it again in
> L0.  However, I see 2 problems with this approach:
> (a) We don't protect old non-patched L1 hypervisors from Spectre even
> though we could easily do that from L0 with a small performance hit.

Yeah, that's a nice thing to have.  However, I disagree on the "small"
performance hit... on a patched hypervisor, the cost of a double fix
can be very noticeable.

> (b) When the L1 hypervisor runs only a single L2 VM, it doesn't have
> to issue an IBPB when loading the VMCB (as it didn't run any other
> VMCB before) and it should be sufficient from L1's perspective to just
> write 1 to IBRS.  However, in our nested case, this is a security
> hole.

This is the main difference in our reasoning.  I think IBRS=1 (or IBPB
on vmexit) should be enough.

Paolo

> Am I misunderstanding something?
> 
> Regards,
> -Liran

>>      /* Exit Guest-Mode */
>>      leave_guest_mode(&svm->vcpu);
>>      svm->nested.vmcb = 0;
>> @@ -3061,6 +3087,11 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm)
>>      if (!nested_vmcb)
>>          return false;
>>
>> +    /*
>> +     * No need for IBPB here, since the nested VM is less privileged.
>> +     * The L1 hypervisor inserts one already when switching L2 VMs.
>> +     */
>> +
>>      if (!nested_vmcb_checks(nested_vmcb)) {
>>          nested_vmcb->control.exit_code = SVM_EXIT_ERR;
>>          nested_vmcb->control.exit_code_hi = 0;
>>
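As an aside: cpu_has_ibpb_support(), used in the svm_hardware_setup
hunk above, is not defined by this patch; it comes from elsewhere in
the series.  On AMD, IBPB support is advertised via CPUID Fn8000_0008
EBX bit 12, so a minimal sketch of such a helper (the name is the one
the patch uses; the body is my reconstruction, not the actual series
code) would be:

    static bool cpu_has_ibpb_support(void)
    {
        /* AMD: CPUID Fn8000_0008 EBX[12] advertises IBPB support. */
        return cpuid_ebx(0x80000008) & (1 << 12);
    }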