> On May 12, 2022, at 4:07 PM, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> 
> On Thu, May 12, 2022, Jon Kohler wrote:
>> 
>> 
>>> On May 12, 2022, at 3:35 PM, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>>> 
>>> On Thu, May 12, 2022, Sean Christopherson wrote:
>>>> On Thu, May 12, 2022, Jon Kohler wrote:
>>>>> Remove IBPB that is done on KVM vCPU load, as the guest-to-guest
>>>>> attack surface is already covered by switch_mm_irqs_off() ->
>>>>> cond_mitigation().
>>>>> 
>>>>> The original commit 15d45071523d ("KVM/x86: Add IBPB support") was simply
>>>>> wrong in its guest-to-guest design intention. There are three scenarios
>>>>> at play here:
>>>> 
>>>> Jim pointed out offline that there's a case we didn't consider. When
>>>> switching between vCPUs in the same VM, an IBPB may be warranted as the
>>>> tasks in the VM may be in different security domains. E.g. the guest will
>>>> not get a notification that vCPU0 is being swapped out for vCPU1 on a
>>>> single pCPU.
>>>> 
>>>> So, sadly, after all that, I think the IBPB needs to stay. But the
>>>> documentation most definitely needs to be updated.
>>>> 
>>>> A per-VM capability to skip the IBPB may be warranted, e.g. for
>>>> container-like use cases where a single VM is running a single workload.
>>> 
>>> Ah, actually, the IBPB can be skipped if the vCPUs have different
>>> mm_structs, because then the IBPB is fully redundant with respect to any
>>> IBPB performed by switch_mm_irqs_off(). Hrm, though it might need a KVM or
>>> per-VM knob, e.g. just because the VMM doesn't want IBPB doesn't mean the
>>> guest doesn't want IBPB.
>>> 
>>> That would also sidestep the largely theoretical question of whether vCPUs
>>> from different VMs but the same address space are in the same security
>>> domain. It doesn't matter, because even if they are in the same domain,
>>> KVM still needs to do IBPB.
>> 
>> So should we go back to the earlier approach where we have it be only
>> IBPB on always_ibpb? Or what?
>> 
>> At minimum, we need to fix the unilateral-ness of all of this :) since we’re
>> IBPB’ing even when the user did not explicitly tell us to.
> 
> I think we need separate controls for the guest. E.g. if the userspace VMM
> is sufficiently hardened then it can run without the "do IBPB" flag, but
> that doesn't mean that the entire guest it's running is sufficiently
> hardened.

What if we keyed off the MSR bitmap, such that if a guest *ever* issued an
IBPB, KVM can do IBPB on switch? We already disable interception today, so we
have the data, just like we do for SPEC_CTRL.

	if (prev != vmx->loaded_vmcs->vmcs) {
		per_cpu(current_vmcs, cpu) = vmx->loaded_vmcs->vmcs;
		vmcs_load(vmx->loaded_vmcs->vmcs);

		/*
		 * No indirect branch prediction barrier needed when switching
		 * the active VMCS within a guest, e.g. on nested VM-Enter.
		 * The L1 VMM can protect itself with retpolines, IBPB or IBRS.
		 * Only issue this IBPB if the guest itself has ever issued an
		 * IBPB, which would indicate that it cares about prediction
		 * barriers for one or more of its tasks. This guards against
		 * the scenario where the guest has separate security domains
		 * on separate vCPUs, and the kernel switches vCPU-x out for
		 * vCPU-y on the same pCPU before the guest has the chance to
		 * issue its own barrier. In this scenario, switch_mm() ->
		 * cond_mitigation() would not issue its own barrier, because
		 * the vCPUs share a mm_struct.
		 */
		if ((!buddy || WARN_ON_ONCE(buddy->vmcs != prev)) &&
		    !msr_write_intercepted(vmx, MSR_IA32_PRED_CMD))
			indirect_branch_prediction_barrier();
	}
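To make the last point in that comment concrete: switch_mm_irqs_off() only
invokes cond_mitigation() when it is actually changing mm_structs, so two vCPU
tasks sharing one mm never get a host-side barrier. Roughly, it looks like the
below (a simplified sketch of arch/x86/mm/tlb.c with the STIBP and L1D bits
trimmed; not the verbatim upstream code):

	/*
	 * Sketch: only reached on a real mm switch, i.e. prev != next.
	 * vCPU threads of the same VMM process never get here.
	 */
	static void cond_mitigation(struct task_struct *next)
	{
		unsigned long prev_mm, next_mm;

		/* Per-task spec-ctrl bits folded into the mm pointer. */
		next_mm = mm_mangle_tif_spec_bits(next);
		prev_mm = this_cpu_read(cpu_tlbstate.last_user_mm_spec);

		/* ibpb=conditional: barrier only if either task opted in. */
		if (static_branch_likely(&switch_mm_cond_ibpb) &&
		    next_mm != prev_mm &&
		    (next_mm | prev_mm) & LAST_USER_MM_IBPB)
			indirect_branch_prediction_barrier();

		/* ibpb=always: barrier on every cross-mm switch. */
		if (static_branch_unlikely(&switch_mm_always_ibpb) &&
		    next_mm != prev_mm)
			indirect_branch_prediction_barrier();

		this_cpu_write(cpu_tlbstate.last_user_mm_spec, next_mm);
	}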
If the guest isn’t ever issuing an IBPB, one could say that it does not care
about the vCPU-to-vCPU attack surface.

Thoughts?

> 
>> That said, since I just re-read the documentation today, it does
>> specifically suggest that if the guest wants to protect *itself* it should
>> turn on IBPB or STIBP (or other mitigations galore), so I think we end up
>> having to think about what our “contract” is with users who host their
>> workloads on KVM - are they expecting us to protect them in any/all cases?
>> 
>> Said another way, the internal guest areas of concern aren’t something
>> the kernel would always be able to A) identify far in advance and B)
>> always solve on the user’s behalf. There is an argument to be made
>> that the guest needs to deal with its own house, yea?
> 
> The issue is that the guest won't get a notification if vCPU0 is replaced
> with vCPU1 on the same physical CPU, thus the guest doesn't get an
> opportunity to emit an IBPB. Since the host doesn't know whether or not the
> guest wants IBPB, unless the owner of the host is also the owner of the
> guest workload, the safe approach is to assume the guest is vulnerable.
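FWIW, if we do end up wanting the per-VM knob floated earlier in the thread,
I’d imagine something along these lines (entirely hypothetical: the capability
name and the skip_vcpu_load_ibpb field below don’t exist upstream, this is
just a sketch of the usual KVM_ENABLE_CAP plumbing):

	/* In kvm_vm_ioctl_enable_cap(), arch/x86/kvm/x86.c: */
	case KVM_CAP_X86_SKIP_VCPU_LOAD_IBPB:	/* hypothetical */
		r = -EINVAL;
		mutex_lock(&kvm->lock);
		/* Only allow the opt-out before any vCPUs are created. */
		if (!kvm->created_vcpus) {
			kvm->arch.skip_vcpu_load_ibpb = true;
			r = 0;
		}
		mutex_unlock(&kvm->lock);
		break;

vmx_vcpu_load_vmcs() would then do the barrier only when
!kvm->arch.skip_vcpu_load_ibpb, so hosts that own the guest workload (the
container-like case) could opt out while everyone else keeps the safe default.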