On Thu, May 12, 2022 at 1:34 PM Jon Kohler <jon@xxxxxxxxxxx> wrote:
>
>
>
> > On May 12, 2022, at 4:27 PM, Jim Mattson <jmattson@xxxxxxxxxx> wrote:
> >
> > On Thu, May 12, 2022 at 1:07 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >>
> >> On Thu, May 12, 2022, Jon Kohler wrote:
> >>>
> >>>
> >>>> On May 12, 2022, at 3:35 PM, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >>>>
> >>>> On Thu, May 12, 2022, Sean Christopherson wrote:
> >>>>> On Thu, May 12, 2022, Jon Kohler wrote:
> >>>>>> Remove IBPB that is done on KVM vCPU load, as the guest-to-guest
> >>>>>> attack surface is already covered by switch_mm_irqs_off() ->
> >>>>>> cond_mitigation().
> >>>>>>
> >>>>>> The original commit 15d45071523d ("KVM/x86: Add IBPB support") was simply
> >>>>>> wrong in its guest-to-guest design intention. There are three scenarios
> >>>>>> at play here:
> >>>>>
> >>>>> Jim pointed offline that there's a case we didn't consider. When switching between
> >>>>> vCPUs in the same VM, an IBPB may be warranted as the tasks in the VM may be in
> >>>>> different security domains. E.g. the guest will not get a notification that vCPU0 is
> >>>>> being swapped out for vCPU1 on a single pCPU.
> >>>>>
> >>>>> So, sadly, after all that, I think the IBPB needs to stay. But the documentation
> >>>>> most definitely needs to be updated.
> >>>>>
> >>>>> A per-VM capability to skip the IBPB may be warranted, e.g. for container-like
> >>>>> use cases where a single VM is running a single workload.
> >>>>
> >>>> Ah, actually, the IBPB can be skipped if the vCPUs have different mm_structs,
> >>>> because then the IBPB is fully redundant with respect to any IBPB performed by
> >>>> switch_mm_irqs_off(). Hrm, though it might need a KVM or per-VM knob, e.g. just
> >>>> because the VMM doesn't want IBPB doesn't mean the guest doesn't want IBPB.
> >>>>
> >>>> That would also sidestep the largely theoretical question of whether vCPUs from
> >>>> different VMs but the same address space are in the same security domain. It doesn't
> >>>> matter, because even if they are in the same domain, KVM still needs to do IBPB.
> >>>
> >>> So should we go back to the earlier approach where we have it be only
> >>> IBPB on always_ibpb? Or what?
> >>>
> >>> At minimum, we need to fix the unilateral-ness of all of this :) since we’re
> >>> IBPB’ing even when the user did not explicitly tell us to.
> >>
> >> I think we need separate controls for the guest. E.g. if the userspace VMM is
> >> sufficiently hardened then it can run without "do IBPB" flag, but that doesn't
> >> mean that the entire guest it's running is sufficiently hardened.
> >>
> >>> That said, since I just re-read the documentation today, it does specifically
> >>> suggest that if the guest wants to protect *itself* it should turn on IBPB or
> >>> STIBP (or other mitigations galore), so I think we end up having to think
> >>> about what our “contract” is with users who host their workloads on
> >>> KVM - are they expecting us to protect them in any/all cases?
> >>>
> >>> Said another way, the internal guest areas of concern aren’t something
> >>> the kernel would always be able to A) identify far in advance and B)
> >>> always solve on the users behalf. There is an argument to be made
> >>> that the guest needs to deal with its own house, yea?
> >>
> >> The issue is that the guest won't get a notification if vCPU0 is replaced with
> >> vCPU1 on the same physical CPU, thus the guest doesn't get an opportunity to emit
> >> IBPB. Since the host doesn't know whether or not the guest wants IBPB, unless the
> >> owner of the host is also the owner of the guest workload, the safe approach is to
> >> assume the guest is vulnerable.
> >
> > Exactly. And if the guest has used taskset as its mitigation strategy,
> > how is the host to know?
>
> Yea thats fair enough.
> I posed a solution on Sean’s response just as this email
> came in, would love to know your thoughts (keying off MSR bitmap).
>

I don't believe this works. The IBPBs in cond_mitigation (static in
arch/x86/mm/tlb.c) won't be triggered if the guest has given its
sensitive tasks exclusive use of their cores. And, if performance is a
concern, that is exactly what someone would do.
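
To make the scenario concrete, here is a toy Python model of one physical
CPU. It is a sketch under stated assumptions, not kernel code: switch_mm()
below is a simplified stand-in for the switch_mm_irqs_off() ->
cond_mitigation() path that issues a barrier on every mm change (the real
path is conditional), vcpu_load() is a hypothetical stand-in for the
barrier KVM issues when a different vCPU is loaded, and the names PCpu,
run_schedule and ibpb_on_vcpu_switch are invented for illustration. The
guest is assumed to issue no IBPB of its own, because each vCPU runs one
pinned task and the guest never context switches on it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PCpu:
    """State of one physical CPU in the toy model."""
    last_mm: Optional[int] = None      # host mm that ran here last
    last_vcpu: Optional[str] = None    # vCPU that was loaded here last
    ibpb_count: int = 0                # barriers issued so far


def switch_mm(cpu: PCpu, next_mm: int) -> None:
    # Simplified stand-in for cond_mitigation(): barrier only when the
    # incoming host task has a different mm (assumption: "always" mode).
    if cpu.last_mm is not None and cpu.last_mm != next_mm:
        cpu.ibpb_count += 1
    cpu.last_mm = next_mm


def vcpu_load(cpu: PCpu, vcpu: str, ibpb_on_vcpu_switch: bool) -> None:
    # Hypothetical stand-in for the KVM vCPU-load barrier: fires when a
    # different vCPU is loaded on this pCPU, if the knob is enabled.
    if ibpb_on_vcpu_switch and cpu.last_vcpu is not None and cpu.last_vcpu != vcpu:
        cpu.ibpb_count += 1
    cpu.last_vcpu = vcpu


def run_schedule(schedule: List[Tuple[int, str]], ibpb_on_vcpu_switch: bool) -> int:
    """Run (host_mm, vcpu) pairs on one pCPU and return the IBPB count."""
    cpu = PCpu()
    for mm, vcpu in schedule:
        switch_mm(cpu, mm)
        vcpu_load(cpu, vcpu, ibpb_on_vcpu_switch)
    return cpu.ibpb_count


if __name__ == "__main__":
    # Two vCPUs of the same VM share one host mm (one VMM process).
    same_vm = [(1, "vm0/vcpu0"), (1, "vm0/vcpu1"), (1, "vm0/vcpu0")]
    # Two VMs run in different VMM processes, hence different host mms.
    diff_vm = [(1, "vm0/vcpu0"), (2, "vm1/vcpu0")]
    for name, sched in (("same VM", same_vm), ("different VMs", diff_vm)):
        for knob in (False, True):
            print(name, "vcpu-load IBPB:", knob, "->", run_schedule(sched, knob))
```

In this model the same-VM schedule gets no barrier at all once the
vCPU-load IBPB is removed, since the host mm never changes, while the
different-VM schedule still gets one from the mm switch and the
vCPU-load barrier is redundant there.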