> On May 12, 2022, at 7:57 PM, Jim Mattson <jmattson@xxxxxxxxxx> wrote:
>
> On Thu, May 12, 2022 at 1:34 PM Jon Kohler <jon@xxxxxxxxxxx> wrote:
>>
>>
>>
>>> On May 12, 2022, at 4:27 PM, Jim Mattson <jmattson@xxxxxxxxxx> wrote:
>>>
>>> On Thu, May 12, 2022 at 1:07 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>>>>
>>>> On Thu, May 12, 2022, Jon Kohler wrote:
>>>>>
>>>>>
>>>>>> On May 12, 2022, at 3:35 PM, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> On Thu, May 12, 2022, Sean Christopherson wrote:
>>>>>>> On Thu, May 12, 2022, Jon Kohler wrote:
>>>>>>>> Remove the IBPB that is done on KVM vCPU load, as the guest-to-guest
>>>>>>>> attack surface is already covered by switch_mm_irqs_off() ->
>>>>>>>> cond_mitigation().
>>>>>>>>
>>>>>>>> The original commit 15d45071523d ("KVM/x86: Add IBPB support") was simply
>>>>>>>> wrong in its guest-to-guest design intention. There are three scenarios
>>>>>>>> at play here:
>>>>>>>
>>>>>>> Jim pointed out offline that there's a case we didn't consider. When
>>>>>>> switching between vCPUs in the same VM, an IBPB may be warranted as the
>>>>>>> tasks in the VM may be in different security domains. E.g. the guest will
>>>>>>> not get a notification that vCPU0 is being swapped out for vCPU1 on a
>>>>>>> single pCPU.
>>>>>>>
>>>>>>> So, sadly, after all that, I think the IBPB needs to stay. But the
>>>>>>> documentation most definitely needs to be updated.
>>>>>>>
>>>>>>> A per-VM capability to skip the IBPB may be warranted, e.g. for
>>>>>>> container-like use cases where a single VM is running a single workload.
>>>>>>
>>>>>> Ah, actually, the IBPB can be skipped if the vCPUs have different
>>>>>> mm_structs, because then the IBPB is fully redundant with respect to any
>>>>>> IBPB performed by switch_mm_irqs_off(). Hrm, though it might need a KVM or
>>>>>> per-VM knob, e.g. just because the VMM doesn't want IBPB doesn't mean the
>>>>>> guest doesn't want IBPB.
>>>>>>
>>>>>> That would also sidestep the largely theoretical question of whether vCPUs
>>>>>> from different VMs but the same address space are in the same security
>>>>>> domain. It doesn't matter, because even if they are in the same domain,
>>>>>> KVM still needs to do IBPB.
>>>>>
>>>>> So should we go back to the earlier approach where we have it be only
>>>>> IBPB on always_ibpb? Or what?
>>>>>
>>>>> At minimum, we need to fix the unilateral-ness of all of this :) since
>>>>> we're IBPB'ing even when the user did not explicitly tell us to.
>>>>
>>>> I think we need separate controls for the guest. E.g. if the userspace VMM
>>>> is sufficiently hardened then it can run without the "do IBPB" flag, but
>>>> that doesn't mean that the entire guest it's running is sufficiently
>>>> hardened.
>>>>
>>>>> That said, since I just re-read the documentation today, it does
>>>>> specifically suggest that if the guest wants to protect *itself* it should
>>>>> turn on IBPB or STIBP (or other mitigations galore), so I think we end up
>>>>> having to think about what our "contract" is with users who host their
>>>>> workloads on KVM - are they expecting us to protect them in any/all cases?
>>>>>
>>>>> Said another way, the internal guest areas of concern aren't something
>>>>> the kernel would always be able to A) identify far in advance and B)
>>>>> always solve on the users' behalf. There is an argument to be made
>>>>> that the guest needs to deal with its own house, yea?
>>>>
>>>> The issue is that the guest won't get a notification if vCPU0 is replaced
>>>> with vCPU1 on the same physical CPU, thus the guest doesn't get an
>>>> opportunity to emit an IBPB. Since the host doesn't know whether or not the
>>>> guest wants IBPB, unless the owner of the host is also the owner of the
>>>> guest workload, the safe approach is to assume the guest is vulnerable.
>>>
>>> Exactly. And if the guest has used taskset as its mitigation strategy,
>>> how is the host to know?
>>
>> Yea, that's fair enough.
>> I posed a solution on Sean's response just as this email came in; would
>> love to know your thoughts (keying off the MSR bitmap).
>>
>
> I don't believe this works. The IBPBs in cond_mitigation (static in
> arch/x86/mm/tlb.c) won't be triggered if the guest has given its
> sensitive tasks exclusive use of their cores. And, if performance is a
> concern, that is exactly what someone would do.

I'm talking about within the guest itself, not the host-level
cond_mitigation. The proposed idea here is to look at the MSR bitmap, which
is populated once the guest has written an IBPB on that vCPU at least once
in its lifetime, on the assumption that a security-minded workload would
indeed configure IBPB. Even with taskset, one would think that a
security-minded user would also set up IBPB to protect itself within the
guest, which is exactly what the Linux admin guide suggests they do (in
spectre.rst).

Taking a step back and going back to the ground floor: what would be (or
should be) the expectation for the guest's own security configuration in
this example? I.e., they are using taskset to assign security domain 0 to
vCPU 0 and security domain 1 to vCPU 1. Would we expect them to always set
up cond_ibpb (and prctl/seccomp) to protect against other casual user space
threads that just might happen to get scheduled onto vCPU 0/1? Or are we
expecting them to configure always_ibpb?

In such a security-minded scenario, what would be the expectation for the
host configuration? If nothing else, I'd want to make sure we get the docs
correct :)

You mentioned performance: are you saying users care so critically about
performance that they are willing to *not* use IBPB at all, and instead
just use taskset, hope nothing else ever gets scheduled on those cores, and
then hope that the hypervisor does the job for them? Would this be the
expectation of just KVM? Or of all hypervisors on the market?
Again, not trying to be a hard head, just trying to wrap my own head around
all of this. I appreciate the time from both you and Sean!

Cheers,
Jon