> On May 10, 2022, at 10:22 AM, Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> On Sat, Apr 30, 2022, Jon Kohler wrote:
>>
>>> On Apr 30, 2022, at 5:50 AM, Borislav Petkov <bp@xxxxxxxxx> wrote:
>>> So let me try to understand this use case: you have a guest and a bunch
>>> of vCPUs which belong to it. And that guest gets switched between those
>>> vCPUs and KVM does IBPB flushes between those vCPUs.
>>>
>>> So either I'm missing something - which is possible - but if not, that
>>> "protection" doesn't make any sense - it is all within the same guest!
>>> So that existing behavior was silly to begin with so we might just as
>>> well kill it.
>>
>> Close, it's not one guest with a bunch of vCPUs, it's a bunch of guests
>> with a small number of vCPUs each; that's the small nuance here, and one
>> of the reasons why this was hard to see from the beginning.
>>
>> AFAIK, the KVM IBPB is avoided when switching between vCPUs belonging
>> to the same vmcs/vmcb (i.e. the same guest), e.g. you could have one VM
>> highly oversubscribed to the host and you wouldn't see either the KVM
>> IBPB or the switch_mm() IBPB. All good.
>
> No, KVM does not avoid IBPB when switching between vCPUs in a single VM.
> Every vCPU has a separate VMCS/VMCB, and so the scenario described above
> where a single VM has a bunch of vCPUs running on a limited set of
> logical CPUs will emit IBPB on every single switch.

Ah! Right, ok, thanks for helping me get my wires uncrossed there; I was
getting confused by the nested optimization made in commit 5c911beff
("KVM: nVMX: Skip IBPB when switching between vmcs01 and vmcs02").

So the only time we'd *not* issue IBPB is if the current per-vCPU
vmcs/vmcb is still loaded in the non-nested case, or when switching
between vmcs01 and vmcs02 (i.e. between L1 and L2 of the same vCPU) in
the nested case.

Walking through my thoughts again here with this fresh in my mind:

In that example, say guest A has vCPU0 and vCPU1 and the host has to
switch between the two on the same pCPU. That switch doesn't go through
switch_mm() because both vCPU threads share the same mm_struct.
However, I'd wager that if you had an attacker in the guest executing
an attack on vCPU0 with the intent of attacking vCPU1 (which is up to
run next), you'd have far bigger problems: that would imply the guest
is already completely compromised, so why would they waste time on a
complex prediction attack when they have that level of system access in
the first place?

Going back to the original commit documentation that Boris called out,
which specifically says:

 * Mitigate guests from being attacked by other guests.
   - This is addressed by issuing IBPB when we do a guest switch.

Since you need to go through switch_mm() to change mm_struct from
guest A to guest B, it makes no sense to issue the barrier in KVM as
well: the kernel is already giving that protection "for free" (from
KVM's perspective), because the guest-to-guest transition is already
covered by cond_mitigation().

That would apply equally to switches in both the nested and non-nested
cases, because switch_mm() has to be called when switching between
guests.
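
To make that redundancy concrete, here is a quick userspace sketch (not
kernel code; names like vcpu_ctx and kvm_load_vcpu are made up for
illustration) that models the two IBPB decision points discussed above:
KVM's barrier on VMCS/VMCB switch (the shape of vmx_vcpu_load_vmcs()
after 5c911beff) and the switch_mm() path (cond_mitigation() in its
simplest "always IBPB" flavor). For an intra-guest vCPU switch only the
KVM-side barrier fires; for a guest-to-guest switch both fire:

/* ibpb_model.c - toy model, NOT kernel code. Build: gcc ibpb_model.c */
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a vCPU: which guest it belongs to (its
 * mm_struct, effectively) and its per-vCPU VMCS/VMCB identity. */
struct vcpu_ctx {
	int vm_id;
	int vmcs_id;
};

static int loaded_vmcs_id = -1;	/* per-pCPU: currently loaded VMCS  */
static int last_user_mm = -1;	/* per-pCPU: last user mm that ran  */

static void ibpb(const char *who)
{
	printf("  IBPB issued by %s\n", who);
}

/*
 * Models KVM's barrier on VMCS/VMCB switch: barrier only when a
 * different VMCS gets loaded, and since every vCPU has its own
 * VMCS/VMCB, any vCPU switch counts.
 */
static void kvm_load_vcpu(const struct vcpu_ctx *v, bool kvm_ibpb)
{
	if (v->vmcs_id != loaded_vmcs_id) {
		loaded_vmcs_id = v->vmcs_id;
		if (kvm_ibpb)
			ibpb("KVM (VMCS switch)");
	}
}

/*
 * Models the switch_mm() path: barrier only when the mm actually
 * changes, i.e. only on guest-to-guest transitions.
 */
static void switch_mm_to(const struct vcpu_ctx *v)
{
	if (v->vm_id != last_user_mm) {
		last_user_mm = v->vm_id;
		ibpb("switch_mm()/cond_mitigation()");
	}
}

static void run(const struct vcpu_ctx *v, bool kvm_ibpb)
{
	printf("run VM%d vCPU (vmcs %d):\n", v->vm_id, v->vmcs_id);
	switch_mm_to(v);	/* context switch to the vCPU thread */
	kvm_load_vcpu(v, kvm_ibpb);	/* then KVM loads the vCPU   */
}

int main(void)
{
	const struct vcpu_ctx a0 = { .vm_id = 0, .vmcs_id = 0 };
	const struct vcpu_ctx a1 = { .vm_id = 0, .vmcs_id = 1 };
	const struct vcpu_ctx b0 = { .vm_id = 1, .vmcs_id = 2 };

	/* A.vCPU0 -> A.vCPU1 -> B.vCPU0 on one pCPU */
	run(&a0, true);
	run(&a1, true);	/* same guest: only the KVM barrier fires    */
	run(&b0, true);	/* new guest: switch_mm() already fires one  */

	return 0;
}

With the KVM-side barrier removed (pass false above), the A -> B
transition is still covered by the switch_mm() barrier, and the
intra-guest A.vCPU0 -> A.vCPU1 switch loses only the barrier argued
above to be unnecessary. Obviously the real cond_mitigation() has extra
conditions (static keys for the conditional vs. always modes,
dumpability checks), but the mm-must-change requirement is the part
that matters for this argument.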