On Fri, Apr 29, 2022 at 11:23:32PM +0000, Sean Christopherson wrote: > The host kernel is protected via RETPOLINE and by flushing the RSB immediately > after VM-Exit. Ah, right. > I don't know definitively. My guess is that IBPB is far too costly to do on every > exit, and so the onus was put on userspace to recompile with RETPOLINE. What I > don't know is why it wasn't implemented as an opt-out feature. Or, we could add the same logic on the exit path as in cond_mitigation() and test for LAST_USER_MM_IBPB when the host has selected switch_mm_cond_ibpb and thus allows for certain guests to be protected... Although, that use case sounds kinda meh: AFAIU, the attack vector here would be, protecting the guest from a malicious kernel. I guess this might matter for encrypted guests tho. > I'll write up the bits I have my head wrapped around. That's nice, thanks! > I don't know of any actual examples. But, it's trivially easy to create multiple > VMs in a single process, and so proving the negative that no one runs multiple VMs > in a single address space is essentially impossible. > > The container thing is just one scenario I can think of where userspace might > actually benefit from sharing an address space, e.g. it would allow backing the > image for large number of VMs with a single set of read-only VMAs. Why I keep harping about this: so let's say we eventually add something and then months, years from now we cannot find out anymore why that thing was added. We will likely remove it or start wasting time figuring out why that thing was added in the first place. This very questioning keeps popping up almost on a daily basis during refactoring so I'd like for us to be better at documenting *why* we're doing a certain solution or function or whatever. And this is doubly important when it comes to the hw mitigations because if you look at the problem space and all the possible ifs and whens and but(t)s, your head will spin in no time. So I'm really sceptical when there's not even a single actual use case to a proposed change. So Jon said something about oversubscription and a lot of vCPU switching. That should be there in the docs as the use case and explaining why dropping IBPB during vCPU switches is redundant. > I truly have no idea, which is part of the reason I brought it up in the first > place. I'd have happily just whacked KVM's IBPB entirely, but it seemed prudent > to preserve the existing behavior if someone went out of their way to enable > switch_mm_always_ibpb. So let me try to understand this use case: you have a guest and a bunch of vCPUs which belong to it. And that guest gets switched between those vCPUs and KVM does IBPB flushes between those vCPUs. So either I'm missing something - which is possible - but if not, that "protection" doesn't make any sense - it is all within the same guest! So that existing behavior was silly to begin with so we might just as well kill it. > Yes, or do it iff switch_mm_always_ibpb is enabled to maintain "compability". Yap, and I'm questioning the even smallest shred of reasoning for having that IBPB flush there *at* *all*. And here's the thing with documenting all that: we will document and say, IBPB between vCPU flushes is non-sensical. Then, when someone comes later and says, "but but, it makes sense because of X" and we hadn't thought about X at the time, we will change it and document it again and this way you'll have everything explicit there, how we arrived at the current situation and be able to always go, "ah, ok, that's why we did it." I hope I'm making some sense... -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette