Re: KVM's sloppiness wrt IA32_SPEC_CTRL and IA32_PRED_CMD

Jim Mattson <jmattson@xxxxxxxxxx> · Thu, 20 Jul 2023 10:52:44 -0700

On Thu, Jul 20, 2023 at 1:04 AM Chao Gao <chao.gao@xxxxxxxxx> wrote:
>
> On Wed, Jul 19, 2023 at 09:04:58PM -0700, Jim Mattson wrote:
> >On Wed, Jul 19, 2023 at 6:58 PM Chao Gao <chao.gao@xxxxxxxxx> wrote:
> >>
> >> On Thu, Jul 20, 2023 at 09:25:14AM +0800, Xiaoyao Li wrote:
> >> >On 7/20/2023 2:08 AM, Jim Mattson wrote:
> >> >> Normally, we would restrict guest MSR writes based on guest CPU
> >> >> features. However, with IA32_SPEC_CTRL and IA32_PRED_CMD, this is not
> >> >> the case.
> >>
> >> This issue isn't specific to the two MSRs. Any MSRs that are not
> >> intercepted and with some reserved bits for future extenstions may run
> >> into this issue. Right? IMO, it is a conflict of interests between
> >> disabling MSR write intercept for less VM-exits and host's control over
> >> the value written to the MSR by guest.
> >
> >I've clearly been falling behind in tracking upstream development. I
> >didn't realize that we passed through any other MSRs that had bits
> >reserved for future extensions (virtual addresses don't count). It
> >looks like we've decided to virtualize IA32_FLUSH_CMD as well, even
> >though Konrad had the good sense to talk me out of it when I first
> >proposed it. Are there others I'm missing?
>
> MSR_IA32_XFD{_ERR}, I think
>
> SDM says:
> Bit i of either MSR can be set to 1 only if CPUID.(EAX=0DH,ECX=i):ECX[2]
> is enumerated as 1 (see Section 13.2). An execution of WRMSR that
> attempts to set an unsupported bit in either MSR causes a
> general-protection fault (#GP)
>
> >
> >Philosophically, there are three principles potentially in conflict
> >here: security, correctness, and performance. Userspace should perhaps
> >be given the option of prioritizing one over the others, but the
> >default precedence should be security first, correctness second, and
> >performance last.
>
> I am not sure about the default precedence. I can name a few other cases
> in which KVM behavior doesn't align with this precedence.
> 1. new instructions w/o control bits in CR0/4

We've come a long way since Popek & Goldberg, but that's an example of
why the x86 architecture still isn't virtualizable. I don't think
anything can be done about that.

> 2. CR3.LAM_U57/U48 can always be set by guests if KVM doesn't trap CR3
>    writes, e.g., when EPT is enabled. This case is identical to the PSFD
>    case you mentioned below.

LAM is another feature I haven't been following. It's a bit sad that
the x86 vendors continue to introduce new features that restrict the
host platforms that can share a migration pool.

> 3. GBPAGES*

Is there an Intel CPU that isn't well past EOL that doesn't support GBPAGES?

> *: https://lore.kernel.org/all/20230217231022.816138-3-seanjc@xxxxxxxxxx/
>
> >
> >> We may need something like CR0/CR4 masks and read shadows for all MSRs
> >> to address this fundamental issue.
> >
> >Not *all* MSRs, but some, certainly. That is one possible solution,
> >but I get the impression that you're not really serious about this
> >proposal.
>
> I meant we need a generic mechanism applicable to all MSRs. There are up
> to 2K MSRs (MSR-bitmap is 4KB). Then adding a mask and read shadow e.g.,
> 16 Bytes for each MSR needs about 32KB. I don't think it is completely
> unacceptable because, IIRC, IPI virtualization takes up to 64KB. To
> reduce the memory consumption, we can even let CPU consume a list of MSR
> index, mask and read shadow, like VM-entry/exit "MSR-load" count/address.

I assume the quoted IPI virtualization overhead is per-VM. We would
need a read shadow per vCPU.

> >> >>
> >> >> For the first non-zero write to IA32_SPEC_CTRL, we check to see that
> >> >> the host supports the value written. We don't care whether or not the
> >> >> guest supports the value written (as long as it supports the MSR).
> >> >> After the first non-zero write, we stop intercepting writes to
> >> >> IA32_SPEC_CTRL, so the guest can write any value supported by the
> >> >> hardware. This could be problematic in heterogeneous migration pools.
> >> >> For instance, a VM that starts on a Cascade Lake host may set
> >> >> IA32_SPEC_CTRL.PSFD[bit 7], even if the guest
> >> >> CPUID.(EAX=07H,ECX=02H):EDX.PSFD[bit 0] is clear. Then, if that VM is
> >> >> migrated to a Skylake host, KVM_SET_MSRS will refuse to set
> >> >> IA32_SPEC_CTRL to its current value, because Skylake doesn't support
> >> >> PSFD.
> >>
> >> It is a guest fault. Can we modify guest kernel in this case?
> >
> >The guest should not have set the bit. The hypervisor should not have
> >allowed it. Both are at fault.
> >
> >I'm willing to bet that Intel has a CPU validation suite that includes
> >such tests as setting reserved bits in MSRs and ensuring that #GP is
> >raised. Those tests should also work in a VM. If they don't, the
> >hypervisor is broken.
>
> I agree hypervisor is broken in this specific case. I just doubt if it
> is worthwhile to fix this issue i.e., the benefit is significant. I am
> assuing the benefit of fixing the issue is just guests won't be able to
> set reserved bits and attempts to do that cause #GP.

I'm not entirely convinced that setting bits in IA32_SPEC_CTRL is (and
always will be) "safe," from a security perspective. *Clearing* bits
certainly isn't.

> And is it fair to good citizens that won't set reserved bits but will
> suffer performance drop caused by the fix?

Is it fair to other tenants of the host to have their data exfiltrated
by a bad citizen, because KVM didn't control access to the MSR?
> >
> >> >>
> >> >> We disable write intercepts IA32_PRED_CMD as long as the guest
> >> >> supports the MSR. That's fine for now, since only one bit of PRED_CMD
> >> >> has been defined. Hence, guest support and host support are
> >> >> equivalent...today. But, are we really comfortable with letting the
> >> >> guest set any IA32_PRED_CMD bit that may be defined in the future?
> >> >>
> >> >> The same question applies to IA32_SPEC_CTRL. Are we comfortable with
> >> >> letting the guest write to any bit that may be defined in the future?
> >> >
> >> >My point is we need to fix it, though Chao has different point that sometimes
> >> >performance may be more important[*]
> >> >
> >> >[*] https://lore.kernel.org/all/ZGdE3jNS11wV+V2w@chao-email/
> >>
> >> Maybe KVM can provide options to QEMU. e.g., we can define a KVM quirk.
> >> Disabling the quirk means always intercept IA32_SPEC_CTRL MSR writes.
> >
> >Alternatively, we can check the host value of IA32_SPEC_CTRL on
> >VM-entry, since we have to read it anyway. If any bits are set that
> >cannot be cleared in VMX non-root operation without compromising
> >security, then writes to IA32_SPEC_CTRL should be intercepted.
> >
> >> >
> >> >> At least the AMD approach with V_SPEC_CTRL prevents the guest from
> >> >> clearing any bits set by the host, but on Intel, it's a total
> >> >> free-for-all. What happens when a new bit is defined that absolutely
> >> >> must be set to 1 all of the time?
> >>
> >> I suppose there is no such bit now. For SPR and future CPUs, "virtualize
> >> IA32_SPEC_CTRL" VMX feature can lock some bits to 0 or 1 regardless of
> >> the value written by guests.
> >
> >As your colleague pointed out earlier, IA32_SPEC_CTRL.STIBP[bit 1] is
> >such a bit. If the host has this bit set and you allow the guest to
> >clear it, then you have compromised host security.
>
> If guest can compromise host security, I definitly agree to intercept
> IA32_SPEC_CTRL MSR.

I believe that when the decision was made to pass through this MSR for
write, the assumption was that the host wouldn't ever use it (hence
the host value would be zero). That assumption has not stood the test
of time.