Re: [RFC PATCH 04/23] x86/cpufeatures: Add SGX1 and SGX2 sub-features

Sean Christopherson <seanjc@xxxxxxxxxx> · Mon, 11 Jan 2021 09:54:17 -0800

On Sat, Jan 09, 2021, Borislav Petkov wrote:
> On Fri, Jan 08, 2021 at 03:55:52PM -0800, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
> > index dc921d76e42e..21f92d81d5a5 100644
> > --- a/arch/x86/kvm/cpuid.h
> > +++ b/arch/x86/kvm/cpuid.h
> > @@ -7,7 +7,25 @@
> >  #include <asm/processor.h>
> >  #include <uapi/asm/kvm_para.h>
> > 
> > -extern u32 kvm_cpu_caps[NCAPINTS] __read_mostly;
> > +/*
> > + * Hardware-defined CPUID leafs that are scattered in the kernel, but need to
> > + * be directly by KVM.  Note, these word values conflict with the kernel's
> > + * "bug" caps, but KVM doesn't use those.
> 
> This feels like another conflict waiting to happen if KVM decides to use
> them at some point...

Yes, but KVM including the bug caps in kvm_cpu_caps is extremely unlikely, and
arguably flat out wrong.  Currently, kvm_cpu_caps includes only CPUID-based
features that can be exposed directly to the guest.  I could see a scenario where
KVM exposed "bug" capabilities to the guest via a paravirt interface, but I
would expect that KVM would either filter and expose the kernel's bug caps
without userspace input, or would add a KVM-defined paravirt CPUID leaf to
enumerate the caps and track _that_ in kvm_cpu_caps.

Anyways, I agree that overlapping the bug caps is a bit of unnecessary
cleverness.  I'm not opposed to incorporating NBUGINTS into KVM, but that would
mean explicitly pulling in even more x86_capability implementation details.

> So let me get this straight: KVM wants to use X86_FEATURE_* which
> means, those numbers must map to the respective words in its CPUID caps
> representation kvm_cpu_caps, AFAICT.

That part is deliberate and isn't a dependency so much as how things are
implemented.  The true dependency is on the bit offsets within each word.  The
kernel could completely rescramble the word numbering and KVM would chug along
happily.  What KVM won't play nice with is if the kernel broke up a hardware-
defined, gathered CPUID leaf/word into scattered features spread out amongst
multiple Linux-defined words.
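
To make the word vs. bit distinction concrete, here is a rough userspace sketch.
It is not kernel code: FEATURE(), FEATURE_BIT(), the FEAT_* names and
reg_has_feature() are made up for illustration.  The only hard facts are the
hardware bit positions (PAE is CPUID.1:EDX[6], SMEP is CPUID.(7,0):EBX[7]).

```c
#include <stdint.h>

/* Illustrative encoding: a feature number is word * 32 + bit. */
#define FEATURE(word, bit)	((word) * 32 + (bit))
#define FEATURE_WORD(f)		((f) / 32)
#define FEATURE_BIT(f)		((f) % 32)

/* The word indices are a Linux-side choice and could be renumbered freely. */
#define FEAT_PAE	FEATURE(0, 6)	/* CPUID.1:EDX bit 6 */
#define FEAT_SMEP	FEATURE(9, 7)	/* CPUID.(EAX=7,ECX=0):EBX bit 7 */

/*
 * Test a feature against a raw CPUID register value for the feature's leaf.
 * Only the bit offset is consumed; the word index plays no role here.
 */
static inline int reg_has_feature(uint32_t reg, unsigned int feature)
{
	return (reg >> FEATURE_BIT(feature)) & 1;
}
```

Moving SMEP to a different word, e.g. FEATURE(3, 7), leaves reg_has_feature()
working.  Moving it to a different bit would not.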

> Then, it wants the leafs to correspond to the hardware leafs layout so
> that it can do:
> 
> 	kvm_cpu_caps[leaf] &= *__cpuid_entry_get_reg(&entry, cpuid.reg);
> 
> which comes straight from CPUID.
> 
> So lemme look at one word:
> 
>         kvm_cpu_cap_mask(CPUID_1_EDX,
>                 F(FPU) | F(VME) | F(DE) | F(PSE) |
>                 F(TSC) | F(MSR) | F(PAE) | F(MCE) |
> 		...
> 
> 
> it would build the bitmask of the CPUID leaf using X86_FEATURE_* bits
> and then mask it out with the hardware leaf read from CPUID.
> 
> But why?
> 
> Why doesn't it simply build those leafs in kvm_cpu_caps from the leafs
> we've already queried?
> 
> Oh it does so a bit earlier:
> 
>         memcpy(&kvm_cpu_caps, &boot_cpu_data.x86_capability,
>                sizeof(kvm_cpu_caps));
> 
> and that kvm_cpu_cap_mask() call is to clear some bits in kvm_cpu_caps
> which is kvm-specific thing (not supported stuff etc).
> 
> But then why does kvm_cpu_cap_mask() does cpuid_count()? Didn't it just
> read the bits from boot_cpu_data.x86_capability? And those bits we do
> query and massage extensively during boot. So why does KVM needs to
> query CPUID again instead of using what we've already queried?

It's mostly historical; before the kvm_cpu_caps concept was introduced, the code
had grown organically to include both boot_cpu_data and raw CPUID info.  The
vast, vast majority of the time, doing CPUID is likely redundant.  But, as noted
in commit d8577a4c238f ("KVM: x86: Do host CPUID at load time to mask KVM cpu
caps"), the code is quite cheap and runs once at KVM load.  My argument back
then was, and still is, that an extra bit of paranoia is justified since the
code and operations are quite nearly free.

This particular dependency can be broken, and quite easily at that.  Rather than
memcpy() boot_cpu_data.x86_capability, it's trivially easy to redefine the F()
macro to invoke boot_cpu_has(), which would allow dropping the memcpy().  The
big downside, and why I didn't post the code, is that doing so means every
feature routed through F() requires some form of BT+Jcc (or CMOVcc) sequence,
whereas the memcpy() approach allows the F() features to be encoded as a single
literal by the compiler.

From a latency perspective, the extra code is negligible.  The big issue is that
all those extra checks add 2k+ bytes of code.  Eliminating the memcpy() doesn't
actually break KVM's dependency on the bit offsets, so we'd be bloating kvm.ko
by a noticeable amount without providing substantial value.
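
To illustrate the tradeoff, a rough userspace sketch of the two flavors.  This
is not the actual KVM code: F_LITERAL(), F_RUNTIME(), host_caps[] and host_has()
are made-up stand-ins for F(), x86_capability and boot_cpu_has().

```c
#include <stdint.h>

#define FEATURE(word, bit)	((word) * 32 + (bit))
#define FEAT_FPU		FEATURE(0, 0)	/* CPUID.1:EDX bit 0 */
#define FEAT_VME		FEATURE(0, 1)	/* CPUID.1:EDX bit 1 */
#define FEAT_DE			FEATURE(0, 2)	/* CPUID.1:EDX bit 2 */

/* Stand-in for the kernel's capability words. */
static uint32_t host_caps[1];

/* Stand-in for a runtime per-feature query. */
static int host_has(unsigned int f)
{
	return (host_caps[f / 32] >> (f % 32)) & 1;
}

/* Literal flavor: each term is a constant, so the OR folds to one immediate. */
#define F_LITERAL(f)	(1u << ((f) % 32))

/* Runtime flavor: each term needs its own test-and-select at runtime. */
#define F_RUNTIME(f)	(host_has(f) ? F_LITERAL(f) : 0u)

/* Copy the host word first, then AND with a compile-time constant mask. */
static uint32_t caps_via_copy(void)
{
	uint32_t word = host_caps[0];

	return word & (F_LITERAL(FEAT_FPU) | F_LITERAL(FEAT_VME) |
		       F_LITERAL(FEAT_DE));
}

/* No copy, but one runtime check per feature. */
static uint32_t caps_via_runtime(void)
{
	return F_RUNTIME(FEAT_FPU) | F_RUNTIME(FEAT_VME) | F_RUNTIME(FEAT_DE);
}
```

Both functions produce the same mask.  Note that F_RUNTIME() still shifts by the
feature's bit offset, i.e. the dependency on the offsets is still there.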

And, this behavior is mostly opportunistic; the true justification/motivation
for taking a dependency on the X86_FEATURE_* bit offsets is for communication
with userspace, querying the guest CPU model, and runtime checks.

> Maybe I'm missing something kvm-specific.
> 
> In any case, this feels somewhat weird: you have *_cpu_has() on
> baremetal abstracting almost completely from CPUID by collecting all
> feature bits it needs into its own structure - x86_capability[] along
> with accessors for it - and then you want to "abstract back" to CPUID
> leafs from that interface. I wonder why.

It's effectively for communication with userspace.  Userspace, via ioctl(),
dictates the vCPU model to KVM, including the exact CPUID results.  To properly
virtualize/emulate the defined vCPU model, KVM must query the dictated CPUID
results to determine what features are supported, what guest operations
should fault, etc...  E.g. if the vCPU model, via CPUID, states that SMEP isn't
supported then KVM needs to inject a #GP if the guest attempts to set CR4.SMEP.
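
A minimal sketch of that SMEP example, assuming a made-up struct vcpu_model
that holds the userspace-provided CPUID.(EAX=7,ECX=0):EBX value.  The helper
names are hypothetical and this is not KVM's real CR4 handling; only the bit
positions (CPUID.(7,0):EBX[7] and CR4[20]) come from the architecture.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUID_7_0_EBX_SMEP	(1u << 7)	/* SMEP enumeration bit */
#define CR4_SMEP		(1ull << 20)	/* CR4.SMEP control bit */

/* Hypothetical container for the CPUID results dictated by userspace. */
struct vcpu_model {
	uint32_t cpuid_7_0_ebx;
};

/* Does the guest's CPUID, as set by userspace, enumerate SMEP? */
static bool guest_has_smep(const struct vcpu_model *m)
{
	return m->cpuid_7_0_ebx & CPUID_7_0_EBX_SMEP;
}

/* Returns true if the CR4 write should be rejected with a #GP. */
static bool cr4_write_should_gp(const struct vcpu_model *m, uint64_t new_cr4)
{
	return (new_cr4 & CR4_SMEP) && !guest_has_smep(m);
}
```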

KVM also uses the hardware-defined CPUID ABI to advertise which features are
supported by both hardware and KVM.  This is the kvm_cpu_cap stuff, where KVM
reads boot_cpu_data to see what features were enabled by the kernel.
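
Roughly, the advertised word ends up being an intersection.  The sketch below
is illustrative only: advertised_caps() and the BIT_* names are made up, and the
bit positions are those of CPUID.(EAX=7,ECX=0):EBX.

```c
#include <stdint.h>

/* Hardware-defined bit positions in CPUID.(EAX=7,ECX=0):EBX. */
#define BIT_FSGSBASE	(1u << 0)
#define BIT_SGX		(1u << 2)
#define BIT_SMEP	(1u << 7)

/*
 * kernel_word:   what the kernel enabled (stand-in for boot_cpu_data)
 * kvm_mask:      what KVM is able to expose to a guest
 * raw_cpuid_reg: what CPUID reports on the host (the paranoid check)
 */
static uint32_t advertised_caps(uint32_t kernel_word, uint32_t kvm_mask,
				uint32_t raw_cpuid_reg)
{
	uint32_t caps = kernel_word;

	caps &= kvm_mask;
	caps &= raw_cpuid_reg;
	return caps;
}
```

Because all three inputs share the hardware bit layout, no per-feature
translation is needed, just two ANDs.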

It would be possible for KVM to break the dependency on X86_FEATURE_* bit
offsets by defining a translation layer, but I strongly feel that adding manual
translations will do more harm than good as it increases the odds of us botching
a translation or using the wrong feature flag, creates potential namespace
conflicts, etc...