Re: [PATCH v3 1/4] KVM: x86: relax canonical check for some x86 architectural msrs

Sean Christopherson <seanjc@xxxxxxxxxx> · Wed, 21 Aug 2024 09:04:35 -0700

On Wed, Aug 21, 2024, mlevitsk@xxxxxxxxxx wrote:
> У вт, 2024-08-20 у 15:13 +0300, mlevitsk@xxxxxxxxxx пише:
> > У пт, 2024-08-16 у 14:49 -0700, Sean Christopherson пише:
> > > > > > On Thu, Aug 15, 2024, Maxim Levitsky wrote:
> > > > > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > > > > > index ce7c00894f32..2e83f7d74591 100644
> > > > > > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > > > > > @@ -302,6 +302,31 @@ const struct kvm_stats_header kvm_vcpu_stats_header = {
> > > > > > > > > >                        sizeof(kvm_vcpu_stats_desc),
> > > > > > > > > >  };
> > > > > > > > > >  
> > > > > > > > > > +
> > > > > > > > > > +/*
> > > > > > > > > > + * Most x86 arch MSR values which contain linear addresses like
> > > > > > 
> > > > > > Is it most, or all?  I'm guessing all?
> > 
> > I can't be sure that all of them are like that - there could be some
> > outliers that behave differently.
> > 
> > One of the things my work at Intel taught me is that there is nothing
> > consistent in x86 spec, anything is possible and nothing can be assumed.
> > 
> > I dealt only with those msrs, that KVM checks for canonicality, therefore I
> > use the word  'most'. There could be other msrs that are not known to me
> > and/or to KVM.
> > 
> > I can write 'some' if you prefer.
> 
> Hi,
> 
> 
> So I did some more reverse engineering and indeed, 'some' is the right word:

Is it?  IIUC, we have yet to find an MSR that honors, CR4.LA57, i.e. it really
is "all", so far as we know.

> I audited all places in KVM which check an linear address for being canonical
> and this is what I found:
> 
> - MSR_IA32_BNDCFGS - since it is not supported on CPUs with 5 level paging,
>   its not possible to know what the hardware does.

Heh, yeah, but I would be very surprised if MSR_IA32_BNDCFGS didn't follow all
other system-ish MSRs.

> - MSR_IA32_DS_AREA: - Ignores CR4.LA57 as expected. Tested by booting into kernel
>   with 5 level paging disabled and then using userspace 'wrmsr' program to
>   set this msr.  I attached the bash script that I used
> 
> - MSR_IA32_RTIT_ADDR0_A ... MSR_IA32_RTIT_ADDR3_B: - Exactly the same story,
>   but for some reason the host doesn't suport (not even read) from
>   MSR_IA32_RTIT_ADDR2_*, MSR_IA32_RTIT_ADDR3_*.  Probably the system is not new
>   enough for these.
>
> - invpcid instruction. It is exposed to the guest without interception
>   (unless !npt or !ept), and yes, it works just fine on 57-canonical address
>   without CR4.LA57 set....

Did you verify the behavior for the desciptor, the target, or both?  I assume
the memory operand, i.e. the address of the _descriptor_, honors CR4.LA57, but
the target within the descriptor does not.

If that assumption is correct, then this code in vm_mmu_invalidate_addr() is broken,
as KVM actually needs to do a TLB flush if the address is canonical for the vCPU
model, even if it's non-canonical for the current vCPU state.

	/* It's actually a GPA for vcpu->arch.guest_mmu.  */
	if (mmu != &vcpu->arch.guest_mmu) {
		/* INVLPG on a non-canonical address is a NOP according to the SDM.  */
		if (is_noncanonical_address(addr, vcpu))
			return;

		kvm_x86_call(flush_tlb_gva)(vcpu, addr);
	}

Assuming INVPCID is indicative of how INVLPG and INVVPID behave (they are nops if
the _target_ is non-canonical), then the code is broken for INVLPG and INVVPID.

And I think it's probably a safe assumption that a TLB flush is needed.  E.g. the
primary (possible only?) use case for INVVPID with a linear address is to flush
TLB entries for a specific GVA=>HPA mapping, and honoring CR4.LA57 would prevent
shadowing 5-level paging with a hypervisor that is using 4-level paging for itself.

> - invvpid - this one belongs to VMX set, so technically its for nesting
>   although it is run by L1, it is always emulated by KVM, but still executed on
>   the host just with different vpid, so I booted the host without 5 level
>   paging, and patched KVM to avoid canonical check.
> 
>   Also 57-canonical adddress worked just fine, and fully non canonical
>   address failed.  and gave a warning in 'invvpid_error'
>
> Should I fix all of these too?

Yeah, though I believe we're at the point where we need to figure out a better
naming scheme, because usage of what is currently is_noncanonical_address() will
be, by far, in the minority.

Hmm, actually, what if we extend the X86EMUL_F_* flags that were added for LAM
(and eventually for LASS) to handle the non-canonical checks?  That's essentially
what LAM does already, there are just a few more flavors we now need to handle.

E.g. I think we just need flags for MSRs and segment/DT bases.  The only (or at
least, most) confusing thing is that LAM/LASS do NOT apply to INVPLG accesses,
but are exempt from LA57.  But that's an arch oddity, not a problem KVM can solve.

diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
index 55a18e2f2dcd..6da03a37bdd5 100644
--- a/arch/x86/kvm/kvm_emulate.h
+++ b/arch/x86/kvm/kvm_emulate.h
@@ -94,6 +94,8 @@ struct x86_instruction_info {
 #define X86EMUL_F_FETCH                BIT(1)
 #define X86EMUL_F_IMPLICIT             BIT(2)
 #define X86EMUL_F_INVLPG               BIT(3)
+#define X86EMUL_F_MSR                  BIT(4)
+#define X86EMUL_F_BASE                 BIT(5)
 
 struct x86_emulate_ops {
        void (*vm_bugged)(struct x86_emulate_ctxt *ctxt);
---

And then with that, we can do the below, and have emul_is_noncanonical_address()
redirect to is_noncanonical_address() instead of being an open coded equivalent.

---
static inline u8 vcpu_virt_addr_bits(struct kvm_vcpu *vcpu)
{
	return kvm_is_cr4_bit_set(vcpu, X86_CR4_LA57) ? 57 : 48;
}

static inline u8 max_host_virt_addr_bits(void)
{
	return kvm_cpu_cap_has(X86_FEATURE_LA57) ? 57 : 48;
}

static inline bool is_noncanonical_address(u64 la, struct kvm_vcpu *vcpu,
					   unsigned int flags)
{
	if (flags & (X86EMUL_F_INVLPG | X86EMUL_F_MSR | X86EMUL_F_DT_LOAD))
		return !__is_canonical_address(la, max_host_virt_addr_bits());
	else
		return !__is_canonical_address(la, vcpu_virt_addr_bits(vcpu));
}
---

That will make it _much_ harder to incorrectly use is_noncanonical_address(),
as all callers will be forced to specify the emulation type, i.e. there is no
automatic, implied default type.

Line lengths could get annoying, but with per-type flags, we could do selectively
add a few wrappers, e.g.

---
static inline bool is_noncanonical_msr_address(u64 la, struct kvm_vcpu *vcpu)
{
	return is_noncanonical_address(la, vcpu, X86EMUL_F_MSR);
}

static inline bool is_noncanonical_base_address(u64 la, struct kvm_vcpu *vcpu)
{
	return is_noncanonical_address(la, vcpu, X86EMUL_F_BASE);
}

static inline bool is_noncanonical_invlpg_address(u64 la, struct kvm_vcpu *vcpu)
{
	return is_noncanonical_address(la, vcpu, X86EMUL_F_INVLPG);
}
---

We wouldn't want wrapper for everything, e.g. to minimize the risk of creating a
de factor implicit default, but I think those three, and maybe a code/fetch
variant, will cover all but a few users.

> About fixing the emulator this is what see:
> 
> 	emul_is_noncanonical_address
> 		__load_segment_descriptor
> 			load_segment_descriptor
> 				em_lldt
> 				em_ltr
> 
> 		em_lgdt_lidt
> 
> 
> 
> While em_lgdt_lidt should be easy to fix because it calls
> emul_is_noncanonical_address directly,

Those don't need to be fixed, they are validating the memory operand, not the
base of the descriptor, i.e. aren't exempt from CR4.LA57.

> the em_lldt, em_ltr will be harder
> because these use load_segment_descriptor which calls
> __load_segment_descriptor which in turn is also used for emulating of far
> jumps/calls/rets, for which I do believe that canonical check does respect
> CR4.LA57, but can't be sure either.

I'm fairly certain this is a non-issue.  CS, DS, ES, and SS have a fixed base of
'0' in 64-bit mode, i.e. are completely exempt from canonical checks because the
base address is always ignored.  And while FS and GS do have base addresses, the
segment descriptors themselves can only load 32-bit bases, i.e. _can't_ generate
non-canonical addresses.

There _is_ an indirect canonical check on the _descriptor_ (the actual descriptor
pointed at by the selector, not the memory operand).  The SDM is calls this case
out in the LFS/LDS docs:

  If the FS, or GS register is being loaded with a non-NULL segment selector and
  any of the following is true: the segment selector index is not within descriptor
  table limits, the memory address of the descriptor is non-canonical
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
and similarly, when using a CALL GATE, the far transfer docs say:

  If the segment descriptor from a 64-bit call gate is in non-canonical space.

And given that those implicit accesses are not subjected to LAM/LASS, I strongly
suspect they honor CR4.LA57.  So the explicit emul_is_noncanonical_address()
check in __load_segment_descriptor() needs to be tagged X86EMUL_F_BASE, but
otherwise it all should Just Work (knock wood).

> It is possible that far jumps/calls/rets also ignore CR4.LA57, and instead
> set RIP to non canonical instruction, and then on first fetch, #GP happens.

I doubt this is the case for the final RIP check, especially since ucode does
check vmcs.HOST_RIP against vmcs.HOST_CR4.LA57, but it's worth testing to confirm.

> I'll setup another unit test for this. RIP of the #GP will determine if the
> instruction failed or the next fetch.