Re: [PATCH v3 1/4] KVM: x86: relax canonical check for some x86 architectural msrs

mlevitsk@xxxxxxxxxx · Fri, 23 Aug 2024 14:14:21 +0300

У ср, 2024-08-21 у 09:04 -0700, Sean Christopherson пише:
> > On Wed, Aug 21, 2024, mlevitsk@xxxxxxxxxx wrote:
> > > > У вт, 2024-08-20 у 15:13 +0300, mlevitsk@xxxxxxxxxx пише:
> > > > > > У пт, 2024-08-16 у 14:49 -0700, Sean Christopherson пише:
> > > > > > > > > > > > > > On Thu, Aug 15, 2024, Maxim Levitsky wrote:
> > > > > > > > > > > > > > > > > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > > > > > > > > > > > > > > > > > index ce7c00894f32..2e83f7d74591 100644
> > > > > > > > > > > > > > > > > > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > > > > > > > > > > > > > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > > > > > > > > > > > > > > > > > @@ -302,6 +302,31 @@ const struct kvm_stats_header kvm_vcpu_stats_header = {
> > > > > > > > > > > > > > > > > > > > > >                        sizeof(kvm_vcpu_stats_desc),
> > > > > > > > > > > > > > > > > > > > > >  };
> > > > > > > > > > > > > > > > > > > > > >  
> > > > > > > > > > > > > > > > > > > > > > +
> > > > > > > > > > > > > > > > > > > > > > +/*
> > > > > > > > > > > > > > > > > > > > > > + * Most x86 arch MSR values which contain linear addresses like
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Is it most, or all?  I'm guessing all?
> > > > > > 
> > > > > > I can't be sure that all of them are like that - there could be some
> > > > > > outliers that behave differently.
> > > > > > 
> > > > > > One of the things my work at Intel taught me is that there is nothing
> > > > > > consistent in x86 spec, anything is possible and nothing can be assumed.
> > > > > > 
> > > > > > I dealt only with those msrs, that KVM checks for canonicality, therefore I
> > > > > > use the word  'most'. There could be other msrs that are not known to me
> > > > > > and/or to KVM.
> > > > > > 
> > > > > > I can write 'some' if you prefer.
> > > > 
> > > > Hi,
> > > > 
> > > > 
> > > > So I did some more reverse engineering and indeed, 'some' is the right word:
> > 
> > Is it?  IIUC, we have yet to find an MSR that honors, CR4.LA57, i.e. it really
> > is "all", so far as we know.

This is more or less what I meant to say: I meant to say that I verified only 'some'
of the msrs/instructions, and for others I can't know, although it's likely that
indeed our theory is correct.

> > 
> > > > I audited all places in KVM which check an linear address for being canonical
> > > > and this is what I found:
> > > > 
> > > > - MSR_IA32_BNDCFGS - since it is not supported on CPUs with 5 level paging,
> > > >   its not possible to know what the hardware does.
> > 
> > Heh, yeah, but I would be very surprised if MSR_IA32_BNDCFGS didn't follow all
> > other system-ish MSRs.
> > 
> > > > - MSR_IA32_DS_AREA: - Ignores CR4.LA57 as expected. Tested by booting into kernel
> > > >   with 5 level paging disabled and then using userspace 'wrmsr' program to
> > > >   set this msr.  I attached the bash script that I used
> > > > 
> > > > - MSR_IA32_RTIT_ADDR0_A ... MSR_IA32_RTIT_ADDR3_B: - Exactly the same story,
> > > >   but for some reason the host doesn't suport (not even read) from
> > > >   MSR_IA32_RTIT_ADDR2_*, MSR_IA32_RTIT_ADDR3_*.  Probably the system is not new
> > > >   enough for these.
> > > > 
> > > > - invpcid instruction. It is exposed to the guest without interception
> > > >   (unless !npt or !ept), and yes, it works just fine on 57-canonical address
> > > >   without CR4.LA57 set....
> > 
> > Did you verify the behavior for the desciptor, the target, or both?  I assume
> > the memory operand, i.e. the address of the _descriptor_, honors CR4.LA57, but
> > the target within the descriptor does not.

I verified only the target address. I assume that memory fetches do have to honor the CR.LA57.

> > 
> > If that assumption is correct, then this code in vm_mmu_invalidate_addr() is broken,
> > as KVM actually needs to do a TLB flush if the address is canonical for the vCPU
> > model, even if it's non-canonical for the current vCPU state.
> > 
> >         /* It's actually a GPA for vcpu->arch.guest_mmu.  */
> >         if (mmu != &vcpu->arch.guest_mmu) {
> >                 /* INVLPG on a non-canonical address is a NOP according to the SDM.  */
> >                 if (is_noncanonical_address(addr, vcpu))
> >                         return;
> > 
> >                 kvm_x86_call(flush_tlb_gva)(vcpu, addr);
> >         }
> > 
> > Assuming INVPCID is indicative of how INVLPG and INVVPID behave (they are nops if
> > the _target_ is non-canonical), then the code is broken for INVLPG and INVVPID.
> > 
> > And I think it's probably a safe assumption that a TLB flush is needed.  E.g. the
> > primary (possible only?) use case for INVVPID with a linear address is to flush
> > TLB entries for a specific GVA=>HPA mapping, and honoring CR4.LA57 would prevent
> > shadowing 5-level paging with a hypervisor that is using 4-level paging for itself.

I also think so.

> > 
> > > > - invvpid - this one belongs to VMX set, so technically its for nesting
> > > >   although it is run by L1, it is always emulated by KVM, but still executed on
> > > >   the host just with different vpid, so I booted the host without 5 level
> > > >   paging, and patched KVM to avoid canonical check.
> > > > 
> > > >   Also 57-canonical adddress worked just fine, and fully non canonical
> > > >   address failed.  and gave a warning in 'invvpid_error'
> > > > 
> > > > Should I fix all of these too?
> > 
> > Yeah, though I believe we're at the point where we need to figure out a better
> > naming scheme, because usage of what is currently is_noncanonical_address() will
> > be, by far, in the minority.

Yes, I will take this into account.

> > 
> > Hmm, actually, what if we extend the X86EMUL_F_* flags that were added for LAM
> > (and eventually for LASS) to handle the non-canonical checks?  That's essentially
> > what LAM does already, there are just a few more flavors we now need to handle.
> > 
> > E.g. I think we just need flags for MSRs and segment/DT bases.  The only (or at
> > least, most) confusing thing is that LAM/LASS do NOT apply to INVPLG accesses,
> > but are exempt from LA57.  But that's an arch oddity, not a problem KVM can solve.
> > 
> > diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
> > index 55a18e2f2dcd..6da03a37bdd5 100644
> > --- a/arch/x86/kvm/kvm_emulate.h
> > +++ b/arch/x86/kvm/kvm_emulate.h
> > @@ -94,6 +94,8 @@ struct x86_instruction_info {
> >  #define X86EMUL_F_FETCH                BIT(1)
> >  #define X86EMUL_F_IMPLICIT             BIT(2)
> >  #define X86EMUL_F_INVLPG               BIT(3)
> > +#define X86EMUL_F_MSR                  BIT(4)
> > +#define X86EMUL_F_BASE                 BIT(5)
> >  
> >  struct x86_emulate_ops {
> >         void (*vm_bugged)(struct x86_emulate_ctxt *ctxt);
> > ---
> > 
> > And then with that, we can do the below, and have emul_is_noncanonical_address()
> > redirect to is_noncanonical_address() instead of being an open coded equivalent.
> > 
> > ---
> > static inline u8 vcpu_virt_addr_bits(struct kvm_vcpu *vcpu)
> > {
> >         return kvm_is_cr4_bit_set(vcpu, X86_CR4_LA57) ? 57 : 48;
> > }
> > 
> > static inline u8 max_host_virt_addr_bits(void)
> > {
> >         return kvm_cpu_cap_has(X86_FEATURE_LA57) ? 57 : 48;
> > }

This is a good name for this function, thanks.

> > 
> > static inline bool is_noncanonical_address(u64 la, struct kvm_vcpu *vcpu,
> >                                            unsigned int flags)
> > {
> >         if (flags & (X86EMUL_F_INVLPG | X86EMUL_F_MSR | X86EMUL_F_DT_LOAD))
> >                 return !__is_canonical_address(la, max_host_virt_addr_bits());
> >         else
> >                 return !__is_canonical_address(la, vcpu_virt_addr_bits(vcpu));
> > }

This can work in principle, although are you OK with using these emulator flags
outside of the emulator code?

I am asking because the is_noncanonical_address is used in various places across KVM.

> > ---
> > 
> > That will make it _much_ harder to incorrectly use is_noncanonical_address(),
> > as all callers will be forced to specify the emulation type, i.e. there is no
> > automatic, implied default type.
> > 
> > Line lengths could get annoying, but with per-type flags, we could do selectively
> > add a few wrappers, e.g.
> > 
> > ---
> > static inline bool is_noncanonical_msr_address(u64 la, struct kvm_vcpu *vcpu)
> > {
> >         return is_noncanonical_address(la, vcpu, X86EMUL_F_MSR);
> > }
> > 
> > static inline bool is_noncanonical_base_address(u64 la, struct kvm_vcpu *vcpu)
> > {
> >         return is_noncanonical_address(la, vcpu, X86EMUL_F_BASE);
> > }
> > 
> > static inline bool is_noncanonical_invlpg_address(u64 la, struct kvm_vcpu *vcpu)
> > {
> >         return is_noncanonical_address(la, vcpu, X86EMUL_F_INVLPG);
> > }
> > ---
> > 
> > We wouldn't want wrapper for everything, e.g. to minimize the risk of creating a
> > de factor implicit default, but I think those three, and maybe a code/fetch
> > variant, will cover all but a few users.
> > 
> > > > About fixing the emulator this is what see:
> > > > 
> > > >         emul_is_noncanonical_address
> > > >                 __load_segment_descriptor
> > > >                         load_segment_descriptor
> > > >                                 em_lldt
> > > >                                 em_ltr
> > > > 
> > > >                 em_lgdt_lidt
> > > > 
> > > > 
> > > > 
> > > > While em_lgdt_lidt should be easy to fix because it calls
> > > > emul_is_noncanonical_address directly,
> > 
> > Those don't need to be fixed, they are validating the memory operand, not the
> > base of the descriptor, i.e. aren't exempt from CR4.LA57.

Are you sure?

em_lgdt_lidt reads its memory operands (and checks that it is canonical through
linearize) with read_descriptor and that is fine because this is memory fetch, 
and then it checks that the base address within the operand is canonical.

This check needs to be updated, as it is possible to load non canonical GDT and IDT
base via lgdt/lidt (I tested this).

For em_lldt, em_ltr, the check on the system segment descriptor base is
in __load_segment_descriptor:

...
 ret = linear_read_system(ctxt, desc_addr+8, &base3, sizeof(base3));
 if (ret != X86EMUL_CONTINUE)
 return ret;
 if (emul_is_noncanonical_address(get_desc_base(&seg_desc) |
 ((u64)base3 << 32), ctxt))
 return emulate_gp(ctxt, err_code);

...

64 bases are possible only for system segments, which are
TSS, LDT, and call gates/IDT descriptors.

We don't emulate IDT fetches in protected mode, and as I found out the hard way after
I wrote a unit test to do a call through a call gate, the emulator doesn't
support call gates either)

Thus I can safely patch __load_segment_descriptor.

> > 
> > > > the em_lldt, em_ltr will be harder
> > > > because these use load_segment_descriptor which calls
> > > > __load_segment_descriptor which in turn is also used for emulating of far
> > > > jumps/calls/rets, for which I do believe that canonical check does respect
> > > > CR4.LA57, but can't be sure either.
> > 
> > I'm fairly certain this is a non-issue.  CS, DS, ES, and SS have a fixed base of
> > '0' in 64-bit mode, i.e. are completely exempt from canonical checks because the
> > base address is always ignored.  And while FS and GS do have base addresses, the
> > segment descriptors themselves can only load 32-bit bases, i.e. _can't_ generate
> > non-canonical addresses.

Yes - all data segments in 64 bit mode, still have 32 bit base in the GDT/LDT,
so this is indeed not an issue.

> > 

> > There _is_ an indirect canonical check on the _descriptor_ (the actual descriptor
> > pointed at by the selector, not the memory operand).  The SDM is calls this case
> > out in the LFS/LDS docs:
> > 
> >   If the FS, or GS register is being loaded with a non-NULL segment selector and
> >   any of the following is true: the segment selector index is not within descriptor
> >   table limits, the memory address of the descriptor is non-canonical
> >                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I tested that both ltr and lldt do ignore CR4.LA57 by loading from a descriptor
which had a non canonical value and CR4.LA57 clear.

> > and similarly, when using a CALL GATE, the far transfer docs say:
> > 
> >   If the segment descriptor from a 64-bit call gate is in non-canonical space.

For the academic reference, when loading offset from a call gate descriptor,
the CPU does the canonical check, and it does honour the CR4.LA57.

A variation of the below test can prove it:

void set_callgate_entry(u16 sel, void *offset, int dpl)
{
	idt_entry_t *e = (idt_entry_t *)&gdt[sel >> 3];
	set_desc_entry(e, sizeof *e, offset, read_cs(), 12, dpl);
}

int main(int argc, char **argv)
{
	u64 gate_offset = (u64)&&gate_entry_point;

	// toggle below to check various cases:
	//setup_5level_page_table();
	//gate_offset = 0xff4547ceb1600000;

	set_callgate_entry(FIRST_SPARE_SEL, (void *)gate_offset, 0);

	struct { u64 offset; u16 sel; } call_target =
		{ 0 /* ignored */,
		  FIRST_SPARE_SEL};

	// Perform the far call
	asm volatile goto (
//	    KVM_FEP
	    ".byte 0x48; rex.w\n" // optional, just for fun
	    "lcall *%0\n"
	    "jmp %l1\n"
	:
	: "m" (call_target)
	:
	: return_address
	);

	gate_entry_point:
	printf("Call gate worked\n");

	asm volatile (
	     "retfq\n"
	);

	return_address:

	printf("Exit from call gate worked\n");
	return 0;
}

> > 
> > And given that those implicit accesses are not subjected to LAM/LASS, I strongly
> > suspect they honor CR4.LA57.  So the explicit emul_is_noncanonical_address()
> > check in __load_segment_descriptor() needs to be tagged X86EMUL_F_BASE, but
> > otherwise it all should Just Work (knock wood).
> > 
> > > > It is possible that far jumps/calls/rets also ignore CR4.LA57, and instead
> > > > set RIP to non canonical instruction, and then on first fetch, #GP happens.
> > 
> > I doubt this is the case for the final RIP check, especially since ucode does
> > check vmcs.HOST_RIP against vmcs.HOST_CR4.LA57, but it's worth testing to confirm.

See above.

So now I'll try to do a v4 of the patches with all of the feedback included.
Thanks!

> > 
> > > > I'll setup another unit test for this. RIP of the #GP will determine if the
> > > > instruction failed or the next fetch.
> > 

Best regards,
	Maxim Levitsky