У вт, 2024-08-20 у 15:13 +0300, mlevitsk@xxxxxxxxxx пише: > У пт, 2024-08-16 у 14:49 -0700, Sean Christopherson пише: > > > > > On Thu, Aug 15, 2024, Maxim Levitsky wrote: > > > > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > > > > > > > index ce7c00894f32..2e83f7d74591 100644 > > > > > > > > > --- a/arch/x86/kvm/x86.c > > > > > > > > > +++ b/arch/x86/kvm/x86.c > > > > > > > > > @@ -302,6 +302,31 @@ const struct kvm_stats_header kvm_vcpu_stats_header = { > > > > > > > > > sizeof(kvm_vcpu_stats_desc), > > > > > > > > > }; > > > > > > > > > > > > > > > > > > + > > > > > > > > > +/* > > > > > > > > > + * Most x86 arch MSR values which contain linear addresses like > > > > > > > > > > Is it most, or all? I'm guessing all? > > I can't be sure that all of them are like that - there could be some outliers that behave differently. > > One of the things my work at Intel taught me is that there is nothing consistent > in x86 spec, anything is possible and nothing can be assumed. > > I dealt only with those msrs, that KVM checks for canonicality, therefore I use the word > 'most'. There could be other msrs that are not known to me and/or to KVM. > > I can write 'some' if you prefer. Hi, So I did some more reverse engineering and indeed, 'some' is the right word: I audited all places in KVM which check an linear address for being canonical and this is what I found: - MSR_IA32_BNDCFGS - since it is not supported on CPUs with 5 level paging, its not possible to know what the hardware does. - MSR_IA32_DS_AREA: - Ignores CR4.LA57 as expected. Tested by booting into kernel with 5 level paging disabled and then using userspace 'wrmsr' program to set this msr. I attached the bash script that I used - MSR_IA32_RTIT_ADDR0_A ... MSR_IA32_RTIT_ADDR3_B: - Exactly the same story, but for some reason the host doesn't suport (not even read) from MSR_IA32_RTIT_ADDR2_*, MSR_IA32_RTIT_ADDR3_*. Probably the system is not new enough for these. - invpcid instruction. It is exposed to the guest without interception (unless !npt or !ept), and yes, it works just fine on 57-canonical address without CR4.LA57 set.... - invvpid - this one belongs to VMX set, so technically its for nesting although it is run by L1, it is always emulated by KVM, but still executed on the host just with different vpid, so I booted the host without 5 level paging, and patched KVM to avoid canonical check. Also 57-canonical adddress worked just fine, and fully non canonical address failed. and gave a warning in 'invvpid_error' Should I fix all of these too? About fixing the emulator this is what see: emul_is_noncanonical_address __load_segment_descriptor load_segment_descriptor em_lldt em_ltr em_lgdt_lidt While em_lgdt_lidt should be easy to fix because it calls emul_is_noncanonical_address directly, the em_lldt, em_ltr will be harder because these use load_segment_descriptor which calls __load_segment_descriptor which in turn is also used for emulating of far jumps/calls/rets, for which I do believe that canonical check does respect CR4.LA57, but can't be sure either. It is possible that far jumps/calls/rets also ignore CR4.LA57, and instead set RIP to non canonical instruction, and then on first fetch, #GP happens. I'll setup another unit test for this. RIP of the #GP will determine if the instruction failed or the next fetch. Best regards, Maxim Levitsky > > > > > > > > > > > > > > > + * segment bases, addresses that are used in instructions (e.g SYSENTER), > > > > > > > > > + * have static canonicality checks, > > > > > > > > > > Weird and early line breaks. > > > > > > > > > > How about this? > > > > > > > > > > /* > > > > > * The canonicality checks for MSRs that hold linear addresses, e.g. segment > > > > > * bases, SYSENTER targets, etc., are static, in the sense that they are based > > > > > * on CPU _support_ for 5-level paging, not the state of CR4.LA57. > > > > > > > > > > > > > > + * size of whose depends only on CPU's support for 5-level > > > > > > > > > + * paging, rather than state of CR4.LA57. > > > > > > > > > + * > > > > > > > > > + * In addition to that, some of these MSRS are directly passed > > > > > > > > > + * to the guest (e.g MSR_KERNEL_GS_BASE) thus even if the guest > > > > > > > > > + * doen't have LA57 enabled in its CPUID, for consistency with > > > > > > > > > + * CPUs' ucode, it is better to pivot the check around host > > > > > > > > > + * support for 5 level paging. > > > > > > > > > > I think we should elaborate on why it's better. It only takes another line or > > > > > two, and that way we don't forget the edge cases that make properly emulating > > > > > guest CPUID a bad idea. > > OK, will do. > > > > > > > > > > > * This creates a virtualization hole where a guest writes to passthrough MSRs > > > > > * may incorrectly succeed if the CPU supports LA57, but the vCPU does not > > > > > * (because hardware has no awareness of guest CPUID). Do not try to plug this > > > > > * hole, i.e. emulate the behavior for intercepted accesses, as injecting #GP > > > > > * depending on whether or not KVM happens to emulate a WRMSR would result in > > > > > * non-deterministic behavior, and could even allow L2 to crash L1, e.g. if L1 > > > > > * passes through an MSR to L2, and then tries to save+restore L2's value. > > > > > */ > > > > > > > > > > > > > > + > > > > > > > > > +static u8 max_host_supported_virt_addr_bits(void) > > > > > > > > > > Any objection to dropping the "supported", i.e. going with max_host_virt_addr_bits()? > > > > > Mostly to shorten the name, but also because "supported" suggests there's software > > > > > involvement, e.g. the max supported by the kernel/KVM, which isn't the case. > > Doesn't matter to me. > > > > > > > > > > > If you're ok with the above, I'll fixup when applying. > > > > > > > Best regards, > Maxim Levitsky