On Fri, Sep 22, 2023 at 11:15 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > On Fri, Sep 22, 2023, Jim Mattson wrote: > > On Fri, Sep 22, 2023 at 10:21 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > > IMO, we should delete the offending kernel code. I don't see how it provides any > > > value these days. > > > > Sure, but that doesn't help legacy guests. > > Heh, IMO they don't need help, their owners just need to be placated ;-) > > > > And *if* we want to change something in KVM so that we stop getting coustomer > > > complaints about a useless bit, just let userspace stuff the bit. > > > > We want to make customers happy. That should not even be a question. > > Can we really not tell them "this is a benign guest bug, ignore it"? What is the mechanism for doing that? > > > I think we should also raise the issue with AMD (Borislav maybe?) and ask/demand > > > that bits in HWCR that KVM allows to be set are architecturally defined. It's > > > totally fine if the value of bit 24 is uarch specific, but the behavior needs to > > > be something that won't change from processor to processor. > > > > > > > kvm_pmu_refresh(vcpu); > > > > vcpu->arch.cr4_guest_rsvd_bits = > > > > __cr4_reserved_bits(guest_cpuid_has, vcpu); > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > > index 3421ed7fcee0..cb02a7c2938b 100644 > > > > --- a/arch/x86/kvm/x86.c > > > > +++ b/arch/x86/kvm/x86.c > > > > @@ -3699,12 +3699,19 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info) > > > > data &= ~(u64)0x40; /* ignore flush filter disable */ > > > > data &= ~(u64)0x100; /* ignore ignne emulation enable */ > > > > data &= ~(u64)0x8; /* ignore TLB cache disable */ > > > > + data &= ~(u64)0x1000000;/* ignore TscFreqSel */ > > > > > > > > /* Handle McStatusWrEn */ > > > > if (data & ~BIT_ULL(18)) { > > > > kvm_pr_unimpl_wrmsr(vcpu, msr, data); > > > > return 1; > > > > } > > > > + > > > > + /* > > > > + * When set, TscFreqSel is read-only. Attempts to > > > > + * clear it are ignored. > > > > + */ > > > > + data |= vcpu->arch.msr_hwcr & BIT_ULL(24); > > > > > > > > > The bit is read-only from the guest, but KVM needs to let userspace clear the > > > bit. > > > > Why? We don't let userspace clear bit 1 of EFLAGS, which is also a > > "reads as one" bit. > > Because that's architectural behavior, not dependent on FMS, and KVM needs to > write EFLAGS to have any hope of being useful, i.e. giving ownership of EFLAGS > to userspace is not a realistic option. Remind me what "MSR" stands for. :) > As proposed, if userspace sets CPUID to a magic FMS, and then changes the FMS to > something else, kvm_vcpu_after_set_cpuid() will not clear the bit and KVM will > end up wrongly enumerating the bit. I doubt userspace would ever do that, but > it's at least possible. > > That could be fixed by actively clearing vcpu->arch.msr_hwcr for other FMS values, > but then KVM would have to be 100% precise on the FMS matching, which would be a > maintenance nightmare. What if I did something crude like we do for MSR_IA32_MISC_ENABLE, and just set the bit at reset regardless of FMS: diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index cb02a7c2938b..4d7d0de42a9d 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -12086,6 +12086,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) vcpu->arch.msr_misc_features_enables = 0; vcpu->arch.ia32_misc_enable_msr = MSR_IA32_MISC_ENABLE_PEBS_UNAVAIL | MSR_IA32_MISC_ENABLE_BTS_UNAVAIL; + vcpu_arch.msr_hwcr = BIT_ULL(24); __kvm_set_xcr(vcpu, 0, XFEATURE_MASK_FP); __kvm_set_msr(vcpu, MSR_IA32_XSS, 0, true); > In other words, userspace owns the vCPU model, and for good reasons. KVM needs > to allow userspace to define a sane model, but with *very* few exceptions, KVM > should not try to "help" userspace by stuffing bits. Okay. What about the IA32_MISC_ENABLE bits above? > Pretty much everytime KVM tries to help, it causes problems. E.g. initializing > perf_capabilities to kvm_caps.supported_perf_cap seems like a good thing, except > it presents a bogus model if userspace decides to not enumerate a vPMU to the > guest (Aaron was allegedly going to send a patch for this...). KVM is nothing if not inconsistent.