On Wed, Jan 24, 2024, Sean Christopherson wrote:
> On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > On Wed, Jan 24, 2024, Sean Christopherson wrote:
> > > On Wed, Jan 24, 2024, Aaron Lewis wrote:
> > > > On Wed, Jan 24, 2024 at 7:49 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > > >
> > > > > On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > > > > No, this is just papering over the underlying bug. KVM shouldn't be stuffing
> > > > > vcpu->arch.perf_capabilities without explicit writes from host userspace. E.g
> > > > > KVM_SET_CPUID{,2} is allowed multiple times, at which point KVM could clobber a
> > > > > host userspace write to MSR_IA32_PERF_CAPABILITIES. It's unlikely any userspace
> > > > > actually does something like that, but KVM overwriting guest state is almost
> > > > > never a good thing.
> > > > >
> > > > > I've been meaning to send a patch for a long time (IIRC, Aaron also ran into this?).
> > > > > KVM needs to simply not stuff vcpu->arch.perf_capabilities. I believe we are
> > > > > already fudging around this in our internal kernels, so I don't think there's a
> > > > > need to carry a hack-a-fix for the destination kernel.
> > > > >
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index 27e23714e960..fdef9d706d61 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -12116,7 +12116,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > > > >
> > > > > 	kvm_async_pf_hash_reset(vcpu);
> > > > >
> > > > > -	vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap;
> > > >
> > > > Yeah, that will fix the issue we are seeing. The only thing that's
> > > > not clear to me is if userspace should expect KVM to set this or if
> > > > KVM should expect userspace to set this. How is that generally
> > > > decided?
> > >
> > > By "this", you mean the effective RESET value for vcpu->arch.perf_capabilities?
> > > To be consistent with KVM's CPUID module at vCPU creation, which is completely
> > > empty (vCPU has no PMU and no PDCM support) KVM *must* zero
> > > vcpu->arch.perf_capabilities.
> > >
> > > If userspace wants a non-zero value, then userspace needs to set CPUID to enable
> > > PDCM and set MSR_IA32_PERF_CAPABILITIES.
> > >
> > > MSR_IA32_ARCH_CAPABILITIES is in the same boat, e.g. a vCPU without
> > > X86_FEATURE_ARCH_CAPABILITIES can end up seeing a non-zero MSR value. That too
> > > should be excised.
> > >
> > hmm, does that mean KVM just allows an invalid vcpu state exist from
> > host point of view?
>
> Yes.
>
> https://lore.kernel.org/all/ZC4qF90l77m3X1Ir@xxxxxxxxxx
>
> > I think this makes a lot of confusions on migration where VMM on the source
> > believes that a non-zero value from KVM_GET_MSRS is valid and the VMM on the
> > target will find it not true.
>
> Yes, but seeing a non-zero value is a KVM bug that should be fixed.
>

How about adding an entry in vmx_get_msr() for MSR_IA32_PERF_CAPABILITIES
that checks pmu_version? This basically pairs with the implementation in
vmx_set_msr() for MSR_IA32_PERF_CAPABILITIES. Doing so allows KVM_GET_MSRS
to return 0 for the MSR when the vCPU has no PMU, instead of returning the
initial permitted value. The benefit is that it does not force the VMM to
explicitly set the value. In fact, there are several platform MSRs which
have initial values that the VMM may rely on instead of explicitly setting
them. MSR_IA32_PERF_CAPABILITIES is only one of them.

> > If we follow the suggestion by removing the initial value at vCPU
> > creation time, then I think it breaks the existing VMM code, since that
> > requires VMM to explicitly set the MSR, which I am not sure we do today.
>
> Yeah, I'm hoping we can squeak by without breaking existing setups.
>
> I'm 99% certain QEMU is ok, as QEMU has explicitly set MSR_IA32_PERF_CAPABILITIES
> since support for PDCM/PERF_CAPABILITIES was added by commit ea39f9b643
> ("target/i386: define a new MSR based feature word - FEAT_PERF_CAPABILITIES").
>
> Frankly, if our VMM doesn't do the same, then it's wildly busted. Relying on
> KVM to define the vCPU is irresponsible, to put it nicely.
>
> > The following code below is different. The key difference is that the
> > following code preserves a valid value, but this case is to not preserve
> > an invalid value.
>
> But it's a completely different fix. I referenced that commit to call out that
> the need for the commit and changelog suggests that someone (*cough* us) is relying
> on KVM to initialize MSR_PLATFORM_INFO, and has been doing so for a very long time.
> That doesn't mean it's the correct KVM behavior, just that it's much riskier to
> change.
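
For illustration only, the vmx_get_msr() idea discussed in this thread can
be modeled in plain userspace C. This is a rough sketch, not kernel code and
not a patch: struct model_vcpu and the model_*() helpers are hypothetical
names, and the only assumption taken from the thread is that a read would be
gated on pmu_version the same way a non-zero write is rejected for a vCPU
without a PMU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the vCPU state discussed in the thread. */
struct model_vcpu {
	unsigned int pmu_version;   /* 0: no PMU exposed to the guest */
	uint64_t perf_capabilities; /* models vcpu->arch.perf_capabilities */
};

/*
 * Set side: refuse a non-zero value when the vCPU has no PMU.
 * Returns 0 on success, 1 on rejection.
 */
int model_set_perf_capabilities(struct model_vcpu *vcpu, uint64_t data)
{
	if (data && !vcpu->pmu_version)
		return 1;

	vcpu->perf_capabilities = data;
	return 0;
}

/*
 * Get side (the proposal): report 0 when the vCPU has no PMU, even if
 * a non-zero initial value was stuffed at vCPU creation.
 */
uint64_t model_get_perf_capabilities(const struct model_vcpu *vcpu)
{
	if (!vcpu->pmu_version)
		return 0;

	return vcpu->perf_capabilities;
}

/* Usage example: the stuffed value is hidden until a PMU is configured. */
void model_selfcheck(void)
{
	/* 0x7 is an arbitrary placeholder for the stuffed initial value. */
	struct model_vcpu vcpu = { .pmu_version = 0, .perf_capabilities = 0x7 };

	assert(model_get_perf_capabilities(&vcpu) == 0);
	vcpu.pmu_version = 2;
	assert(model_get_perf_capabilities(&vcpu) == 0x7);
}
```

Note that in this model the stored value is only hidden on reads, not
cleared, so it does not address the clobbering concern raised earlier in the
thread about KVM stuffing vcpu->arch.perf_capabilities.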