On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> On Wed, Jan 24, 2024, Sean Christopherson wrote:
> > On Wed, Jan 24, 2024, Aaron Lewis wrote:
> > > On Wed, Jan 24, 2024 at 7:49 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> > > >
> > > > On Wed, Jan 24, 2024, Mingwei Zhang wrote:
> > > > No, this is just papering over the underlying bug. KVM shouldn't be stuffing
> > > > vcpu->arch.perf_capabilities without explicit writes from host userspace. E.g
> > > > KVM_SET_CPUID{,2} is allowed multiple times, at which point KVM could clobber a
> > > > host userspace write to MSR_IA32_PERF_CAPABILITIES. It's unlikely any userspace
> > > > actually does something like that, but KVM overwriting guest state is almost
> > > > never a good thing.
> > > >
> > > > I've been meaning to send a patch for a long time (IIRC, Aaron also ran into this?).
> > > > KVM needs to simply not stuff vcpu->arch.perf_capabilities. I believe we are
> > > > already fudging around this in our internal kernels, so I don't think there's a
> > > > need to carry a hack-a-fix for the destination kernel.
> > > >
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 27e23714e960..fdef9d706d61 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -12116,7 +12116,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
> > > >
> > > >  	kvm_async_pf_hash_reset(vcpu);
> > > >
> > > > -	vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap;
> > >
> > > Yeah, that will fix the issue we are seeing. The only thing that's
> > > not clear to me is if userspace should expect KVM to set this or if
> > > KVM should expect userspace to set this. How is that generally
> > > decided?
> >
> > By "this", you mean the effective RESET value for vcpu->arch.perf_capabilities?
> > To be consistent with KVM's CPUID module at vCPU creation, which is completely
> > empty (vCPU has no PMU and no PDCM support) KVM *must* zero
> > vcpu->arch.perf_capabilities.
> >
> > If userspace wants a non-zero value, then userspace needs to set CPUID to enable
> > PDCM and set MSR_IA32_PERF_CAPABILITIES.
> >
> > MSR_IA32_ARCH_CAPABILITIES is in the same boat, e.g. a vCPU without
> > X86_FEATURE_ARCH_CAPABILITIES can end up seeing a non-zero MSR value. That too
> > should be excised.
> >
> hmm, does that mean KVM just allows an invalid vcpu state exist from
> host point of view?

Yes.

https://lore.kernel.org/all/ZC4qF90l77m3X1Ir@xxxxxxxxxx

> I think this makes a lot of confusions on migration where VMM on the source
> believes that a non-zero value from KVM_GET_MSRS is valid and the VMM on the
> target will find it not true.

Yes, but seeing a non-zero value is a KVM bug that should be fixed.

> If we follow the suggestion by removing the initial value at vCPU
> creation time, then I think it breaks the existing VMM code, since that
> requires VMM to explicitly set the MSR, which I am not sure we do today.

Yeah, I'm hoping we can squeak by without breaking existing setups.

I'm 99% certain QEMU is ok, as QEMU has explicitly set MSR_IA32_PERF_CAPABILITIES
since support for PDCM/PERF_CAPABILITIES was added by commit ea39f9b643
("target/i386: define a new MSR based feature word - FEAT_PERF_CAPABILITIES").

Frankly, if our VMM doesn't do the same, then it's wildly busted. Relying on
KVM to define the vCPU is irresponsible, to put it nicely.

> The following code below is different. The key difference is that the
> following code preserves a valid value, but this case is to not preserve
> an invalid value.

But it's a completely different fix. I referenced that commit to call out that
the need for the commit and changelog suggests that someone (*cough* us) is
relying on KVM to initialize MSR_PLATFORM_INFO, and has been doing so for a very
long time. That doesn't mean it's the correct KVM behavior, just that it's much
riskier to change.
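
To make the clobbering concern from the quoted text concrete, here is a toy
model in plain C. It is only a sketch: struct toy_vcpu, every toy_* function,
and the TOY_SUPPORTED_PERF_CAP value are hypothetical stand-ins invented for
illustration, not KVM code or KVM's actual control flow. It replays "set CPUID,
write the MSR, set CPUID again" with and without the kernel stuffing the value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Arbitrary placeholder for kvm_caps.supported_perf_cap; not a real value. */
#define TOY_SUPPORTED_PERF_CAP 0xffULL

/* Toy stand-in for the relevant vCPU state; not a real KVM structure. */
struct toy_vcpu {
	bool has_pdcm;              /* guest CPUID advertises PDCM */
	uint64_t perf_capabilities; /* models vcpu->arch.perf_capabilities */
};

/* vCPU creation with the stuffing removed: empty CPUID, zeroed MSR. */
static struct toy_vcpu toy_vcpu_create(void)
{
	struct toy_vcpu v = { .has_pdcm = false, .perf_capabilities = 0 };
	return v;
}

/* Models a CPUID update; 'stuffing' selects the behavior being criticized. */
static void toy_set_cpuid(struct toy_vcpu *v, bool pdcm, bool stuffing)
{
	v->has_pdcm = pdcm;
	if (stuffing)
		v->perf_capabilities = pdcm ? TOY_SUPPORTED_PERF_CAP : 0;
}

/* Models a host userspace write to MSR_IA32_PERF_CAPABILITIES. */
static int toy_set_perf_capabilities(struct toy_vcpu *v, uint64_t data)
{
	if (data & ~TOY_SUPPORTED_PERF_CAP)
		return -1; /* reject bits outside the supported set */
	v->perf_capabilities = data;
	return 0;
}

/* Replays: set CPUID, write the MSR, set CPUID again; returns the MSR value. */
static uint64_t toy_replay(bool stuffing, uint64_t user_value)
{
	struct toy_vcpu v = toy_vcpu_create();

	/* Pick a value that differs from the stuffed one so a clobber is visible. */
	assert(user_value != TOY_SUPPORTED_PERF_CAP);

	toy_set_cpuid(&v, true, stuffing);
	if (toy_set_perf_capabilities(&v, user_value))
		return UINT64_MAX; /* write rejected */
	toy_set_cpuid(&v, true, stuffing);
	return v.perf_capabilities;
}
```

In this model, toy_replay(true, 0x1) comes back as TOY_SUPPORTED_PERF_CAP
because the second CPUID update overwrites the userspace write, while
toy_replay(false, 0x1) comes back as 0x1. A freshly created toy vCPU reads
zero, matching the empty CPUID model at creation.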