On Wed, Jan 24, 2024, Sean Christopherson wrote: > On Wed, Jan 24, 2024, Aaron Lewis wrote: > > On Wed, Jan 24, 2024 at 7:49 AM Sean Christopherson <seanjc@xxxxxxxxxx> wrote: > > > > > > On Wed, Jan 24, 2024, Mingwei Zhang wrote: > > > > Reset vcpu->arch.perf_capabilities to 0 if PDCM is disabled in guest cpuid. > > > > Without this, there is an issue in live migration. In particular, to > > > > migrate a VM with no PDCM enabled, VMM on the source is able to retrieve a > > > > non-zero value by reading the MSR_IA32_PERF_CAPABILITIES. However, VMM on > > > > the target is unable to set the value. This creates confusions on the user > > > > side. > > > > > > > > Fundamentally, it is because vcpu->arch.perf_capabilities as the cached > > > > value of MSR_IA32_PERF_CAPABILITIES is incorrect, and there is nothing > > > > wrong on the kvm_get_msr_common() which just reads > > > > vcpu->arch.perf_capabilities. > > > > > > > > Fix the issue by adding the reset code in kvm_vcpu_after_set_cpuid(), i.e. > > > > early in VM setup time. > > > > > > > > Cc: Aaron Lewis <aaronlewis@xxxxxxxxxx> > > > > Signed-off-by: Mingwei Zhang <mizhang@xxxxxxxxxx> > > > > --- > > > > arch/x86/kvm/cpuid.c | 3 +++ > > > > 1 file changed, 3 insertions(+) > > > > > > > > diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c > > > > index adba49afb5fe..416bee03c42a 100644 > > > > --- a/arch/x86/kvm/cpuid.c > > > > +++ b/arch/x86/kvm/cpuid.c > > > > @@ -369,6 +369,9 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu) > > > > vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu); > > > > vcpu->arch.reserved_gpa_bits = kvm_vcpu_reserved_gpa_bits_raw(vcpu); > > > > > > > > + /* Reset MSR_IA32_PERF_CAPABILITIES guest value to 0 if PDCM is off. */ > > > > + if (!guest_cpuid_has(vcpu, X86_FEATURE_PDCM)) > > > > + vcpu->arch.perf_capabilities = 0; > > > > > > No, this is just papering over the underlying bug. KVM shouldn't be stuffing > > > vcpu->arch.perf_capabilities without explicit writes from host userspace. E.g > > > KVM_SET_CPUID{,2} is allowed multiple times, at which point KVM could clobber a > > > host userspace write to MSR_IA32_PERF_CAPABILITIES. It's unlikely any userspace > > > actually does something like that, but KVM overwriting guest state is almost > > > never a good thing. > > > > > > I've been meaning to send a patch for a long time (IIRC, Aaron also ran into this?). > > > KVM needs to simply not stuff vcpu->arch.perf_capabilities. I believe we are > > > already fudging around this in our internal kernels, so I don't think there's a > > > need to carry a hack-a-fix for the destination kernel. > > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > index 27e23714e960..fdef9d706d61 100644 > > > --- a/arch/x86/kvm/x86.c > > > +++ b/arch/x86/kvm/x86.c > > > @@ -12116,7 +12116,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu) > > > > > > kvm_async_pf_hash_reset(vcpu); > > > > > > - vcpu->arch.perf_capabilities = kvm_caps.supported_perf_cap; > > > > Yeah, that will fix the issue we are seeing. The only thing that's > > not clear to me is if userspace should expect KVM to set this or if > > KVM should expect userspace to set this. How is that generally > > decided? > > By "this", you mean the effective RESET value for vcpu->arch.perf_capabilities? > To be consistent with KVM's CPUID module at vCPU creation, which is completely > empty (vCPU has no PMU and no PDCM support) KVM *must* zero > vcpu->arch.perf_capabilities. > > If userspace wants a non-zero value, then userspace needs to set CPUID to enable > PDCM and set MSR_IA32_PERF_CAPABILITIES. > > MSR_IA32_ARCH_CAPABILITIES is in the same boat, e.g. a vCPU without > X86_FEATURE_ARCH_CAPABILITIES can end up seeing a non-zero MSR value. That too > should be excised. > hmm, does that mean KVM just allows an invalid vcpu state exist from host point of view? I think this makes a lot of confusions on migration where VMM on the source believes that a non-zero value from KVM_GET_MSRS is valid and the VMM on the target will find it not true. If we follow the suggestion by removing the initial value at vCPU creation time, then I think it breaks the existing VMM code, since that requires VMM to explicitly set the MSR, which I am not sure we do today. The following code below is different. The key difference is that the following code preserves a valid value, but this case is to not preserve an invalid value. Thanks. -Mingwei > In a perfect world, KVM would also zero-initialize vcpu->arch.msr_platform_info, > but that one is less obviously broken and also less obviously safe to remove. > > commit e53d88af63ab4104e1226b8f9959f1e9903da10b > Author: Jim Mattson <jmattson@xxxxxxxxxx> > AuthorDate: Tue Oct 30 12:20:21 2018 -0700 > Commit: Paolo Bonzini <pbonzini@xxxxxxxxxx> > CommitDate: Fri Dec 14 18:00:01 2018 +0100 > > kvm: x86: Don't modify MSR_PLATFORM_INFO on vCPU reset > > If userspace has provided a different value for this MSR (e.g with the > turbo bits set), the userspace-provided value should survive a vCPU > reset. For backwards compatibility, MSR_PLATFORM_INFO is initialized > in kvm_arch_vcpu_setup. > > Signed-off-by: Jim Mattson <jmattson@xxxxxxxxxx> > Reviewed-by: Drew Schmitt <dasch@xxxxxxxxxx> > Cc: Abhiroop Dabral <adabral@xxxxxxxxxxxxxxxxxxxx> > Signed-off-by: Paolo Bonzini <pbonzini@xxxxxxxxxx> > > In other words, KVM shouldn't define the vCPU model beyond the absolute bare > minimum that is required by the x86 architecture (as of P6 CPUs, which is more > or less the oldest CPU KVM can reasonably virtualize without carrying useless code).