Hi Maksim,

On 11/7/24 1:00 PM, Maksim Davydov wrote:
> 
> 
> On 11/4/24 12:40, Dongli Zhang wrote:
>> QEMU uses the kvm_get_msrs() function to save Intel PMU registers from KVM
>> and kvm_put_msrs() to restore them to KVM. However, there is no support for
>> AMD PMU registers. Currently, has_pmu_version and num_pmu_gp_counters are
>> initialized based on cpuid(0xa), which does not apply to AMD processors.
>> For AMD CPUs, prior to PerfMonV2, the number of general-purpose registers
>> is determined based on the CPU version.
>>
>> To address this issue, we need to add support for AMD PMU registers.
>> Without this support, the following problems can arise:
>>
>> 1. If the VM is reset (e.g., via QEMU system_reset or VM kdump/kexec) while
>> running "perf top", the PMU registers are not disabled properly.
>>
>> 2. Despite x86_cpu_reset() resetting many registers to zero, kvm_put_msrs()
>> does not handle AMD PMU registers, causing some PMU events to remain
>> enabled in KVM.
>>
>> 3. The KVM kvm_pmc_speculative_in_use() function consistently returns true,
>> preventing the reclamation of these events. Consequently, the
>> kvm_pmc->perf_event remains active.
>>
>> 4. After a reboot, the VM kernel may report the following error:
>>
>> [    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS detected,
>> complain to your hardware vendor.
>> [    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR
>> c0010200 is 530076)
>>
>> 5. In the worst case, the active kvm_pmc->perf_event may inject unknown
>> NMIs randomly into the VM kernel:
>>
>> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
>>
>> To resolve these issues, we propose resetting AMD PMU registers during the
>> VM reset process.
>>
>> Signed-off-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
>> ---
>>   target/i386/cpu.h     |   8 +++
>>   target/i386/kvm/kvm.c | 156 +++++++++++++++++++++++++++++++++++++++++-
>>   2 files changed, 161 insertions(+), 3 deletions(-)
>>
>> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
>> index 59959b8b7a..0505eb3b08 100644
>> --- a/target/i386/cpu.h
>> +++ b/target/i386/cpu.h
>> @@ -488,6 +488,14 @@ typedef enum X86Seg {
>>   #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
>>   #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
>> +#define MSR_K7_EVNTSEL0                 0xc0010000
>> +#define MSR_K7_PERFCTR0                 0xc0010004
>> +#define MSR_F15H_PERF_CTL0              0xc0010200
>> +#define MSR_F15H_PERF_CTR0              0xc0010201
>> +
>> +#define AMD64_NUM_COUNTERS              4
>> +#define AMD64_NUM_COUNTERS_CORE         6
>> +
>>   #define MSR_MC0_CTL                     0x400
>>   #define MSR_MC0_STATUS                  0x401
>>   #define MSR_MC0_ADDR                    0x402
>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>> index ca2b644e2c..83ec85a9b9 100644
>> --- a/target/i386/kvm/kvm.c
>> +++ b/target/i386/kvm/kvm.c
>> @@ -2035,7 +2035,7 @@ full:
>>       abort();
>>   }
>> -static void kvm_init_pmu_info(CPUX86State *env)
>> +static void kvm_init_pmu_info_intel(CPUX86State *env)
>>   {
>>       uint32_t eax, edx;
>>       uint32_t unused;
>> @@ -2072,6 +2072,80 @@ static void kvm_init_pmu_info(CPUX86State *env)
>>       }
>>   }
>> +static void kvm_init_pmu_info_amd(CPUX86State *env)
>> +{
>> +    int64_t family;
>> +
>> +    has_pmu_version = 0;
>> +
>> +    /*
>> +     * To determine the CPU family, the following code is derived from
>> +     * x86_cpuid_version_get_family().
>> +     */
>> +    family = (env->cpuid_version >> 8) & 0xf;
>> +    if (family == 0xf) {
>> +        family += (env->cpuid_version >> 20) & 0xff;
>> +    }
>> +
>> +    /*
>> +     * Performance-monitoring supported from K7 and later.
>> +     */
>> +    if (family < 6) {
>> +        return;
>> +    }
>> +
>> +    has_pmu_version = 1;
>> +
>> +    if (!(env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE)) {
>> +        num_pmu_gp_counters = AMD64_NUM_COUNTERS;
>> +        return;
>> +    }
>> +
>> +    num_pmu_gp_counters = AMD64_NUM_COUNTERS_CORE;
>> +}
> 
> It seems that the AMD implementation has one issue.
> KVM has the parameter `enable_pmu`. So vPMU can be disabled in another way,
> not only via KVM_PMU_CAP_DISABLE. For Intel it's not a problem, because the
> vPMU initialization uses info from KVM_GET_SUPPORTED_CPUID, and the
> enable_pmu state is reflected in KVM_GET_SUPPORTED_CPUID. Thus no PMU MSRs
> will be used in kvm_put_msrs/kvm_get_msrs.
> 
> But on AMD we don't use information from KVM_GET_SUPPORTED_CPUID to set an
> appropriate number of PMU registers. So, if vPMU is disabled by the KVM
> parameter `enable_pmu` while pmu-cap-disable=false, then has_pmu_version
> will be 1 after kvm_init_pmu_info_amd() executes. It means that 4 PMU
> counters will be processed in kvm_put_msrs/kvm_get_msrs, but the correct
> behavior in that situation is to skip all PMU registers.
> I think we should get info from KVM to fix that.
> 
> I tested this series on Zen2 and found that PMU MSRs were still processed
> during initialization even with enable_pmu=N. But it doesn't lead to any
> errors in QEMU.

Thank you very much for the feedback and for helping catch the bug!

When enable_pmu=N, QEMU (with this patchset) cannot tell via
KVM_CAP_PMU_CAPABILITY whether vPMU is supported. As it cannot disable the
PMU, it falls back to the legacy 4 counters.

It falls back to 4 counters because KVM enumerates PERFCORE only when
enable_pmu=Y, i.e.,

5220         if (enable_pmu) {
5221                 /*
5222                  * Enumerate support for PERFCTR_CORE if and only if KVM has
5223                  * access to enough counters to virtualize "core" support,
5224                  * otherwise limit vPMU support to the legacy number of counters.
5225                  */
5226                 if (kvm_pmu_cap.num_counters_gp < AMD64_NUM_COUNTERS_CORE)
5227                         kvm_pmu_cap.num_counters_gp = min(AMD64_NUM_COUNTERS,
5228                                                           kvm_pmu_cap.num_counters_gp);
5229                 else
5230                         kvm_cpu_cap_check_and_set(X86_FEATURE_PERFCTR_CORE);
5231
5232                 if (kvm_pmu_cap.version != 2 ||
5233                     !kvm_cpu_cap_has(X86_FEATURE_PERFCTR_CORE))
5234                         kvm_cpu_cap_clear(X86_FEATURE_PERFMON_V2);
5235         }

During bootup and reset, QEMU (with this patchset) erroneously resets the
MSRs for the 4 PMCs, via line 3827.

3825 static int kvm_buf_set_msrs(X86CPU *cpu)
3826 {
3827     int ret = kvm_vcpu_ioctl(CPU(cpu), KVM_SET_MSRS, cpu->kvm_msr_buf);
3828     if (ret < 0) {
3829         return ret;
3830     }
3831
3832     if (ret < cpu->kvm_msr_buf->nmsrs) {
3833         struct kvm_msr_entry *e = &cpu->kvm_msr_buf->entries[ret];
3834         error_report("error: failed to set MSR 0x%" PRIx32 " to 0x%" PRIx64,
3835                      (uint32_t)e->index, (uint64_t)e->data);
3836     }
3837
3838     assert(ret == cpu->kvm_msr_buf->nmsrs);
3839     return 0;
3840 }

Because enable_pmu=N, KVM does not support those registers. However, it
returns 0 (not 1), because KVM does nothing in the implicit 'else' (i.e.,
line 4144).

3847 int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
3848 {
... ...
4138         case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR3:
4139         case MSR_P6_PERFCTR0 ... MSR_P6_PERFCTR1:
4140         case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL3:
4141         case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL1:
4142                 if (kvm_pmu_is_valid_msr(vcpu, msr))
4143                         return kvm_pmu_set_msr(vcpu, msr_info);
4144
4145                 if (data)
4146                         kvm_pr_unimpl_wrmsr(vcpu, msr, data);
4147                 break;
... ...
4224         default:
4225                 if (kvm_pmu_is_valid_msr(vcpu, msr))
4226                         return kvm_pmu_set_msr(vcpu, msr_info);
4227
4228                 /*
4229                  * Userspace is allowed to write '0' to MSRs that KVM reports
4230                  * as to-be-saved, even if an MSRs isn't fully supported.
4231                  */
4232                 if (msr_info->host_initiated && !data &&
4233                     kvm_is_msr_to_save(msr))
4234                         break;
4235
4236                 return KVM_MSR_RET_INVALID;
4237         }
4238         return 0;
4239 }
4240 EXPORT_SYMBOL_GPL(kvm_set_msr_common);

Fortunately, it returns 0 at line 4238, so no error is detected by QEMU.
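To illustrate, here is a minimal standalone sketch (my illustration, not
QEMU code; it assumes a vcpu_fd already obtained through the usual
/dev/kvm -> VM -> vCPU ioctl sequence):

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

#define MSR_K7_EVNTSEL0 0xc0010000

/*
 * Host-initiated write of '0' to MSR_K7_EVNTSEL0 via KVM_SET_MSRS.
 * Even with enable_pmu=N, KVM reports the entry as processed.
 */
static int write_zero_to_k7_evntsel0(int vcpu_fd)
{
    /* Header plus one entry: the same layout trick the KVM selftests use. */
    struct {
        struct kvm_msrs header;
        struct kvm_msr_entry entry;
    } buf;

    memset(&buf, 0, sizeof(buf));
    buf.header.nmsrs = 1;
    buf.entry.index = MSR_K7_EVNTSEL0;
    buf.entry.data = 0;

    /*
     * KVM_SET_MSRS returns the number of MSRs processed. This returns 1,
     * because the 'break' at line 4147 above falls through to 'return 0'
     * in kvm_set_msr_common(), i.e., the write is silently accepted.
     */
    return ioctl(vcpu_fd, KVM_SET_MSRS, &buf);
}

That is why kvm_buf_set_msrs() sees ret == nmsrs and reports nothing.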
Perhaps I need to send a small patch to the KVM mailing list that returns 1
in the implicit 'else', to confirm whether that behavior is expected.
However, the answer is very likely 'expected', because the comment at lines
4229-4230 already explains it.

Regarding the change in QEMU:

Since kvm_check_extension(s, KVM_CAP_PMU_CAPABILITY) returns 0 both when
(1) enable_pmu=N and when (2) KVM_CAP_PMU_CAPABILITY is not supported, I may
need to use g_file_get_contents() to read the KVM sysfs parameters (similar
to the KVM selftest kvm_is_pmu_enabled()).
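For example, a rough sketch of the direction (my illustration only; the
function name and the default-to-enabled fallback are assumptions, not the
final patch):

#include <glib.h>
#include <stdbool.h>

static bool kvm_module_pmu_enabled(void)
{
    gchar *contents = NULL;
    /* If the parameter does not exist (older kernels), assume vPMU is on. */
    bool enabled = true;

    if (g_file_get_contents("/sys/module/kvm/parameters/enable_pmu",
                            &contents, NULL, NULL)) {
        /* A module bool parameter reads back as "Y\n" or "N\n". */
        enabled = (contents[0] == 'Y' || contents[0] == 'y' ||
                   contents[0] == '1');
        g_free(contents);
    }

    return enabled;
}

kvm_init_pmu_info_amd() could then leave has_pmu_version at 0 whenever this
returns false, so that kvm_put_msrs()/kvm_get_msrs() skip all AMD PMU
registers, as you suggested.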
> 
>> +
>> +static bool is_same_vendor(CPUX86State *env)
>> +{
>> +    static uint32_t host_cpuid_vendor1;
>> +    static uint32_t host_cpuid_vendor2;
>> +    static uint32_t host_cpuid_vendor3;
>> +
>> +    host_cpuid(0x0, 0, NULL, &host_cpuid_vendor1, &host_cpuid_vendor3,
>> +               &host_cpuid_vendor2);
>> +
>> +    return env->cpuid_vendor1 == host_cpuid_vendor1 &&
>> +           env->cpuid_vendor2 == host_cpuid_vendor2 &&
>> +           env->cpuid_vendor3 == host_cpuid_vendor3;
>> +}
>> +
>> +static void kvm_init_pmu_info(CPUX86State *env)
>> +{
>> +    /*
>> +     * It is not supported to virtualize AMD PMU registers on Intel
>> +     * processors, nor to virtualize Intel PMU registers on AMD processors.
>> +     */
>> +    if (!is_same_vendor(env)) {
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
>> +     * disable the AMD pmu virtualization.
>> +     *
>> +     * If KVM_CAP_PMU_CAPABILITY is supported, kvm_state->pmu_cap_disabled
>> +     * indicates the KVM has already disabled the pmu virtualization.
>> +     */
>> +    if (kvm_state->pmu_cap_disabled) {
>> +        return;
>> +    }
>> +
> 
> It seems that after these changes the issue with using pmu-cap-disable=true
> together with +pmu on the Intel platform (which Zhao Liu has mentioned
> before) is fixed

Can I assume you were going to paste some code below?

Regardless, I am going to follow Zhao's suggestion and revert to my previous
solution.

Thank you very much for the feedback!

Dongli Zhang

> 
>> +    if (IS_INTEL_CPU(env)) {
>> +        kvm_init_pmu_info_intel(env);
>> +    } else if (IS_AMD_CPU(env)) {
>> +        kvm_init_pmu_info_amd(env);
>> +    }
>> +}
>> +
>>   int kvm_arch_init_vcpu(CPUState *cs)
>>   {
>>       struct {
>> @@ -4027,7 +4101,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>>           kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, env->poll_control_msr);
>>       }
>> -    if (has_pmu_version > 0) {
>> +    if (IS_INTEL_CPU(env) && has_pmu_version > 0) {
>>           if (has_pmu_version > 1) {
>>               /* Stop the counter. */
>>               kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>> @@ -4058,6 +4132,38 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>>                                 env->msr_global_ctrl);
>>           }
>>       }
>> +
>> +    if (IS_AMD_CPU(env) && has_pmu_version > 0) {
>> +        uint32_t sel_base = MSR_K7_EVNTSEL0;
>> +        uint32_t ctr_base = MSR_K7_PERFCTR0;
>> +        /*
>> +         * The address of the next selector or counter register is
>> +         * obtained by incrementing the address of the current selector
>> +         * or counter register by one.
>> +         */
>> +        uint32_t step = 1;
>> +
>> +        /*
>> +         * When PERFCORE is enabled, AMD PMU uses a separate set of
>> +         * addresses for the selector and counter registers.
>> +         * Additionally, the address of the next selector or counter
>> +         * register is determined by incrementing the address of the
>> +         * current register by two.
>> +         */
>> +        if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
>> +            sel_base = MSR_F15H_PERF_CTL0;
>> +            ctr_base = MSR_F15H_PERF_CTR0;
>> +            step = 2;
>> +        }
>> +
>> +        for (i = 0; i < num_pmu_gp_counters; i++) {
>> +            kvm_msr_entry_add(cpu, ctr_base + i * step,
>> +                              env->msr_gp_counters[i]);
>> +            kvm_msr_entry_add(cpu, sel_base + i * step,
>> +                              env->msr_gp_evtsel[i]);
>> +        }
>> +    }
>> +
>>       /*
>>        * Hyper-V partition-wide MSRs: to avoid clearing them on cpu hot-add,
>>        * only sync them to KVM on the first cpu
>> @@ -4503,7 +4609,8 @@ static int kvm_get_msrs(X86CPU *cpu)
>>       if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_POLL_CONTROL)) {
>>           kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 1);
>>       }
>> -    if (has_pmu_version > 0) {
>> +
>> +    if (IS_INTEL_CPU(env) && has_pmu_version > 0) {
>>           if (has_pmu_version > 1) {
>>               kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>>               kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
>> @@ -4519,6 +4626,35 @@ static int kvm_get_msrs(X86CPU *cpu)
>>           }
>>       }
>> +    if (IS_AMD_CPU(env) && has_pmu_version > 0) {
>> +        uint32_t sel_base = MSR_K7_EVNTSEL0;
>> +        uint32_t ctr_base = MSR_K7_PERFCTR0;
>> +        /*
>> +         * The address of the next selector or counter register is
>> +         * obtained by incrementing the address of the current selector
>> +         * or counter register by one.
>> +         */
>> +        uint32_t step = 1;
>> +
>> +        /*
>> +         * When PERFCORE is enabled, AMD PMU uses a separate set of
>> +         * addresses for the selector and counter registers.
>> +         * Additionally, the address of the next selector or counter
>> +         * register is determined by incrementing the address of the
>> +         * current register by two.
>> +         */
>> +        if (num_pmu_gp_counters == AMD64_NUM_COUNTERS_CORE) {
>> +            sel_base = MSR_F15H_PERF_CTL0;
>> +            ctr_base = MSR_F15H_PERF_CTR0;
>> +            step = 2;
>> +        }
>> +
>> +        for (i = 0; i < num_pmu_gp_counters; i++) {
>> +            kvm_msr_entry_add(cpu, ctr_base + i * step, 0);
>> +            kvm_msr_entry_add(cpu, sel_base + i * step, 0);
>> +        }
>> +    }
>> +
>>       if (env->mcg_cap) {
>>           kvm_msr_entry_add(cpu, MSR_MCG_STATUS, 0);
>>           kvm_msr_entry_add(cpu, MSR_MCG_CTL, 0);
>> @@ -4830,6 +4966,20 @@ static int kvm_get_msrs(X86CPU *cpu)
>>           case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
>>               env->msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
>>               break;
>> +        case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL0 + 3:
>> +            env->msr_gp_evtsel[index - MSR_K7_EVNTSEL0] = msrs[i].data;
>> +            break;
>> +        case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR0 + 3:
>> +            env->msr_gp_counters[index - MSR_K7_PERFCTR0] = msrs[i].data;
>> +            break;
>> +        case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb:
>> +            index = index - MSR_F15H_PERF_CTL0;
>> +            if (index & 0x1) {
>> +                env->msr_gp_counters[index] = msrs[i].data;
>> +            } else {
>> +                env->msr_gp_evtsel[index] = msrs[i].data;
>> +            }
>> +            break;
>>           case HV_X64_MSR_HYPERCALL:
>>               env->msr_hv_hypercall = msrs[i].data;
>>               break;
> 