Hi Liang,

On 11/21/22 6:28 AM, Liang Yan wrote:
> A little bit more information from the kernel perspective.
>
> https://urldefense.com/v3/__https://lkml.org/lkml/2022/10/31/476__;!!ACWV5N9M2RV99hQ!NHxyuDAt7ZD4hlsoxPCIUSRsPzaii0kDx2DrS7umBMoKVD8Z6BH7IKvPu8p0EhBBTEqQkCMfTk1xoj-XzT0$
>
> I was thinking of the same idea, but I am not sure if it is expected from a
> bare-metal perspective, since the four legacy MSRs are always there. I am
> also not sure if they are used by other applications.
>
> ~Liang

Can I assume you were replying to this patch ...

[PATCH 2/3] i386: kvm: disable KVM_CAP_PMU_CAPABILITY if "pmu" is disabled

... but not this one, which is about saving/restoring the PMU registers?

[PATCH 3/3] target/i386/kvm: get and put AMD pmu registers

About the legacy registers, here is my understanding, based purely on the
Linux AMD source code.

Line 1450 indicates that the PMU initialization fails only when the family
is < 6. amd_core_pmu_init() at line 1455 then decides whether to use
PERFCTR_CORE.

1445 __init int amd_pmu_init(void)
1446 {
1447         int ret;
1448
1449         /* Performance-monitoring supported from K7 and later: */
1450         if (boot_cpu_data.x86 < 6)
1451                 return -ENODEV;
1452
1453         x86_pmu = amd_pmu;
1454
1455         ret = amd_core_pmu_init();
1456         if (ret)
1457                 return ret;

By default (when PERFCTR_CORE is not supported), the four legacy register
pairs below are used.

1270 static __initconst const struct x86_pmu amd_pmu = {
1271         .name                   = "AMD",
... ...
1279         .eventsel               = MSR_K7_EVNTSEL0,
1280         .perfctr                = MSR_K7_PERFCTR0,

#define MSR_K7_EVNTSEL0                 0xc0010000
#define MSR_K7_PERFCTR0                 0xc0010004
#define MSR_K7_EVNTSEL1                 0xc0010001
#define MSR_K7_PERFCTR1                 0xc0010005
#define MSR_K7_EVNTSEL2                 0xc0010002
#define MSR_K7_PERFCTR2                 0xc0010006
#define MSR_K7_EVNTSEL3                 0xc0010003
#define MSR_K7_PERFCTR3                 0xc0010007

If PERFCTR_CORE is supported, the MSRs below are used instead, as lines 1368
and 1369 show.

1351 static int __init amd_core_pmu_init(void)
1352 {
1353         union cpuid_0x80000022_ebx ebx;
1354         u64 even_ctr_mask = 0ULL;
1355         int i;
1356
1357         if (!boot_cpu_has(X86_FEATURE_PERFCTR_CORE))
1358                 return 0;
1359
1360         /* Avoid calculating the value each time in the NMI handler */
1361         perf_nmi_window = msecs_to_jiffies(100);
1362
1363         /*
1364          * If core performance counter extensions exists, we must use
1365          * MSR_F15H_PERF_CTL/MSR_F15H_PERF_CTR msrs. See also
1366          * amd_pmu_addr_offset().
1367          */
1368         x86_pmu.eventsel        = MSR_F15H_PERF_CTL;
1369         x86_pmu.perfctr         = MSR_F15H_PERF_CTR;
1370         x86_pmu.num_counters    = AMD64_NUM_COUNTERS_CORE;

#define MSR_F15H_PERF_CTL               0xc0010200
#define MSR_F15H_PERF_CTL0              MSR_F15H_PERF_CTL
#define MSR_F15H_PERF_CTL1              (MSR_F15H_PERF_CTL + 2)
#define MSR_F15H_PERF_CTL2              (MSR_F15H_PERF_CTL + 4)
#define MSR_F15H_PERF_CTL3              (MSR_F15H_PERF_CTL + 6)
#define MSR_F15H_PERF_CTL4              (MSR_F15H_PERF_CTL + 8)
#define MSR_F15H_PERF_CTL5              (MSR_F15H_PERF_CTL + 10)

#define MSR_F15H_PERF_CTR               0xc0010201
#define MSR_F15H_PERF_CTR0              MSR_F15H_PERF_CTR
#define MSR_F15H_PERF_CTR1              (MSR_F15H_PERF_CTR + 2)
#define MSR_F15H_PERF_CTR2              (MSR_F15H_PERF_CTR + 4)
#define MSR_F15H_PERF_CTR3              (MSR_F15H_PERF_CTR + 6)
#define MSR_F15H_PERF_CTR4              (MSR_F15H_PERF_CTR + 8)
#define MSR_F15H_PERF_CTR5              (MSR_F15H_PERF_CTR + 10)
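To make the two layouts concrete: the K7 MSRs are contiguous within each
register type, while the F15H MSRs interleave one CTL/CTR pair per counter,
so the stride is 2. A quick sketch of my own (untested; the helper names are
made up and exist in neither the kernel nor QEMU):

#include <stdint.h>

/*
 * Illustrative only. Legacy K7 MSRs are contiguous per register type,
 * while the F15H core-extension MSRs interleave CTL/CTR pairs, hence
 * the stride of 2.
 */
static inline uint32_t amd_pmu_eventsel_msr(unsigned int i, int perfctr_core)
{
    return perfctr_core ? 0xc0010200 + 2 * i  /* MSR_F15H_PERF_CTL0 + 2i */
                        : 0xc0010000 + i;     /* MSR_K7_EVNTSEL0 + i */
}

static inline uint32_t amd_pmu_perfctr_msr(unsigned int i, int perfctr_core)
{
    return perfctr_core ? 0xc0010201 + 2 * i  /* MSR_F15H_PERF_CTR0 + 2i */
                        : 0xc0010004 + i;     /* MSR_K7_PERFCTR0 + i */
}

For instance, amd_pmu_perfctr_msr(1, 1) yields 0xc0010203, matching
MSR_F15H_PERF_CTR1 above, while amd_pmu_perfctr_msr(1, 0) yields 0xc0010005,
matching MSR_K7_PERFCTR1.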
Unfortunately, I do not have access to a machine with family < 6. It might be
interesting to use QEMU to emulate a machine with family < 6.

Thank you very much for the suggestion!

Dongli Zhang

> On 11/19/22 07:29, Dongli Zhang wrote:
>> The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM
>> side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to
>> the KVM side.
>>
>> However, only the Intel gp/fixed/global pmu registers are involved. There
>> is no implementation for the AMD pmu registers. The
>> 'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters'
>> are calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work
>> for AMD. Before AMD PerfMonV2, the number of gp registers is decided based
>> on the CPU version.
>>
>> This patch adds support for the AMD version=1 pmu, to get and put the AMD
>> pmu registers. Otherwise, there will be a bug:
>>
>> 1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
>> is running "perf top". The pmu registers are not disabled gracefully.
>>
>> 2. Although x86_cpu_reset() resets many registers to zero, kvm_put_msrs()
>> does not put the AMD pmu registers to the KVM side. As a result, some pmu
>> events are still enabled at the KVM side.
>>
>> 3. The KVM pmc_speculative_in_use() always returns true, so the events
>> will not be reclaimed. The kvm_pmc->perf_event is still active.
>>
>> 4. After the reboot, the VM kernel reports the error below:
>>
>> [    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS
>> detected, complain to your hardware vendor.
>> [    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources
>> (MSR c0010200 is 530076)
>>
>> 5. In the worst case, the active kvm_pmc->perf_event is still able to
>> inject unknown NMIs randomly into the VM kernel.
>>
>> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
>>
>> The patch fixes the issue by resetting the AMD pmu registers during the
>> reset.
>>
>> Cc: Joe Jin <joe.jin@xxxxxxxxxx>
>> Signed-off-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
>> ---
>>  target/i386/cpu.h     |  5 +++
>>  target/i386/kvm/kvm.c | 83 +++++++++++++++++++++++++++++++++++++++++--
>>  2 files changed, 86 insertions(+), 2 deletions(-)
>>
>> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
>> index d4bc19577a..4cf0b98817 100644
>> --- a/target/i386/cpu.h
>> +++ b/target/i386/cpu.h
>> @@ -468,6 +468,11 @@ typedef enum X86Seg {
>>  #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
>>  #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
>>
>> +#define MSR_K7_EVNTSEL0                 0xc0010000
>> +#define MSR_K7_PERFCTR0                 0xc0010004
>> +#define MSR_F15H_PERF_CTL0              0xc0010200
>> +#define MSR_F15H_PERF_CTR0              0xc0010201
>> +
>>  #define MSR_MC0_CTL                     0x400
>>  #define MSR_MC0_STATUS                  0x401
>>  #define MSR_MC0_ADDR                    0x402
>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>> index 0b1226ff7f..023fcbce48 100644
>> --- a/target/i386/kvm/kvm.c
>> +++ b/target/i386/kvm/kvm.c
>> @@ -2005,6 +2005,32 @@ int kvm_arch_init_vcpu(CPUState *cs)
>>          }
>>      }
>>
>> +    if (IS_AMD_CPU(env)) {
>> +        int64_t family;
>> +
>> +        family = (env->cpuid_version >> 8) & 0xf;
>> +        if (family == 0xf) {
>> +            family += (env->cpuid_version >> 20) & 0xff;
>> +        }
>> +
>> +        /*
>> +         * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
>> +         * disable the AMD pmu virtualization.
>> +         *
>> +         * If KVM_CAP_PMU_CAPABILITY is supported, "!has_pmu_cap" indicates
>> +         * the KVM side has already disabled the pmu virtualization.
>> +         */
>> +        if (family >= 6 && (!has_pmu_cap || cpu->enable_pmu)) {
>> +            has_architectural_pmu_version = 1;
>> +
>> +            if (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) {
>> +                num_architectural_pmu_gp_counters = 6;
>> +            } else {
>> +                num_architectural_pmu_gp_counters = 4;
>> +            }
>> +        }
>> +    }
>> +
>>      cpu_x86_cpuid(env, 0x80000000, 0, &limit, &unused, &unused, &unused);
>>      for (i = 0x80000000; i <= limit; i++) {
>> @@ -3326,7 +3352,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>>              kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL,
>>                                env->poll_control_msr);
>>          }
>> -        if (has_architectural_pmu_version > 0) {
>> +        if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
>>              if (has_architectural_pmu_version > 1) {
>>                  /* Stop the counter. */
>>                  kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>> @@ -3357,6 +3383,26 @@
>>                                    env->msr_global_ctrl);
>>              }
>>          }
>> +
>> +        if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
>> +            uint32_t sel_base = MSR_K7_EVNTSEL0;
>> +            uint32_t ctr_base = MSR_K7_PERFCTR0;
>> +            uint32_t step = 1;
>> +
>> +            if (num_architectural_pmu_gp_counters == 6) {
>> +                sel_base = MSR_F15H_PERF_CTL0;
>> +                ctr_base = MSR_F15H_PERF_CTR0;
>> +                step = 2;
>> +            }
>> +
>> +            for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
>> +                kvm_msr_entry_add(cpu, ctr_base + i * step,
>> +                                  env->msr_gp_counters[i]);
>> +                kvm_msr_entry_add(cpu, sel_base + i * step,
>> +                                  env->msr_gp_evtsel[i]);
>> +            }
>> +        }
>> +
>>          /*
>>           * Hyper-V partition-wide MSRs: to avoid clearing them on cpu hot-add,
>>           * only sync them to KVM on the first cpu
>> @@ -3817,7 +3863,7 @@ static int kvm_get_msrs(X86CPU *cpu)
>>      if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_POLL_CONTROL)) {
>>          kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 1);
>>      }
>> -    if (has_architectural_pmu_version > 0) {
>> +    if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
>>          if (has_architectural_pmu_version > 1) {
>>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
>> @@ -3833,6 +3879,25 @@
>>          }
>>      }
>>
>> +    if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
>> +        uint32_t sel_base = MSR_K7_EVNTSEL0;
>> +        uint32_t ctr_base = MSR_K7_PERFCTR0;
>> +        uint32_t step = 1;
>> +
>> +        if (num_architectural_pmu_gp_counters == 6) {
>> +            sel_base = MSR_F15H_PERF_CTL0;
>> +            ctr_base = MSR_F15H_PERF_CTR0;
>> +            step = 2;
>> +        }
>> +
>> +        for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
>> +            kvm_msr_entry_add(cpu, ctr_base + i * step,
>> +                              env->msr_gp_counters[i]);
>> +            kvm_msr_entry_add(cpu, sel_base + i * step,
>> +                              env->msr_gp_evtsel[i]);
>> +        }
>> +    }
>> +
>>      if (env->mcg_cap) {
>>          kvm_msr_entry_add(cpu, MSR_MCG_STATUS, 0);
>>          kvm_msr_entry_add(cpu, MSR_MCG_CTL, 0);
>> @@ -4118,6 +4183,20 @@ static int kvm_get_msrs(X86CPU *cpu)
>>          case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
>>              env->msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
>>              break;
>> +        case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL0 + 3:
>> +            env->msr_gp_evtsel[index - MSR_K7_EVNTSEL0] = msrs[i].data;
>> +            break;
>> +        case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR0 + 3:
>> +            env->msr_gp_counters[index - MSR_K7_PERFCTR0] = msrs[i].data;
>> +            break;
>> +        case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb:
>> +            index = index - MSR_F15H_PERF_CTL0;
>> +            if (index & 0x1) {
>> +                env->msr_gp_counters[index] = msrs[i].data;
>> +            } else {
>> +                env->msr_gp_evtsel[index] = msrs[i].data;
>> +            }
>> +            break;
>>          case HV_X64_MSR_HYPERCALL:
>>              env->msr_hv_hypercall = msrs[i].data;
>>              break;
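
One note on the MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb decode in the
last hunk: since the CTL/CTR MSRs interleave, the low bit of the offset tells
eventsel from counter, and dividing the offset by 2 recovers the counter
number i used with "ctr_base/sel_base + i * step" in the put path. A small
standalone sketch of my own (untested, not part of the patch) of that
round-trip:

#include <assert.h>
#include <stdint.h>

#define MSR_F15H_PERF_CTL0 0xc0010200
#define MSR_F15H_PERF_CTR0 0xc0010201

/*
 * Map an interleaved F15H perf MSR back to (counter number, kind).
 * Illustrative only: this helper is mine, not from QEMU or the kernel.
 */
static void f15h_msr_decode(uint32_t msr, unsigned *counter, int *is_ctr)
{
    uint32_t offset = msr - MSR_F15H_PERF_CTL0;   /* 0x0 .. 0xb */

    *is_ctr = offset & 0x1;   /* odd offset: PERF_CTRn, even: PERF_CTLn */
    *counter = offset / 2;    /* inverse of ctr_base/sel_base + i * 2 */
}

int main(void)
{
    unsigned n;
    int is_ctr;

    f15h_msr_decode(MSR_F15H_PERF_CTR0 + 2 * 3, &n, &is_ctr);
    assert(n == 3 && is_ctr);    /* MSR_F15H_PERF_CTR3 */

    f15h_msr_decode(MSR_F15H_PERF_CTL0 + 2 * 5, &n, &is_ctr);
    assert(n == 5 && !is_ctr);   /* MSR_F15H_PERF_CTL5 */
    return 0;
}

With that mapping, the value read back from MSR_F15H_PERF_CTRn lands in
msr_gp_counters[n], symmetric with what the put path writes out; the hunk
above indexes the arrays with the raw offset instead, which may be worth
double-checking against the put path.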