Hi Liang,

On 11/21/22 6:28 AM, Liang Yan wrote:
> A little bit more information from the kernel perspective.
>
> https://urldefense.com/v3/__https://lkml.org/lkml/2022/10/31/476__;!!ACWV5N9M2RV99hQ!NHxyuDAt7ZD4hlsoxPCIUSRsPzaii0kDx2DrS7umBMoKVD8Z6BH7IKvPu8p0EhBBTEqQkCMfTk1xoj-XzT0$
>
> I was thinking of the same idea, but I am not sure if it is expected from a
> bare-metal perspective, since the four legacy MSRs are always there. I am
> also not sure if they are used by other applications.
>
> ~Liang

Can I assume you were replying to this patch ...

[PATCH 2/3] i386: kvm: disable KVM_CAP_PMU_CAPABILITY if "pmu" is disabled

... but not this one, which is about saving/restoring the PMU registers?

[PATCH 3/3] target/i386/kvm: get and put AMD pmu registers

About the legacy registers, here is my understanding, based purely on the
Linux AMD source code.

Line 1450 indicates that the PMU initialization fails only when the family
is < 6. amd_core_pmu_init() at line 1455 then decides whether to use
PERFCTR_CORE.

1445 __init int amd_pmu_init(void)
1446 {
1447         int ret;
1448
1449         /* Performance-monitoring supported from K7 and later: */
1450         if (boot_cpu_data.x86 < 6)
1451                 return -ENODEV;
1452
1453         x86_pmu = amd_pmu;
1454
1455         ret = amd_core_pmu_init();
1456         if (ret)
1457                 return ret;

By default (when PERFCTR_CORE is not supported), the four legacy register
pairs below are used.

1270 static __initconst const struct x86_pmu amd_pmu = {
1271         .name                   = "AMD",
... ...
1279         .eventsel               = MSR_K7_EVNTSEL0,
1280         .perfctr                = MSR_K7_PERFCTR0,

#define MSR_K7_EVNTSEL0                 0xc0010000
#define MSR_K7_PERFCTR0                 0xc0010004
#define MSR_K7_EVNTSEL1                 0xc0010001
#define MSR_K7_PERFCTR1                 0xc0010005
#define MSR_K7_EVNTSEL2                 0xc0010002
#define MSR_K7_PERFCTR2                 0xc0010006
#define MSR_K7_EVNTSEL3                 0xc0010003
#define MSR_K7_PERFCTR3                 0xc0010007

If PERFCTR_CORE is supported, the MSRs below are used instead, as lines 1368
and 1369 show.

1351 static int __init amd_core_pmu_init(void)
1352 {
1353         union cpuid_0x80000022_ebx ebx;
1354         u64 even_ctr_mask = 0ULL;
1355         int i;
1356
1357         if (!boot_cpu_has(X86_FEATURE_PERFCTR_CORE))
1358                 return 0;
1359
1360         /* Avoid calculating the value each time in the NMI handler */
1361         perf_nmi_window = msecs_to_jiffies(100);
1362
1363         /*
1364          * If core performance counter extensions exists, we must use
1365          * MSR_F15H_PERF_CTL/MSR_F15H_PERF_CTR msrs. See also
1366          * amd_pmu_addr_offset().
1367          */
1368         x86_pmu.eventsel        = MSR_F15H_PERF_CTL;
1369         x86_pmu.perfctr         = MSR_F15H_PERF_CTR;
1370         x86_pmu.num_counters    = AMD64_NUM_COUNTERS_CORE;

#define MSR_F15H_PERF_CTL               0xc0010200
#define MSR_F15H_PERF_CTL0              MSR_F15H_PERF_CTL
#define MSR_F15H_PERF_CTL1              (MSR_F15H_PERF_CTL + 2)
#define MSR_F15H_PERF_CTL2              (MSR_F15H_PERF_CTL + 4)
#define MSR_F15H_PERF_CTL3              (MSR_F15H_PERF_CTL + 6)
#define MSR_F15H_PERF_CTL4              (MSR_F15H_PERF_CTL + 8)
#define MSR_F15H_PERF_CTL5              (MSR_F15H_PERF_CTL + 10)

#define MSR_F15H_PERF_CTR               0xc0010201
#define MSR_F15H_PERF_CTR0              MSR_F15H_PERF_CTR
#define MSR_F15H_PERF_CTR1              (MSR_F15H_PERF_CTR + 2)
#define MSR_F15H_PERF_CTR2              (MSR_F15H_PERF_CTR + 4)
#define MSR_F15H_PERF_CTR3              (MSR_F15H_PERF_CTR + 6)
#define MSR_F15H_PERF_CTR4              (MSR_F15H_PERF_CTR + 8)
#define MSR_F15H_PERF_CTR5              (MSR_F15H_PERF_CTR + 10)
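To make the two layouts concrete: the K7 MSRs are contiguous within each
register type, while the F15H MSRs interleave one CTL/CTR pair per counter,
so the stride is 2. A quick sketch of my own (untested; the helper names are
made up and exist in neither the kernel nor QEMU):

#include <stdint.h>

/*
 * Illustrative only. Legacy K7 MSRs are contiguous per register type,
 * while the F15H core-extension MSRs interleave CTL/CTR pairs, hence
 * the stride of 2.
 */
static inline uint32_t amd_pmu_eventsel_msr(unsigned int i, int perfctr_core)
{
    return perfctr_core ? 0xc0010200 + 2 * i  /* MSR_F15H_PERF_CTL0 + 2i */
                        : 0xc0010000 + i;     /* MSR_K7_EVNTSEL0 + i */
}

static inline uint32_t amd_pmu_perfctr_msr(unsigned int i, int perfctr_core)
{
    return perfctr_core ? 0xc0010201 + 2 * i  /* MSR_F15H_PERF_CTR0 + 2i */
                        : 0xc0010004 + i;     /* MSR_K7_PERFCTR0 + i */
}

For instance, amd_pmu_perfctr_msr(1, 1) yields 0xc0010203, matching
MSR_F15H_PERF_CTR1 above, while amd_pmu_perfctr_msr(1, 0) yields 0xc0010005,
matching MSR_K7_PERFCTR1.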
Unfortunately, I do not have access to a machine with family < 6. It might be
interesting to use QEMU to emulate a machine with family < 6.

Thank you very much for the suggestion!

Dongli Zhang

> On 11/19/22 07:29, Dongli Zhang wrote:
>> The QEMU side calls kvm_get_msrs() to save the pmu registers from the KVM
>> side to QEMU, and calls kvm_put_msrs() to store the pmu registers back to
>> the KVM side.
>>
>> However, only the Intel gp/fixed/global pmu registers are involved. There
>> is no implementation for the AMD pmu registers. The
>> 'has_architectural_pmu_version' and 'num_architectural_pmu_gp_counters'
>> are calculated at kvm_arch_init_vcpu() via cpuid(0xa). This does not work
>> for AMD. Before AMD PerfMonV2, the number of gp registers is decided based
>> on the CPU version.
>>
>> This patch adds support for the AMD version=1 pmu, to get and put the AMD
>> pmu registers. Otherwise, there will be a bug:
>>
>> 1. The VM resets (e.g., via QEMU system_reset or VM kdump/kexec) while it
>> is running "perf top". The pmu registers are not disabled gracefully.
>>
>> 2. Although x86_cpu_reset() resets many registers to zero, kvm_put_msrs()
>> does not put the AMD pmu registers to the KVM side. As a result, some pmu
>> events are still enabled at the KVM side.
>>
>> 3. The KVM pmc_speculative_in_use() always returns true, so the events
>> will not be reclaimed. The kvm_pmc->perf_event is still active.
>>
>> 4. After the reboot, the VM kernel reports the error below:
>>
>> [    0.092011] Performance Events: Fam17h+ core perfctr, Broken BIOS
>> detected, complain to your hardware vendor.
>> [    0.092023] [Firmware Bug]: the BIOS has corrupted hw-PMU resources
>> (MSR c0010200 is 530076)
>>
>> 5. In the worst case, the active kvm_pmc->perf_event is still able to
>> inject unknown NMIs randomly into the VM kernel.
>>
>> [...] Uhhuh. NMI received for unknown reason 30 on CPU 0.
>>
>> The patch fixes the issue by resetting the AMD pmu registers during the
>> reset.
>>
>> Cc: Joe Jin <joe.jin@xxxxxxxxxx>
>> Signed-off-by: Dongli Zhang <dongli.zhang@xxxxxxxxxx>
>> ---
>>  target/i386/cpu.h     |  5 +++
>>  target/i386/kvm/kvm.c | 83 +++++++++++++++++++++++++++++++++++++++++--
>>  2 files changed, 86 insertions(+), 2 deletions(-)
>>
>> diff --git a/target/i386/cpu.h b/target/i386/cpu.h
>> index d4bc19577a..4cf0b98817 100644
>> --- a/target/i386/cpu.h
>> +++ b/target/i386/cpu.h
>> @@ -468,6 +468,11 @@ typedef enum X86Seg {
>>  #define MSR_CORE_PERF_GLOBAL_CTRL       0x38f
>>  #define MSR_CORE_PERF_GLOBAL_OVF_CTRL   0x390
>>
>> +#define MSR_K7_EVNTSEL0                 0xc0010000
>> +#define MSR_K7_PERFCTR0                 0xc0010004
>> +#define MSR_F15H_PERF_CTL0              0xc0010200
>> +#define MSR_F15H_PERF_CTR0              0xc0010201
>> +
>>  #define MSR_MC0_CTL                     0x400
>>  #define MSR_MC0_STATUS                  0x401
>>  #define MSR_MC0_ADDR                    0x402
>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>> index 0b1226ff7f..023fcbce48 100644
>> --- a/target/i386/kvm/kvm.c
>> +++ b/target/i386/kvm/kvm.c
>> @@ -2005,6 +2005,32 @@ int kvm_arch_init_vcpu(CPUState *cs)
>>          }
>>      }
>>
>> +    if (IS_AMD_CPU(env)) {
>> +        int64_t family;
>> +
>> +        family = (env->cpuid_version >> 8) & 0xf;
>> +        if (family == 0xf) {
>> +            family += (env->cpuid_version >> 20) & 0xff;
>> +        }
>> +
>> +        /*
>> +         * If KVM_CAP_PMU_CAPABILITY is not supported, there is no way to
>> +         * disable the AMD pmu virtualization.
>> +         *
>> +         * If KVM_CAP_PMU_CAPABILITY is supported, "!has_pmu_cap" indicates
>> +         * the KVM side has already disabled the pmu virtualization.
>> +         */
>> +        if (family >= 6 && (!has_pmu_cap || cpu->enable_pmu)) {
>> +            has_architectural_pmu_version = 1;
>> +
>> +            if (env->features[FEAT_8000_0001_ECX] & CPUID_EXT3_PERFCORE) {
>> +                num_architectural_pmu_gp_counters = 6;
>> +            } else {
>> +                num_architectural_pmu_gp_counters = 4;
>> +            }
>> +        }
>> +    }
>> +
>>      cpu_x86_cpuid(env, 0x80000000, 0, &limit, &unused, &unused, &unused);
>>      for (i = 0x80000000; i <= limit; i++) {
>> @@ -3326,7 +3352,7 @@ static int kvm_put_msrs(X86CPU *cpu, int level)
>>              kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL,
>>                                env->poll_control_msr);
>>          }
>> -        if (has_architectural_pmu_version > 0) {
>> +        if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
>>              if (has_architectural_pmu_version > 1) {
>>                  /* Stop the counter. */
>>                  kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>> @@ -3357,6 +3383,26 @@
>>                                    env->msr_global_ctrl);
>>              }
>>          }
>> +
>> +        if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
>> +            uint32_t sel_base = MSR_K7_EVNTSEL0;
>> +            uint32_t ctr_base = MSR_K7_PERFCTR0;
>> +            uint32_t step = 1;
>> +
>> +            if (num_architectural_pmu_gp_counters == 6) {
>> +                sel_base = MSR_F15H_PERF_CTL0;
>> +                ctr_base = MSR_F15H_PERF_CTR0;
>> +                step = 2;
>> +            }
>> +
>> +            for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
>> +                kvm_msr_entry_add(cpu, ctr_base + i * step,
>> +                                  env->msr_gp_counters[i]);
>> +                kvm_msr_entry_add(cpu, sel_base + i * step,
>> +                                  env->msr_gp_evtsel[i]);
>> +            }
>> +        }
>> +
>>          /*
>>           * Hyper-V partition-wide MSRs: to avoid clearing them on cpu hot-add,
>>           * only sync them to KVM on the first cpu
>> @@ -3817,7 +3863,7 @@ static int kvm_get_msrs(X86CPU *cpu)
>>      if (env->features[FEAT_KVM] & (1 << KVM_FEATURE_POLL_CONTROL)) {
>>          kvm_msr_entry_add(cpu, MSR_KVM_POLL_CONTROL, 1);
>>      }
>> -    if (has_architectural_pmu_version > 0) {
>> +    if (has_architectural_pmu_version > 0 && IS_INTEL_CPU(env)) {
>>          if (has_architectural_pmu_version > 1) {
>>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
>>              kvm_msr_entry_add(cpu, MSR_CORE_PERF_GLOBAL_CTRL, 0);
>> @@ -3833,6 +3879,25 @@
>>          }
>>      }
>>
>> +    if (has_architectural_pmu_version > 0 && IS_AMD_CPU(env)) {
>> +        uint32_t sel_base = MSR_K7_EVNTSEL0;
>> +        uint32_t ctr_base = MSR_K7_PERFCTR0;
>> +        uint32_t step = 1;
>> +
>> +        if (num_architectural_pmu_gp_counters == 6) {
>> +            sel_base = MSR_F15H_PERF_CTL0;
>> +            ctr_base = MSR_F15H_PERF_CTR0;
>> +            step = 2;
>> +        }
>> +
>> +        for (i = 0; i < num_architectural_pmu_gp_counters; i++) {
>> +            kvm_msr_entry_add(cpu, ctr_base + i * step,
>> +                              env->msr_gp_counters[i]);
>> +            kvm_msr_entry_add(cpu, sel_base + i * step,
>> +                              env->msr_gp_evtsel[i]);
>> +        }
>> +    }
>> +
>>      if (env->mcg_cap) {
>>          kvm_msr_entry_add(cpu, MSR_MCG_STATUS, 0);
>>          kvm_msr_entry_add(cpu, MSR_MCG_CTL, 0);
>> @@ -4118,6 +4183,20 @@ static int kvm_get_msrs(X86CPU *cpu)
>>          case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL0 + MAX_GP_COUNTERS - 1:
>>              env->msr_gp_evtsel[index - MSR_P6_EVNTSEL0] = msrs[i].data;
>>              break;
>> +        case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL0 + 3:
>> +            env->msr_gp_evtsel[index - MSR_K7_EVNTSEL0] = msrs[i].data;
>> +            break;
>> +        case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR0 + 3:
>> +            env->msr_gp_counters[index - MSR_K7_PERFCTR0] = msrs[i].data;
>> +            break;
>> +        case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb:
>> +            index = index - MSR_F15H_PERF_CTL0;
>> +            if (index & 0x1) {
>> +                env->msr_gp_counters[index] = msrs[i].data;
>> +            } else {
>> +                env->msr_gp_evtsel[index] = msrs[i].data;
>> +            }
>> +            break;
>>          case HV_X64_MSR_HYPERCALL:
>>              env->msr_hv_hypercall = msrs[i].data;
>>              break;
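
One note on the MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTL0 + 0xb decode in the
last hunk: since the CTL/CTR MSRs interleave, the low bit of the offset tells
eventsel from counter, and dividing the offset by 2 recovers the counter
number i used with "ctr_base/sel_base + i * step" in the put path. A small
standalone sketch of my own (untested, not part of the patch) of that
round-trip:

#include <assert.h>
#include <stdint.h>

#define MSR_F15H_PERF_CTL0 0xc0010200
#define MSR_F15H_PERF_CTR0 0xc0010201

/*
 * Map an interleaved F15H perf MSR back to (counter number, kind).
 * Illustrative only: this helper is mine, not from QEMU or the kernel.
 */
static void f15h_msr_decode(uint32_t msr, unsigned *counter, int *is_ctr)
{
    uint32_t offset = msr - MSR_F15H_PERF_CTL0;   /* 0x0 .. 0xb */

    *is_ctr = offset & 0x1;   /* odd offset: PERF_CTRn, even: PERF_CTLn */
    *counter = offset / 2;    /* inverse of ctr_base/sel_base + i * 2 */
}

int main(void)
{
    unsigned n;
    int is_ctr;

    f15h_msr_decode(MSR_F15H_PERF_CTR0 + 2 * 3, &n, &is_ctr);
    assert(n == 3 && is_ctr);    /* MSR_F15H_PERF_CTR3 */

    f15h_msr_decode(MSR_F15H_PERF_CTL0 + 2 * 5, &n, &is_ctr);
    assert(n == 5 && !is_ctr);   /* MSR_F15H_PERF_CTL5 */
    return 0;
}

With that mapping, the value read back from MSR_F15H_PERF_CTRn lands in
msr_gp_counters[n], symmetric with what the put path writes out; the hunk
above indexes the arrays with the raw offset instead, which may be worth
double-checking against the put path.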