On Fri, Apr 12, 2024 at 9:25 PM Mi, Dapeng <dapeng1.mi@xxxxxxxxxxxxxxx> wrote:
>
>
> On 4/13/2024 11:34 AM, Mingwei Zhang wrote:
> > On Sat, Apr 13, 2024, Mi, Dapeng wrote:
> >> On 4/12/2024 5:44 AM, Sean Christopherson wrote:
> >>> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> >>>> From: Dapeng Mi <dapeng1.mi@xxxxxxxxxxxxxxx>
> >>>>
> >>>> Implement the save/restore of PMU state for the passthrough PMU on Intel.
> >>>> In passthrough mode, KVM owns the PMU HW exclusively while control flow is
> >>>> within the scope of the passthrough PMU. Thus, KVM needs to save the host
> >>>> PMU state and take full ownership of the PMU HW. Conversely, the host
> >>>> regains ownership of the PMU HW from KVM when control flow leaves the
> >>>> scope of the passthrough PMU.
> >>>>
> >>>> Implement PMU context switches for Intel CPUs and opportunistically use
> >>>> rdpmcl() instead of rdmsrl() when reading counters, since the former has
> >>>> lower latency on Intel CPUs.
> >>>>
> >>>> Co-developed-by: Mingwei Zhang <mizhang@xxxxxxxxxx>
> >>>> Signed-off-by: Mingwei Zhang <mizhang@xxxxxxxxxx>
> >>>> Signed-off-by: Dapeng Mi <dapeng1.mi@xxxxxxxxxxxxxxx>
> >>>> ---
> >>>>  arch/x86/kvm/vmx/pmu_intel.c | 73 ++++++++++++++++++++++++++++++++++++
> >>>>  1 file changed, 73 insertions(+)
> >>>>
> >>>> diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
> >>>> index 0d58fe7d243e..f79bebe7093d 100644
> >>>> --- a/arch/x86/kvm/vmx/pmu_intel.c
> >>>> +++ b/arch/x86/kvm/vmx/pmu_intel.c
> >>>> @@ -823,10 +823,83 @@ void intel_passthrough_pmu_msrs(struct kvm_vcpu *vcpu)
> >>>>  static void intel_save_pmu_context(struct kvm_vcpu *vcpu)
> >>> I would prefer there be a "guest" in there somewhere, e.g. intel_save_guest_pmu_context().
> >> Yeah. It looks clearer.
> >>>>  {
> >>>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> >>>> +	struct kvm_pmc *pmc;
> >>>> +	u32 i;
> >>>> +
> >>>> +	if (pmu->version != 2) {
> >>>> +		pr_warn("only PerfMon v2 is supported for passthrough PMU");
> >>>> +		return;
> >>>> +	}
> >>>> +
> >>>> +	/* Global ctrl register is already saved at VM-exit. */
> >>>> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, pmu->global_status);
> >>>> +	/* Clear hardware MSR_CORE_PERF_GLOBAL_STATUS MSR, if non-zero. */
> >>>> +	if (pmu->global_status)
> >>>> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, pmu->global_status);
> >>>> +
> >>>> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> >>>> +		pmc = &pmu->gp_counters[i];
> >>>> +		rdpmcl(i, pmc->counter);
> >>>> +		rdmsrl(i + MSR_ARCH_PERFMON_EVENTSEL0, pmc->eventsel);
> >>>> +		/*
> >>>> +		 * Clear hardware PERFMON_EVENTSELx and its counter to avoid
> >>>> +		 * leakage and to prevent this guest GP counter from being
> >>>> +		 * accidentally enabled when the host enables global ctrl.
> >>>> +		 */
> >>>> +		if (pmc->eventsel)
> >>>> +			wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
> >>>> +		if (pmc->counter)
> >>>> +			wrmsrl(MSR_IA32_PMC0 + i, 0);
> >>> This doesn't make much sense.  The kernel already has full access to the guest,
> >>> I don't see what is gained by zeroing out the MSRs just to hide them from perf.
> >> It's necessary to clear the EVENTSELx MSRs for both GP and fixed counters.
> >> Consider this case: the guest uses GP counter 2, but the host doesn't. If the
> >> EVENTSEL2 MSR is not cleared here, GP counter 2 would later be enabled
> >> unexpectedly on the host, since host perf always enables all valid bits in the
> >> PERF_GLOBAL_CTRL MSR. That would cause issues.
> >>
> >> Yeah, the clearing of the PMCx MSRs should be unnecessary.
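
If the PMCx clears do end up being dropped, just to make sure we are all
looking at the same thing, here is an untested sketch of what the GP-counter
save loop could look like with only the event selector clearing kept (field
and MSR names taken from this patch; fixed counters would follow the same
pattern):

	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
		pmc = &pmu->gp_counters[i];
		/* rdpmc is cheaper than rdmsr for reading the counter value. */
		rdpmcl(i, pmc->counter);
		rdmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
		/*
		 * Clear only the event selector so a stale guest selector
		 * cannot start counting once host perf re-enables its bit in
		 * PERF_GLOBAL_CTRL; the counter value itself is left alone.
		 */
		if (pmc->eventsel)
			wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, 0);
	}
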
> >>
> > Why is clearing the PMCx MSRs unnecessary? Do we want to leak counter
> > values to the host? NO. Not in cloud usage.
>
> No, this place clears the guest counter value, not the host counter
> value. The host always has a way to see the guest value in a normal VM
> if it wants to. I don't see the necessity; it's just overkill and
> introduces the extra overhead of writing MSRs.
>
I am curious how the perf subsystem solves this problem. Does the perf
subsystem on the host only scrub the selector, but not the counter value,
when doing a context switch?

>
> >
> > Please make changes to this patch with **extreme** caution.
> >
> > According to our past experience, if there is a bug somewhere,
> > there is a catch here (normally).
> >
> > Thanks.
> > -Mingwei
> >>> Similarly, if perf enables a counter via PERF_GLOBAL_CTRL without first restoring
> >>> the event selector, we gots problems.
> >>>
> >>> Same thing for the fixed counters below.  Can't this just be?
> >>>
> >>> 	for (i = 0; i < pmu->nr_arch_gp_counters; i++)
> >>> 		rdpmcl(i, pmu->gp_counters[i].counter);
> >>>
> >>> 	for (i = 0; i < pmu->nr_arch_fixed_counters; i++)
> >>> 		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i,
> >>> 		       pmu->fixed_counters[i].counter);
> >>>
> >>>> +	}
> >>>> +
> >>>> +	rdmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> >>>> +	/*
> >>>> +	 * Clear hardware FIXED_CTR_CTRL MSR to avoid information leakage and
> >>>> +	 * to prevent these guest fixed counters from being accidentally
> >>>> +	 * enabled when the host enables global ctrl.
> >>>> +	 */
> >>>> +	if (pmu->fixed_ctr_ctrl)
> >>>> +		wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, 0);
> >>>> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> >>>> +		pmc = &pmu->fixed_counters[i];
> >>>> +		rdpmcl(INTEL_PMC_FIXED_RDPMC_BASE | i, pmc->counter);
> >>>> +		if (pmc->counter)
> >>>> +			wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, 0);
> >>>> +	}
> >>>>  }
> >>>>  static void intel_restore_pmu_context(struct kvm_vcpu *vcpu)
> >>>>  {
> >>>> +	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);
> >>>> +	struct kvm_pmc *pmc;
> >>>> +	u64 global_status;
> >>>> +	int i;
> >>>> +
> >>>> +	if (pmu->version != 2) {
> >>>> +		pr_warn("only PerfMon v2 is supported for passthrough PMU");
> >>>> +		return;
> >>>> +	}
> >>>> +
> >>>> +	/* Clear host global_ctrl and global_status MSR if non-zero. */
> >>>> +	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);
> >>> Why?  PERF_GLOBAL_CTRL will be auto-loaded at VM-Enter, why do it now?
> >> As the previous comments say, host perf always enables all counters in
> >> PERF_GLOBAL_CTRL by default. The reason to clear PERF_GLOBAL_CTRL here is to
> >> ensure all counters are in a disabled state, so the later counter manipulation
> >> (writing MSRs) won't cause any race condition or unexpected behavior on the HW.
> >>
> >>
> >>>> +	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
> >>>> +	if (global_status)
> >>>> +		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);
> >>> This seems especially silly, isn't the full MSR being written below?  Or am I
> >>> misunderstanding how these things work?
> >> I think Jim's comment has already explained why we need to do this.
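
On the GLOBAL_STATUS handling, my understanding of the intended ordering
(a sketch only, with descriptive comments added; MSR names as used in this
patch): GLOBAL_STATUS itself is not directly writable, so any overflow bits
left set by the host have to be cleared through GLOBAL_OVF_CTRL before the
guest's saved bits are re-asserted through GLOBAL_STATUS_SET, all while
GLOBAL_CTRL is zero so that no counter is running in between:

	/* Stop all counters before touching the rest of the PMU state. */
	wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0);

	/* Clear any overflow bits left behind by the host... */
	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, global_status);
	if (global_status)
		wrmsrl(MSR_CORE_PERF_GLOBAL_OVF_CTRL, global_status);

	/* ...then re-assert the bits the guest saw before VM-exit. */
	wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
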
> >>
> >>>> +	wrmsrl(MSR_CORE_PERF_GLOBAL_STATUS_SET, pmu->global_status);
> >>>> +
> >>>> +	for (i = 0; i < pmu->nr_arch_gp_counters; i++) {
> >>>> +		pmc = &pmu->gp_counters[i];
> >>>> +		wrmsrl(MSR_IA32_PMC0 + i, pmc->counter);
> >>>> +		wrmsrl(MSR_ARCH_PERFMON_EVENTSEL0 + i, pmc->eventsel);
> >>>> +	}
> >>>> +
> >>>> +	wrmsrl(MSR_CORE_PERF_FIXED_CTR_CTRL, pmu->fixed_ctr_ctrl);
> >>>> +	for (i = 0; i < pmu->nr_arch_fixed_counters; i++) {
> >>>> +		pmc = &pmu->fixed_counters[i];
> >>>> +		wrmsrl(MSR_CORE_PERF_FIXED_CTR0 + i, pmc->counter);
> >>>> +	}
> >>>>  }
> >>>>  struct kvm_pmu_ops intel_pmu_ops __initdata = {
> >>>> --
> >>>> 2.34.1
> >>>>