On Tue, Nov 19, 2024, Dapeng Mi wrote:
> 
> On 11/19/2024 9:46 AM, Sean Christopherson wrote:
> > On Fri, Oct 25, 2024, Dapeng Mi wrote:
> >> On 10/25/2024 4:26 AM, Chen, Zide wrote:
> >>> On 7/31/2024 9:58 PM, Mingwei Zhang wrote:
> >>> 
> >>>> @@ -7295,6 +7299,46 @@ static void atomic_switch_perf_msrs(struct vcpu_vmx *vmx)
> >>>>  					msrs[i].host, false);
> >>>>  }
> >>>>  
> >>>> +static void save_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
> >>>> +{
> >>>> +	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
> >>>> +	int i;
> >>>> +
> >>>> +	if (vm_exit_controls_get(vmx) & VM_EXIT_SAVE_IA32_PERF_GLOBAL_CTRL) {
> >>>> +		pmu->global_ctrl = vmcs_read64(GUEST_IA32_PERF_GLOBAL_CTRL);
> >>> As commented in patch 26, compared with the MSR auto save/store area
> >>> approach, the exec control way needs one relatively expensive VMCS read
> >>> on every VM exit.
> >> Anyway, let us do an evaluation and let the data speak.
> > No, drop the unconditional VMREAD and VMWRITE, one way or another.  No benchmark
> > will notice ~50 extra cycles, but if we write poor code for every feature, those
> > 50 cycles per feature add up.
> > 
> > Furthermore, checking to see if the CPU supports the load/save VMCS controls at
> > runtime is beyond ridiculous.  The mediated PMU requires ***VERSION 4***; if a
> > CPU supports PMU version 4 and doesn't support the VMCS controls, KVM should
> > yell and disable the passthrough PMU.  The amount of complexity added here to
> > support a CPU that should never exist is silly.
> > 
> >>>> +static void load_perf_global_ctrl_in_passthrough_pmu(struct vcpu_vmx *vmx)
> >>>> +{
> >>>> +	struct kvm_pmu *pmu = vcpu_to_pmu(&vmx->vcpu);
> >>>> +	u64 global_ctrl = pmu->global_ctrl;
> >>>> +	int i;
> >>>> +
> >>>> +	if (vm_entry_controls_get(vmx) & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL) {
> >>>> +		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, global_ctrl);
> >>> ditto.
> >>> 
> >>> We may optimize it by introducing a new flag pmu->global_ctrl_dirty and
> >>> updating GUEST_IA32_PERF_GLOBAL_CTRL only when needed. But this makes
> >>> the code even more complicated.
> > I haven't looked at the surrounding code too much, but I guarantee there's _zero_
> > reason to eat a VMWRITE+VMREAD on every transition.  If, emphasis on *if*, KVM
> > accesses PERF_GLOBAL_CTRL frequently, e.g. on most exits, then add a VCPU_EXREG_XXX
> > and let KVM's caching infrastructure do the heavy lifting.  Don't reinvent the
> > wheel.  But first, convince the world that KVM actually accesses the MSR somewhat
> > frequently.
> 
> Sean, let me give more background here.
> 
> VMX supports two ways to save/restore the PERF_GLOBAL_CTRL MSR. One is to
> leverage VMCS_EXIT_CTRL/VMCS_ENTRY_CTRL to save/restore the guest
> PERF_GLOBAL_CTRL value to/from the VMCS guest state. The other is to use
> the VMCS MSR auto-load/restore area to save/restore guest PERF_GLOBAL_CTRL.

I know.

> Currently we prefer to use the former way to save/restore guest
> PERF_GLOBAL_CTRL as long as HW supports it. There is a limitation on the
> MSR auto-load/restore feature. When there are multiple MSRs, the MSRs are
> saved/restored in the order of MSR index. As the SDM suggests,
> PERF_GLOBAL_CTRL should always be written last, after all other PMU MSRs
> are manipulated. So if there are some PMU MSRs whose index is larger than
> PERF_GLOBAL_CTRL's (this would be true on arch-perfmon v6+, where all PMU
> MSRs in the new MSR range have a larger index than PERF_GLOBAL_CTRL),

No, the entries in the load/store lists are processed in sequential order as they
appear in the lists.  Ordering them based on their MSR index would be insane and
would make the lists useless.

  VM entries may load MSRs from the VM-entry MSR-load area (see Section 25.8.2).
  Specifically each entry in that area (up to the number specified in the VM-entry
  MSR-load count) is processed in order by loading the MSR indexed by bits 31:0
  with the contents of bits 127:64 as they would be written by WRMSR.
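FWIW, each 128-bit entry in those load/store areas is laid out as below; this
mirrors struct vmx_msr_entry in arch/x86/include/asm/vmx.h (shown here purely
for illustration).  Nothing in the format implies index-based ordering:

  /* One entry in a VM-entry/VM-exit MSR-load/store area. */
  struct vmx_msr_entry {
  	u32 index;	/* bits 31:0: the MSR index */
  	u32 reserved;	/* bits 63:32: reserved, must be zero */
  	u64 value;	/* bits 127:64: loaded/stored as if by WRMSR/RDMSR */
  } __aligned(16);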
> these PMU MSRs would be restored after PERF_GLOBAL_CTRL. That would break
> the rule. Of course, it's good to save/restore PERF_GLOBAL_CTRL right now
> with the VMCS MSR auto-load/restore feature, since PERF_GLOBAL_CTRL is the
> only PMU MSR saved/restored in the current implementation.

No, it's never "good" to use the load/store lists.  They're slow as mud, because
they're essentially just wrappers to the standard WRMSR/RDMSR ucode.  Whereas
dedicated VMCS fields have dedicated, streamlined ucode to make loads and stores
as fast as possible.  I haven't measured PERF_GLOBAL_CTRL specifically, at least
not in recent memory, but generally speaking using a load/store entry is 100+
cycles, whereas using a dedicated VMCS field is <20 cycles (often far less).

So what I am saying is that the mediated PMU should _require_ support for loading
and saving PERF_GLOBAL_CTRL via dedicated fields, and WARN if a CPU with a v4+
PMU doesn't support said fields.  E.g.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a4b2b0b69a68..cab8305e7bf0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8620,6 +8620,15 @@ __init int vmx_hardware_setup(void)
 		enable_sgx = false;
 #endif
 
+	/*
+	 * All CPUs that support a mediated PMU are expected to support loading
+	 * and saving PERF_GLOBAL_CTRL via dedicated VMCS fields.
+	 */
+	if (enable_passthrough_pmu &&
+	    (WARN_ON_ONCE(!cpu_has_load_perf_global_ctrl() ||
+			  !cpu_has_save_perf_global_ctrl())))
+		enable_passthrough_pmu = false;
+
 	/*
 	 * set_apic_access_page_addr() is used to reload apic access
 	 * page upon invalidation.  No need to do anything if not

That will provide better, more consistent performance, and will eliminate a big
pile of non-trivial code.

> The PERF_GLOBAL_CTRL MSR could be accessed frequently by the perf/PMU
> driver, e.g. on each task switch, so in the mediated vPMU proposal
> PERF_GLOBAL_CTRL is configured as pass-through to reduce the performance
> impact if the guest owns all PMU HW resources. But if the guest only owns
> part of the PMU HW resources, PERF_GLOBAL_CTRL would be set to
> interception mode.

Again, I know.  What I am saying is that propagating PERF_GLOBAL_CTRL to/from
the VMCS on every entry and exit is extremely wasteful and completely unnecessary.

> I suppose KVM doesn't need to access PERF_GLOBAL_CTRL in pass-through
> mode. This piece of code is intended just for PERF_GLOBAL_CTRL
> interception mode,

No, it's even more useless if PERF_GLOBAL_CTRL is intercepted, because in that
case the _only_ time KVM needs to move the guest's value to/from the VMCS is
when the guest (or host userspace) is explicitly accessing the field.

> but on second thought it looks unnecessary to save/restore PERF_GLOBAL_CTRL
> via the VMCS, as KVM would always maintain the guest PERF_GLOBAL_CTRL
> value? Anyway, this part of the code can be optimized.
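Yes.  For the intercepted case, something like the below (completely untested,
and the helper/field names are illustrative, not from the actual series) is the
full extent of what's needed: one VMWRITE per explicit guest WRMSR, and zero
work on VM transitions.

  static int intel_pmu_set_global_ctrl(struct kvm_vcpu *vcpu, u64 data)
  {
  	struct kvm_pmu *pmu = vcpu_to_pmu(vcpu);

  	/* Reserved bits set => inject #GP (mask name is illustrative). */
  	if (data & pmu->global_ctrl_mask)
  		return 1;

  	if (pmu->global_ctrl != data) {
  		pmu->global_ctrl = data;
  		/*
  		 * One VMWRITE here; VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL then
  		 * loads the value into hardware on the next VM-entry.
  		 */
  		vmcs_write64(GUEST_IA32_PERF_GLOBAL_CTRL, data);
  	}
  	return 0;
  }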