Re: [PATCH] Documentation: KVM: Add vPMU implementaion and gap document

"Yang, Weijiang" <weijiang.yang@xxxxxxxxx> · Wed, 26 Jul 2023 16:01:40 +0800

On 7/24/2023 6:41 PM, Xiong Zhang wrote:
Add a vPMU implementation and gap document to explain vArch PMU and vLBR
implementation in kvm, especially the current gap to support host and
guest perf event coexist.

Signed-off-by: Xiong Zhang <xiong.y.zhang@xxxxxxxxx>
---
  Documentation/virt/kvm/x86/index.rst |   1 +
  Documentation/virt/kvm/x86/pmu.rst   | 249 +++++++++++++++++++++++++++
  2 files changed, 250 insertions(+)
  create mode 100644 Documentation/virt/kvm/x86/pmu.rst

diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
index 9ece6b8dc817..02c1c7b01bf3 100644
--- a/Documentation/virt/kvm/x86/index.rst
+++ b/Documentation/virt/kvm/x86/index.rst
@@ -14,5 +14,6 @@ KVM for x86 systems
     mmu
     msr
     nested-vmx
+   pmu
     running-nested-guests
     timekeeping
diff --git a/Documentation/virt/kvm/x86/pmu.rst b/Documentation/virt/kvm/x86/pmu.rst
new file mode 100644
index 000000000000..e95e8c88e0e0
--- /dev/null
+++ b/Documentation/virt/kvm/x86/pmu.rst
@@ -0,0 +1,249 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+PMU virtualization for X86
+==========================
+
+:Author: Xiong Zhang <xiong.y.zhang@xxxxxxxxx>
+:Copyright: (c) 2023, Intel.  All rights reserved.
+
+.. Contents
+
+1. Overview
+2. Perf Scheduler
+3. Arch PMU virtualization
+4. LBR virtualization
+
+1. Overview
+===========
+
+KVM has supported PMU virtualization on x86 for many years and provides
+MSR based Arch PMU interface to the guest. The major features include
+Arch PMU v2, LBR and PEBS. Users have the same operation to profile
+performance in guest and host.
+KVM is a normal perf subsystem user as other perf subsystem users. When
+the guest access vPMU MSRs, KVM traps it and creates a perf event for it.
+This perf event takes part in perf scheduler to request PMU resources
+and let the guest use these resources.
+
+This document describes the X86 PMU virtualization architecture design
+and opens. It is organized as follows: Next section describes more
+details of Linux perf scheduler as it takes a key role in vPMU
+implementation and allocates PMU resources for guest usage. Then Arch
+PMU virtualization and LBR virtualization are introduced, each feature
+has sections to introduce implementation overview,  the expectation and
+gaps when host and guest perf events coexist.
+
+2. Perf Scheduler
+=================
+
+Perf scheduler's responsibility is choosing which events are active at
+one moment and binding counter with perf event. As processor has limited
+PMU counters and other resource, only limited perf events can be active
+at one moment, the inactive perf event may be active in the next moment,
+perf scheduler has defined rules to control these things.
+
+Usually the following cases cause perf event reschedule:
+1) On a context switch from one task to a different task.
+2) When an event is manually enabled.
+3) A call to perf_event_open() with disabled field of the
+perf_event_attr argument set to 0.

And when perf scheduler timer expires.

+
+When perf event reschedule is needed on a specific cpu, perf will send
+an IPI to the target cpu, and the IPI handler will activate events
+ordered by event type, and will iterate all the eligible events.

IIUC, this is only true for the event create case, not for all above 
reschedule cases.

+
+When a perf event is sched out, this event mapped counter is disabled,
+and the counter's setting and count value are saved. When a perf event
+is sched in, perf driver assigns a counter to this event, the counter's
+setting and count values are restored from last saved.
+
+Perf defines four types event, their priority are from high to low:
+a. Per-cpu pinned: the event should be measured on the specified logical
+core whenever it is enabled.
+b. Per-process pinned: the event should be measured whenever it is
+enabled and the process is running on any logical cores.
+c. Per-cpu flexible: the event should measured on the specified logical
+core whenever it is enabled.
+d. Per-process flexible: the event should be measured whenever it is
+enabled and the process is running on any logical cores.
+
+If the event could not be scheduled because no resource is available for
+it, pinned event goes into error state and is excluded from perf
+scheduler, the only way to recover it is re-enable it, flexible event
+goes into inactive state and can be multiplexed with other events if
+needed.

Maybe you can add some diagrams or list some key definitions/data 
structures/prototypes

to facilitate readers to understand more about perf schedule since it's 
the key of perf subsystem.

+
+3. Arch PMU virtualization
+==========================
+
+3.1. Overview
+-------------
+
+Once KVM/QEMU expose vcpu's Arch PMU capability into guest, the guest
+PMU driver would access the Arch PMU MSRs (including Fixed and GP
+counter) as the host does. All the guest Arch PMU MSRs accessing are
+interceptable.
+
+When a guest virtual counter is enabled through guest MSR writing, the
+KVM trap will create a kvm perf event through the perf subsystem. The
+kvm perf event's attribute is gotten from the guest virtual counter's
+MSR setting.
+
+When a guest changes the virtual counter's setting later, the KVM trap
+will release the old kvm perf event then create a new kvm perf event
+with the new setting.
+
+When guest read the virtual counter's count number, the kvm trap will
+read kvm perf event's counter value and accumulate it to the previous
+counter value.
+
+When guest no longer access the virtual counter's MSR within a
+scheduling time slice and the virtual counter is disabled, KVM will
+release the kvm perf event.
+  ----------------------------
+  |  Guest                   |
+  |  perf subsystem          |
+  ----------------------------
+       |            ^
+  vMSR |            | vPMI
+       v            |
+  ----------------------------
+  |  vPMU        KVM vCPU    |
+  ----------------------------
+        |          ^
+  Call  |          | Callbacks
+        v          |
+  ---------------------------
+  | Host Linux Kernel       |
+  | perf subsystem          |
+  ---------------------------
+               |       ^
+           MSR |       | PMI
+               v       |
+         --------------------
+	 | PMU        CPU   |
+         --------------------
+
+Each guest virtual counter has a corresponding kvm perf event, and the
+kvm perf event joins host perf scheduler and complies with host perf
+scheduler rule. When kvm perf event is scheduled by host perf scheduler
+and is active, the guest virtual counter could supply the correct value.
+However, if another host perf event comes in and takes over the kvm perf
+event resource, the kvm perf event will be inactive, then the virtual
+counter supplies wrong and meaningless value.

IMHO, the data is still valid for preempted event as it's saved when the 
event is sched_out.

But it doesn't match the running task under profiling, and this is 
normal when perf

preemption exits.

+
+3.2. Host and Guest perf event contention
+-----------------------------------------
+
+Kvm perf event is a per-process pinned event, its priority is second.
+When kvm perf event is active, it can be preempted by host per-cpu
+pinned perf event, or it can preempt host flexible perf events. Such
+preemption can be temporarily prohibited through disabling host IRQ.
+
+The following results are expected when host and guest perf event
+coexist according to perf scheduler rule:
+1). if host per cpu pinned events occupy all the HW resource, kvm perf
+event can not be active as no available resource, the virtual counter
+value is  zero always when the guest read it.
+2). if host per cpu pinned event release HW resource, and kvm perf event
+is inactive, kvm perf event can claim the HW resource and switch into
+active, then the guest can get the correct value from the guest virtual
+counter during kvm perf event is active, but the guest total counter
+value is not correct since counter value is lost during kvm perf event
+is inactive.
+3). if kvm perf event is active, then host per cpu pinned perf event
+becomes active and reclaims kvm perf event resource, kvm perf event will
+be inactive. Finally the virtual counter value is kept unchanged and
+stores previous saved value when the guest reads it. So the guest toatal
+counter isn't correct.
+4). If host flexible perf events occupy all the HW resource, kvm perf
+event can be active and preempts host flexible perf event resource,
+guest can get the correct value from the guest virtual counter.
+5). if kvm perf event is active, then other host flexible perf events
+request to active, kvm perf event still own the resource and active, so
+guest can get the correct value from the guest virtual counter.
+
+3.3. vPMU Arch Gaps
+-------------------
+
+The coexist of host and guest perf events has gap:
+1). when guest accesses PMU MSRs at the first time, KVM will trap it and
+create kvm perf event, but this event may be inactive because the
+contention with host perf event. But guest doesn't notice this and when
+guest read virtual counter, the return value is zero.
+2). when kvm perf event is active, host per-cpu pinned perf event can
+reclaim kvm perf event resource at any time once resource contention
+happens. But guest doesn't notice this neither and guest following
+counter accesses get wrong data.
+So maillist had some discussion titled "Reconsider the current approach
+of vPMU".
+
+https://lore.kernel.org/lkml/810c3148-1791-de57-27c0-d1ac5ed35fb8@xxxxxxxxx/
+
+The major suggestion in this discussion is host pass-through some
+counters into guest, but this suggestion is not feasible, the reasons
+are:
+a. processor has several counters, but counters are not equal, some
+event must bind with a specific counter.
+b. if a special counter is passthrough into guest, host can not support
+such event and lose some capability.
+c. if a normal counter is passthrough into guest, guest can support
+general event only, and the guest has limited capability.
+So both host and guest lose capability in pass-through mode.
+
+4. LBR Virtualization
+=====================
+
+4.1. Overview
+-------------
+
+The guest LBR driver would access the LBR MSR (including IA32_DEBUGCTLMSR
+and records MSRs) as host does once KVM/QEMU export vcpu's LBR capability
+into guest,  The first guest access on LBR related MSRs is always
+interceptable. The KVM trap would create a vLBR perf event which enables
+the callstack mode and none of the hardware counters are assigned. The
+host perf would enable and schedule this event as usual.
+
+When vLBR event is scheduled by host perf scheduler and is active, host
+LBR MSRs are owned by guest and are pass-through into guest, guest will
+access them without VM Exit. However, if another host LBR event comes in
+and takes over the LBR facility, the vLBR event will be inactive, and
+guest following accesses to the LBR MSRs will be trapped and meaningless.

Is this true only when host created a pinned LBR event? Otherwise, it 
won't preempt

the guest vLBR.


+
+As kvm perf event, vLBR event will be released when guest doesn't access
+LBR-related MSRs within a scheduling time slice and guest unset LBR
+enable bit, then the pass-through state of the LBR MSRs will be canceled.
+
+4.2. Host and Guest LBR contention
+----------------------------------
+
+vLBR event is a per-process pinned event, its priority is second. vLBR
+event together with host other LBR event to contend LBR resource,
+according to perf scheduler rule, when vLBR event is active, it can be
+preempted by host per-cpu pinned LBR event, or it can preempt host
+flexible LBR event. Such preemption can be temporarily prohibited
+through disabling host IRQ as perf scheduler uses IPI to change LBR owner.
+
+The following results are expected when host and guest LBR event coexist:
+1) If host per cpu pinned LBR event is active when vm starts, the guest
+vLBR event can not preempt the LBR resource, so the guest can not use
+LBR.
+2). If host flexible LBR events are active when vm starts, guest vLBR
+event can preempt LBR, so the guest can use LBR.
+3). If host per cpu pinned LBR event becomes enabled when guest vLBR
+event is active, the guest vLBR event will lose LBR and the guest can
+not use LBR anymore.
+4). If host flexible LBR event becomes enabled when guest vLBR event is
+active, the guest vLBR event keeps LBR, the guest can still use LBR.
+5). If host per cpu pinned LBR event becomes inactive when guest vLBR
+event is inactive, guest vLBR event can be active and own LBR, the guest
+can use LBR.

Anyway, vLBR problems is still induced by perf scheduling priorities, if 
you can

clearly state current gaps of vPMU, it's also clear for vLBR issue, then 
this section

could be omitted.

+
+4.3. vLBR Arch Gaps
+-------------------
+
+Like vPMU Arch Gap, vLBR event can be preempted by host Per cpu pinned
+event at any time, or vLBR event is inactive at creation, but guest
+can not notice this, so the guest will get meaningless value when the
+vLBR event is inactive.