On 10/8/2023 1:45 pm, Xiong Zhang wrote:
Add a vPMU implementation and gap document to explain vArch PMU and vLBR
implementation in kvm, especially the current gap to support host and
guest perf event coexist.
Signed-off-by: Xiong Zhang <xiong.y.zhang@xxxxxxxxx>
---
Changelog:
v2 -> v3:
* When kvm perf event is inactive, it is in error state actually,
so inactive is changed into error.
* Fix make htmldoc warning
v1 -> v2:
* Refactor perf scheduler section
* Correct one sentence in vArch PMU section
---
Documentation/virt/kvm/x86/index.rst | 1 +
Documentation/virt/kvm/x86/pmu.rst | 332 +++++++++++++++++++++++++++
2 files changed, 333 insertions(+)
create mode 100644 Documentation/virt/kvm/x86/pmu.rst
diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
index 9ece6b8dc817..02c1c7b01bf3 100644
--- a/Documentation/virt/kvm/x86/index.rst
+++ b/Documentation/virt/kvm/x86/index.rst
@@ -14,5 +14,6 @@ KVM for x86 systems
mmu
msr
nested-vmx
+ pmu
running-nested-guests
timekeeping
diff --git a/Documentation/virt/kvm/x86/pmu.rst b/Documentation/virt/kvm/x86/pmu.rst
new file mode 100644
index 000000000000..503516c41119
--- /dev/null
+++ b/Documentation/virt/kvm/x86/pmu.rst
@@ -0,0 +1,332 @@
+.. SPDX-License-Identifier: GPL-2.0
WARNING: Missing or malformed SPDX-License-Identifier tag in line 1
#34: FILE: Documentation/virt/kvm/x86/pmu.rst:1:
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+PMU virtualization for X86
For Intel ?
Many of the descriptions below are Intel features, so you either have to
add AMD part or limit the scope, otherwise the reader will be confused.
+==========================
+
+:Author: Xiong Zhang <xiong.y.zhang@xxxxxxxxx>
+:Copyright: (c) 2023, Intel. All rights reserved.
+
+.. Contents
+
+1. Overview
+2. Perf Scheduler Basic
I would strongly suggest moving this section out of KVM subsystem,
Two possible locations:
- Documentation/admin-guide/perf
- tools/perf/Documentation/
Or the better one, the man2/perf_event_open.2 page in the man-pages project:
-
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/perf_event_open.2
It's more generic and someone more appropriate could update and maintain it for
more readers.
Here we only need to refer to it so as not to bother users with
outdated/mismatched content.
+3. Arch PMU virtualization
+4. LBR virtualization
How about organizing it this way:
1. PMU Overview
2. Capability Enumeration
3. Basic Counter Virtualization
4. Intel PEBS Virtualization
5. Intel LBR Virtualization
6. KVM PMU APIs Guidance
7. Limitations and TODOs
8. Reference
+
+1. Overview
+===========
+
+KVM has supported PMU virtualization on x86 for many years and provides
+MSR based Arch PMU interface to the guest. The major features include
+Arch PMU v2, LBR and PEBS. Users have the same operation to profile
+performance in guest and host.
+KVM is a normal perf subsystem user as other perf subsystem users. When
+the guest access vPMU MSRs, KVM traps it and creates a perf event for it.
+This perf event takes part in perf scheduler to request PMU resources
+and let the guest use these resources.
+
+This document describes the X86 PMU virtualization architecture design
+and opens. It is organized as follows: Next section describes more
+details of Linux perf scheduler as it takes a key role in vPMU
+implementation and allocates PMU resources for guest usage. Then Arch
+PMU virtualization and LBR virtualization are introduced, each feature
+has sections to introduce implementation overview, the expectation and
+gaps when host and guest perf events coexist.
+
+2. Perf Scheduler Basic
+=======================
+
+Perf subsystem users can not get PMU counter or resource directly, user
+should create a perf event first and specify event’s attribute which is
+used to choose PMU counters, then perf event joins in perf scheduler,
+perf scheduler assigns the corresponding PMU counter to a perf event.
+
+Perf event is created by perf_event_open() system call::
+
+ int syscall(SYS_perf_event_open, struct perf_event_attr *,
+ pid, cpu, group_fd, flags)
+ struct perf_event_attr {
+ ......
+ /* Major type: hardware/software/tracepoint/etc. */
+ __u32 type;
+ /* Type specific configuration information. */
+ __u64 config;
+ union {
+ __u64 sample_period;
+ __u64 sample_freq;
+ }
+ __u64 disabled :1;
+ pinned :1;
+ exclude_user :1;
+ exclude_kernel :1;
+ exclude_host :1;
+ exclude_guest :1;
+ ......
+ }
+
+The pid and cpu arguments allow specifying which process and CPU
+to monitor::
+
+ pid == 0 and cpu == -1
+ This measures the calling process/thread on any CPU.
+ pid == 0 and cpu >= 0
+ This measures the calling process/thread only when running on
+ the specified cpu.
+ pid > 0 and cpu == -1
+ This measures the specified process/thread on any cpu.
+ pid > 0 and cpu >= 0
+ This measures the specified process/thread only when running
+ on the specified CPU.
+ pid == -1 and cpu >= 0
+ This measures all processes/threads on the specified CPU.
+ pid == -1 and cpu == -1
+ This setting is invalid and will return an error.
+
+Perf scheduler's responsibility is choosing which events are active at
+one moment and binding counter with perf event. As processor has limited
+PMU counters and other resource, only limited perf events can be active
+at one moment, the inactive perf event may be active in the next moment,
+perf scheduler has defined rules to control these things.
+
+Perf scheduler defines four types of perf event, defined by the pid and
+cpu arguments in perf_event_open(), plus perf_event_attr.pinned, their
+schedule priority are: per_cpu pinned > per_process pinned
+> per_cpu flexible > per_process flexible. High priority events can
+preempt low priority events when resources contend.
+
+perf event type::
+
+ --------------------------------------------------------
+ | | pid | cpu | pinned |
+ --------------------------------------------------------
+ | Per-cpu pinned | * | >= 0 | 1 |
+ --------------------------------------------------------
+ | Per-process pinned | >= 0 | * | 1 |
+ --------------------------------------------------------
+ | Per-cpu flexible | * | >= 0 | 0 |
+ --------------------------------------------------------
+ | Per-process flexible | >= 0 | * | 0 |
+ --------------------------------------------------------
+
+perf_event abstract::
+
+ struct perf_event {
+ struct list_head event_entry;
+ ......
+ struct pmu *pmu;
+ enum perf_event_state state;
+ local64_t count;
+ u64 total_time_enabled;
+ u64 total_time_running;
+ struct perf_event_attr attr;
+ ......
+ }
+
+For per-cpu perf event, it is linked into per cpu global variable
+perf_cpu_context, for per-process perf event, it is linked into
+task_struct->perf_event_context.
+
+Usually the following cases cause perf event reschedule:
+1) In a context switch from one task to a different task.
+2) When an event is manually enabled.
+3) A call to perf_event_open() with disabled field of the
+perf_event_attr argument set to 0.
+
+When perf_event_open() or perf_event_enable() is called, perf event
+reschedule is needed on a specific cpu, perf will send an IPI to the
+target cpu, and the IPI handler will activate events ordered by event
+type, and will iterate all the eligible events in per cpu gloable
+variable perf_cpu_context and current->perf_event_context.
+
+When a perf event is sched out, this event mapped counter is disabled,
+and the counter's setting and count value are saved. When a perf event
+is sched in, perf driver assigns a counter to this event, the counter's
+setting and count values are restored from last saved.
+
+If the event could not be scheduled because no resource is available for
+it, pinned event goes into error state and is excluded from perf
+scheduler, the only way to recover it is re-enable it, flexible event
+goes into inactive state and can be multiplexed with other events if
+needed.
+
+
+3. Arch PMU virtualization
+==========================
+
+3.1. Overview
+-------------
+
+Once KVM/QEMU expose vcpu's Arch PMU capability into guest, the guest
+PMU driver would access the Arch PMU MSRs (including Fixed and GP
+counter) as the host does. All the guest Arch PMU MSRs accessing are
+interceptable.
+
+When a guest virtual counter is enabled through guest MSR writing, the
+KVM trap will create a kvm perf event through the perf subsystem. The
+kvm perf event's attribute is gotten from the guest virtual counter's
+MSR setting.
+
+When a guest changes the virtual counter's setting later, the KVM trap
+will release the old kvm perf event then create a new kvm perf event
+with the new setting.
+
+When guest read the virtual counter's count number, the kvm trap will
+read kvm perf event's counter value and accumulate it to the previous
+counter value.
+
+When guest no longer access the virtual counter's MSR within a
+scheduling time slice and the virtual counter is disabled, KVM will
+release the kvm perf event.
+
+vPMU diagram::
+
+ ----------------------------
+ | Guest |
+ | perf subsystem |
+ ----------------------------
+ | ^
+ vMSR | | vPMI
+ v |
+ ----------------------------
+ | vPMU KVM vCPU |
+ ----------------------------
+ | ^
+ Call | | Callbacks
+ v |
+ ---------------------------
+ | Host Linux Kernel |
+ | perf subsystem |
+ ---------------------------
+ | ^
+ MSR | | PMI
+ v |
+ --------------------
+ | PMU CPU |
+ --------------------
+
+Each guest virtual counter has a corresponding kvm perf event, and the
+kvm perf event joins host perf scheduler and complies with host perf
+scheduler rule. When kvm perf event is scheduled by host perf scheduler
+and is active, the guest virtual counter could supply the correct value.
+However, if another host perf event comes in and takes over the kvm perf
+event resource, the kvm perf event will be in error state, then the
+virtual counter keeps the saved value when the kvm perf event is preempted.
+But guest perf doesn't notice the underbeach virtual counter is stopped, so
+the final guest profiling data is wrong.
+
+3.2. Host and Guest perf event contention
+-----------------------------------------
+
+Kvm perf event is a per-process pinned event, its priority is second.
+When kvm perf event is active, it can be preempted by host per-cpu
+pinned perf event, or it can preempt host flexible perf events. Such
+preemption can be temporarily prohibited through disabling host IRQ.
+
+The following results are expected when host and guest perf event
+coexist according to perf scheduler rule:
+1). if host per cpu pinned events occupy all the HW resource, kvm perf
+event can not be active as no available resource, the virtual counter
+value is zero always when the guest reads it.
+2). if host per cpu pinned event release HW resource, and kvm perf event
+is in error state, kvm perf event can claim the HW resource and switch into
+active, then the guest can get the correct value from the guest virtual
+counter during kvm perf event is active, but the guest total counter
+value is not correct since counter value is lost during kvm perf event
+is in error state.
+3). if kvm perf event is active, then host per cpu pinned perf event
+becomes active and reclaims kvm perf event resource, kvm perf event will
+be in error state. Finally the virtual counter value is kept unchanged and
+stores previous saved value when the guest reads it. So the guest total
+counter isn't correct.
+4). If host flexible perf events occupy all the HW resource, kvm perf
+event can be active and preempts host flexible perf event resource,
+the guest can get the correct value from the guest virtual counter.
+5). if kvm perf event is active, then other host flexible perf events
+request to active, kvm perf event still own the resource and active, so
+the guest can get the correct value from the guest virtual counter.
+
+3.3. vPMU Arch Gaps
+-------------------
+
+The coexist of host and guest perf events has gaps:
+1). when guest accesses PMU MSRs at the first time, KVM will trap it and
+create kvm perf event, but this event may be not active because the
+contention with host perf event. But guest doesn't notice this and when
+guest read virtual counter, the return value is zero.
+2). when kvm perf event is active, host per-cpu pinned perf event can
+reclaim kvm perf event resource at any time once resource contention
+happens. But guest doesn't notice this neither and guest following
+counter accesses get wrong data.
+So maillist had some discussion titled "Reconsider the current approach
+of vPMU".
+
+https://lore.kernel.org/lkml/810c3148-1791-de57-27c0-d1ac5ed35fb8@xxxxxxxxx/
+
+The major suggestion in this discussion is host pass-through some
+counters into guest, but this suggestion is not feasible, the reasons
+are:
+a. processor has several counters, but counters are not equal, some
+event must bind with a specific counter.
+b. if a special counter is passthrough into guest, host can not support
+such events and lose some capability.
+c. if a normal counter is passthrough into guest, guest can support
+general event only, and the guest has limited capability.
+So both host and guest lose capability in pass-through mode.
+
+4. LBR Virtualization
+=====================
+
+4.1. Overview
+-------------
+
+The guest LBR driver would access the LBR MSR (including IA32_DEBUGCTLMSR
+and records MSRs) as host does once KVM/QEMU export vcpu's LBR capability
+into guest, The first guest access on LBR related MSRs is always
+interceptable. The KVM trap would create a vLBR perf event which enables
+the callstack mode and none of the hardware counters are assigned. The
+host perf would enable and schedule this event as usual.
+
+When vLBR event is scheduled by host perf scheduler and is active, host
+LBR MSRs are owned by guest and are pass-through into guest, guest will
+access them without VM Exit. However, if another host LBR event comes in
+and takes over the LBR facility, the vLBR event will be in error state,
+and the guest following access to the LBR MSRs will be trapped and
+meaningless.
+
+As kvm perf event, vLBR event will be released when guest doesn't access
+LBR-related MSRs within a scheduling time slice and guest unset LBR
+enable bit, then the pass-through state of the LBR MSRs will be canceled.
+
+4.2. Host and Guest LBR contention
+----------------------------------
+
+vLBR event is a per-process pinned event, its priority is second. vLBR
+event together with host other LBR event to contend LBR resource,
+according to perf scheduler rule, when vLBR event is active, it can be
+preempted by host per-cpu pinned LBR event, or it can preempt host
+flexible LBR event. Such preemption can be temporarily prohibited
+through disabling host IRQ as perf scheduler uses IPI to change LBR owner.
+
+The following results are expected when host and guest LBR event coexist:
+1) If host per cpu pinned LBR event is active when vm starts, the guest
+vLBR event can not preempt the LBR resource, so the guest can not use
+LBR.
+2). If host flexible LBR events are active when vm starts, guest vLBR
+event can preempt LBR, so the guest can use LBR.
+3). If host per cpu pinned LBR event becomes enabled when guest vLBR
+event is active, the guest vLBR event will lose LBR and the guest can
+not use LBR anymore.
+4). If host flexible LBR event becomes enabled when guest vLBR event is
+active, the guest vLBR event keeps LBR, the guest can still use LBR.
+5). If host per cpu pinned LBR event turns off when guest vLBR event is
+not active, guest vLBR event can be active and own LBR, the guest can use
+LBR.
+
+4.3. vLBR Arch Gaps
+-------------------
+
+Like vPMU Arch Gap, vLBR event can be preempted by host Per cpu pinned
+event at any time, or vLBR event is not active at creation, but guest
+can not notice this, so the guest will get meaningless value when the
+vLBR event is not active.
base-commit: 88bb466c9dec4f70d682cf38c685324e7b1b3d60