Add a vPMU implementation and gap document that explains the vArch PMU
and vLBR implementations in KVM, especially the current gaps in
supporting coexisting host and guest perf events.

Signed-off-by: Xiong Zhang <xiong.y.zhang@xxxxxxxxx>
---
 Documentation/virt/kvm/x86/index.rst |   1 +
 Documentation/virt/kvm/x86/pmu.rst   | 249 +++++++++++++++++++++++++++
 2 files changed, 250 insertions(+)
 create mode 100644 Documentation/virt/kvm/x86/pmu.rst

diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
index 9ece6b8dc817..02c1c7b01bf3 100644
--- a/Documentation/virt/kvm/x86/index.rst
+++ b/Documentation/virt/kvm/x86/index.rst
@@ -14,5 +14,6 @@ KVM for x86 systems
    mmu
    msr
    nested-vmx
+   pmu
    running-nested-guests
    timekeeping
diff --git a/Documentation/virt/kvm/x86/pmu.rst b/Documentation/virt/kvm/x86/pmu.rst
new file mode 100644
index 000000000000..e95e8c88e0e0
--- /dev/null
+++ b/Documentation/virt/kvm/x86/pmu.rst
@@ -0,0 +1,249 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+PMU virtualization for X86
+==========================
+
+:Author: Xiong Zhang <xiong.y.zhang@xxxxxxxxx>
+:Copyright: (c) 2023, Intel. All rights reserved.
+
+.. Contents
+
+1. Overview
+2. Perf Scheduler
+3. Arch PMU virtualization
+4. LBR virtualization
+
+1. Overview
+===========
+
+KVM has supported PMU virtualization on x86 for many years and provides
+an MSR-based Arch PMU interface to the guest. The major features include
+Arch PMU v2, LBR and PEBS. Users profile performance the same way in the
+guest as on the host.
+KVM is an ordinary user of the perf subsystem, like any other. When the
+guest accesses vPMU MSRs, KVM traps the access and creates a perf event
+for it. This perf event takes part in perf scheduling to request PMU
+resources and lets the guest use these resources.
+
+This document describes the X86 PMU virtualization architecture design
+and open issues. It is organized as follows: the next section describes
+the Linux perf scheduler in more detail, as it plays a key role in the
+vPMU implementation and allocates PMU resources for guest use. Arch PMU
+virtualization and LBR virtualization are then introduced; each feature
+has sections covering the implementation overview, the expected behavior
+and the gaps when host and guest perf events coexist.
+
+2. Perf Scheduler
+=================
+
+The perf scheduler is responsible for choosing which events are active
+at any moment and for binding counters to perf events. As a processor
+has limited PMU counters and other resources, only a limited number of
+perf events can be active at once. An inactive perf event may become
+active later; the perf scheduler defines rules that govern this.
+
+The following cases usually cause perf events to be rescheduled:
+1) On a context switch from one task to a different task.
+2) When an event is manually enabled.
+3) A call to perf_event_open() with the disabled field of the
+perf_event_attr argument set to 0.
+
+When perf events need to be rescheduled on a specific cpu, perf sends
+an IPI to the target cpu, and the IPI handler activates events ordered
+by event type, iterating over all the eligible events.
+
+When a perf event is scheduled out, the counter mapped to it is
+disabled, and the counter's setting and count value are saved. When a
+perf event is scheduled in, the perf driver assigns a counter to it and
+restores the counter's setting and count value from the saved state.
+
+Perf defines four types of events, listed from high to low priority:
+a. Per-cpu pinned: the event should be measured on the specified logical
+core whenever it is enabled.
+b. Per-process pinned: the event should be measured whenever it is
+enabled and the process is running on any logical core.
+c. Per-cpu flexible: the event should be measured on the specified
+logical core whenever it is enabled.
+d. Per-process flexible: the event should be measured whenever it is
+enabled and the process is running on any logical core.
+
+If an event cannot be scheduled because no resource is available for
+it, the outcome depends on its type. A pinned event goes into the error
+state and is excluded from the perf scheduler; the only way to recover
+it is to re-enable it. A flexible event goes into the inactive state and
+can be multiplexed with other events if needed.
+
+3. Arch PMU virtualization
+==========================
+
+3.1. Overview
+-------------
+
+Once KVM/QEMU exposes the vcpu's Arch PMU capability to the guest, the
+guest PMU driver accesses the Arch PMU MSRs (including fixed and GP
+counters) as the host does. All guest accesses to the Arch PMU MSRs can
+be intercepted.
+
+When a guest virtual counter is enabled through a guest MSR write, the
+KVM trap handler creates a kvm perf event through the perf subsystem.
+The kvm perf event's attributes are derived from the guest virtual
+counter's MSR setting.
+
+When the guest later changes the virtual counter's setting, the KVM trap
+handler releases the old kvm perf event and creates a new kvm perf event
+with the new setting.
+
+When the guest reads the virtual counter's count, the KVM trap handler
+reads the kvm perf event's counter value and adds it to the previously
+accumulated counter value.
+
+When the guest has not accessed the virtual counter's MSR within a
+scheduling time slice and the virtual counter is disabled, KVM
+releases the kvm perf event.
+
+    ----------------------------
+    |          Guest           |
+    |      perf subsystem      |
+    ----------------------------
+         |            ^
+    vMSR |            | vPMI
+         v            |
+    ----------------------------
+    |  vPMU        KVM vCPU    |
+    ----------------------------
+         |            ^
+    Call |            | Callbacks
+         v            |
+    ---------------------------
+    |    Host Linux Kernel    |
+    |     perf subsystem      |
+    ---------------------------
+         |            ^
+     MSR |            | PMI
+         v            |
+        --------------------
+        |  PMU        CPU  |
+        --------------------
+Each guest virtual counter has a corresponding kvm perf event, which
+joins the host perf scheduler and complies with its rules. While the
+kvm perf event is scheduled by the host perf scheduler and is active,
+the guest virtual counter supplies the correct value. However, if
+another host perf event comes in and takes over the kvm perf event's
+resource, the kvm perf event becomes inactive and the virtual counter
+supplies a wrong, meaningless value.
+
+3.2. Host and Guest perf event contention
+-----------------------------------------
+
+A kvm perf event is a per-process pinned event, the second-highest
+priority. When a kvm perf event is active, it can be preempted by a host
+per-cpu pinned perf event, and it can preempt host flexible perf events.
+Such preemption can be temporarily blocked by disabling host IRQs.
+
+The following results are expected when host and guest perf events
+coexist, according to the perf scheduler rules:
+1). If host per-cpu pinned events occupy all the HW resources, the kvm
+perf event cannot become active because no resource is available, and
+the virtual counter value is always zero when the guest reads it.
+2). If a host per-cpu pinned event releases HW resources while the kvm
+perf event is inactive, the kvm perf event can claim the HW resources
+and become active. The guest then gets the correct value from the guest
+virtual counter while the kvm perf event is active, but the guest's
+total count is not correct, since counts are lost while the kvm perf
+event is inactive.
+3). If the kvm perf event is active and a host per-cpu pinned perf event
+then becomes active and reclaims the kvm perf event's resource, the kvm
+perf event becomes inactive. The virtual counter value then stays
+unchanged at the previously saved value when the guest reads it, so the
+guest's total count is not correct.
+4). If host flexible perf events occupy all the HW resources, the kvm
+perf event can become active by preempting a host flexible perf event's
+resource, and the guest gets the correct value from the virtual counter.
+5). If the kvm perf event is active and other host flexible perf events
+then request to become active, the kvm perf event keeps the resource and
+stays active, so the guest gets the correct value from the virtual counter.
+
+3.3. vPMU Arch Gaps
+-------------------
+
+The coexistence of host and guest perf events has the following gaps:
+1). When the guest accesses PMU MSRs for the first time, KVM traps the
+access and creates a kvm perf event, but this event may be inactive
+because of contention with host perf events. The guest is not notified
+of this, and when it reads the virtual counter the returned value is zero.
+2). When the kvm perf event is active, a host per-cpu pinned perf event
+can reclaim the kvm perf event's resource at any time once resource
+contention happens. The guest is not notified of this either, and its
+subsequent counter accesses return wrong data.
+This was discussed on the mailing list in a thread titled "Reconsider
+the current approach of vPMU".
+
+https://lore.kernel.org/lkml/810c3148-1791-de57-27c0-d1ac5ed35fb8@xxxxxxxxx/
+
+The major suggestion in this discussion is that the host passes some
+counters through to the guest, but this suggestion is not feasible, for
+the following reasons:
+a. A processor has several counters, but the counters are not equal;
+some events must be bound to a specific counter.
+b. If a special counter is passed through to the guest, the host cannot
+support such events and loses some capability.
+c. If a normal counter is passed through to the guest, the guest can
+support general events only, so the guest has limited capability.
+So both host and guest lose capability in pass-through mode.
+
+4. LBR virtualization
+=====================
+
+4.1. Overview
+-------------
+
+Once KVM/QEMU exposes the vcpu's LBR capability to the guest, the guest
+LBR driver accesses the LBR MSRs (including IA32_DEBUGCTLMSR and the
+record MSRs) as the host does. The first guest access to an LBR-related
+MSR is always intercepted. The KVM trap handler creates a vLBR perf event
+that enables callstack mode and has no hardware counter assigned. Host
+perf enables and schedules this event as usual.
+
+While the vLBR event is scheduled by the host perf scheduler and active,
+the host LBR MSRs are owned by the guest and passed through to it, so the
+guest accesses them without a VM exit. However, if another host LBR event
+takes over the LBR facility, the vLBR event becomes inactive, and later
+guest accesses to the LBR MSRs are trapped and return meaningless values.
+
+Like a kvm perf event, the vLBR event is released when the guest has not
+accessed LBR-related MSRs within a scheduling time slice and has cleared
+the LBR enable bit; the pass-through state of the LBR MSRs is then canceled.
+
+4.2. Host and Guest LBR contention
+----------------------------------
+
+A vLBR event is a per-process pinned event, the second-highest priority.
+The vLBR event contends for the LBR resource with the host's other LBR
+events. According to the perf scheduler rules, an active vLBR event can
+be preempted by a host per-cpu pinned LBR event, and it can preempt host
+flexible LBR events. Such preemption can be temporarily blocked by
+disabling host IRQs, as the perf scheduler uses an IPI to change LBR owner.
+
+The following results are expected when host and guest LBR events coexist:
+1). If a host per-cpu pinned LBR event is active when the VM starts, the
+guest vLBR event cannot preempt the LBR resource, so the guest cannot
+use LBR.
+2). If host flexible LBR events are active when the VM starts, the guest
+vLBR event can preempt LBR, so the guest can use LBR.
+3). If a host per-cpu pinned LBR event becomes enabled while the guest
+vLBR event is active, the guest vLBR event loses LBR and the guest can
+no longer use LBR.
+4). If a host flexible LBR event becomes enabled while the guest vLBR event
+is active, the guest vLBR event keeps LBR and the guest can still use LBR.
+5). If a host per-cpu pinned LBR event becomes inactive while the guest
+vLBR event is inactive, the guest vLBR event can become active and own
+LBR, so the guest can use LBR.
+
+4.3. vLBR Arch Gaps
+-------------------
+
+As with the vPMU Arch gaps, the vLBR event can be preempted by a host
+per-cpu pinned event at any time, or can be inactive at creation, but
+the guest is not notified of this, so it gets meaningless values while
+the vLBR event is inactive.
-- 
2.25.1
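
Illustration of the contention rules described in sections 2 and 3.2
(not part of the patch). This is a toy model, not kernel code: the names
Prio, Event, active_events and guest_view are hypothetical, every event
is assumed to need exactly one counter, and the pinned-event error state
and flexible-event multiplexing are deliberately left out. It only shows
that a per-process pinned kvm event loses to a host per-cpu pinned event,
wins over host flexible events, and that counts are lost while the kvm
event is inactive.

```python
from dataclasses import dataclass
from enum import IntEnum


class Prio(IntEnum):
    """The four event types from section 2; a lower value is a higher priority."""
    CPU_PINNED = 0
    TASK_PINNED = 1      # kvm perf events and vLBR events, per sections 3.2 and 4.2
    CPU_FLEXIBLE = 2
    TASK_FLEXIBLE = 3


@dataclass(frozen=True)
class Event:
    name: str
    prio: Prio


def active_events(events, num_counters):
    """Return the names of the events that get a counter (one counter each)."""
    ranked = sorted(events, key=lambda e: e.prio)  # stable sort keeps arrival order on ties
    return {e.name for e in ranked[:num_counters]}


def guest_view(timeline, num_counters, kvm_name="kvm"):
    """Replay (events, real_count) time slices and return (seen, lost).

    'seen' is what the guest's virtual counter accumulates while the kvm
    event is active; 'lost' is what occurred while it was inactive.
    """
    seen = lost = 0
    for events, real_count in timeline:
        if kvm_name in active_events(events, num_counters):
            seen += real_count
        else:
            lost += real_count
    return seen, lost


kvm = Event("kvm", Prio.TASK_PINNED)
host_pinned = Event("host-pinned", Prio.CPU_PINNED)
host_flex = Event("host-flex", Prio.CPU_FLEXIBLE)

# Cases 4 and 5 of section 3.2: the kvm event beats a host flexible event.
print(active_events([host_flex, kvm], num_counters=1))
# Cases 1 and 3: a host per-cpu pinned event beats the kvm event.
print(active_events([kvm, host_pinned], num_counters=1))
# Cases 2 and 3 over time: counts in the middle slice never reach the guest.
timeline = [([kvm], 100), ([kvm, host_pinned], 50), ([kvm], 25)]
print(guest_view(timeline, num_counters=1))
```

With a single counter, the middle slice of the timeline belongs to the
host per-cpu pinned event, so the guest's total misses that slice, which
is the silent inaccuracy the "vPMU Arch Gaps" section describes.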