> On 7/24/2023 6:41 PM, Xiong Zhang wrote:
> > Add a vPMU implementation and gap document to explain the vArch PMU
> > and vLBR implementation in KVM, especially the current gaps in
> > supporting host and guest perf event coexistence.
> >
> > Signed-off-by: Xiong Zhang <xiong.y.zhang@xxxxxxxxx>
> > ---
> >  Documentation/virt/kvm/x86/index.rst |   1 +
> >  Documentation/virt/kvm/x86/pmu.rst   | 249 +++++++++++++++++++++++++++
> >  2 files changed, 250 insertions(+)
> >  create mode 100644 Documentation/virt/kvm/x86/pmu.rst
> >
> > diff --git a/Documentation/virt/kvm/x86/index.rst b/Documentation/virt/kvm/x86/index.rst
> > index 9ece6b8dc817..02c1c7b01bf3 100644
> > --- a/Documentation/virt/kvm/x86/index.rst
> > +++ b/Documentation/virt/kvm/x86/index.rst
> > @@ -14,5 +14,6 @@ KVM for x86 systems
> >     mmu
> >     msr
> >     nested-vmx
> > +   pmu
> >     running-nested-guests
> >     timekeeping
> > diff --git a/Documentation/virt/kvm/x86/pmu.rst b/Documentation/virt/kvm/x86/pmu.rst
> > new file mode 100644
> > index 000000000000..e95e8c88e0e0
> > --- /dev/null
> > +++ b/Documentation/virt/kvm/x86/pmu.rst
> > @@ -0,0 +1,249 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +==========================
> > +PMU virtualization for X86
> > +==========================
> > +
> > +:Author: Xiong Zhang <xiong.y.zhang@xxxxxxxxx>
> > +:Copyright: (c) 2023, Intel. All rights reserved.
> > +
> > +.. Contents
> > +
> > +1. Overview
> > +2. Perf Scheduler
> > +3. Arch PMU virtualization
> > +4. LBR virtualization
> > +
> > +1. Overview
> > +===========
> > +
> > +KVM has supported PMU virtualization on x86 for many years and
> > +provides an MSR-based Arch PMU interface to the guest. The major
> > +features include Arch PMU v2, LBR and PEBS. Users profile
> > +performance in the guest the same way they do on the host.
> > +KVM is an ordinary perf subsystem user, like any other perf
> > +subsystem user. When the guest accesses the vPMU MSRs, KVM traps
> > +the access and creates a perf event for it.
> > +This perf event takes part in perf scheduling to request PMU
> > +resources and lets the guest use these resources.
> > +
> > +This document describes the x86 PMU virtualization architecture
> > +design and its open issues. It is organized as follows: the next
> > +section describes the Linux perf scheduler in more detail, as it
> > +plays a key role in the vPMU implementation and allocates PMU
> > +resources for guest usage. Then Arch PMU virtualization and LBR
> > +virtualization are introduced; each feature has sections covering
> > +the implementation overview and the expectations and gaps when
> > +host and guest perf events coexist.
> > +
> > +2. Perf Scheduler
> > +=================
> > +
> > +The perf scheduler's responsibility is to choose which events are
> > +active at a given moment and to bind counters to perf events. As a
> > +processor has a limited number of PMU counters and other resources,
> > +only a limited number of perf events can be active at any moment;
> > +an inactive perf event may become active at the next moment. The
> > +perf scheduler defines rules to control these things.
> > +
> > +Usually the following cases cause a perf event reschedule:
> > +
> > +1) A context switch from one task to a different task.
> > +2) An event is manually enabled.
> > +3) A call to perf_event_open() with the disabled field of the
> > +   perf_event_attr argument set to 0.
>
> And when the perf scheduler timer expires.

[Zhang, Xiong Y] Yes, when perf_mux_hrtimer expires, perf will
reschedule perf events. But I am hesitant about whether it should be
added here. perf_mux_hrtimer is used for flexible events when counter
multiplexing happens; it doesn't have much relationship with KVM's
pinned events. If perf_mux_hrtimer were added here, perf multiplexing
would have to be introduced as well. This perf scheduler section helps
the reader understand the kvm perf event; it isn't a full perf
scheduler doc. Besides perf_mux_hrtimer, more corner cases cause perf
event reschedules and are not listed here.
> > +
> > +When a perf event reschedule is needed on a specific CPU, perf
> > +sends an IPI to the target CPU, and the IPI handler activates
> > +events ordered by event type, iterating over all eligible events.
>
> IIUC, this is only true for the event create case, not for all the
> above reschedule cases.

[Zhang, Xiong Y] Yes, perf_event_open() and perf_event_enable() send
an IPI, but a task switch and perf_mux_hrtimer expiry don't. I will
modify this sentence.

> > +
> > +When a perf event is sched out, the counter mapped to this event
> > +is disabled, and the counter's settings and count value are saved.
> > +When a perf event is sched in, the perf driver assigns a counter
> > +to this event, and the counter's settings and count value are
> > +restored from the last saved state.
> > +
> > +Perf defines four types of events; their priorities, from high to
> > +low, are:
> > +
> > +a. Per-cpu pinned: the event should be measured on the specified
> > +   logical core whenever it is enabled.
> > +b. Per-process pinned: the event should be measured whenever it is
> > +   enabled and the process is running on any logical core.
> > +c. Per-cpu flexible: the event should be measured on the specified
> > +   logical core whenever it is enabled.
> > +d. Per-process flexible: the event should be measured whenever it
> > +   is enabled and the process is running on any logical core.
> > +
> > +If an event cannot be scheduled because no resource is available
> > +for it, a pinned event goes into the error state and is excluded
> > +from perf scheduling; the only way to recover it is to re-enable
> > +it. A flexible event goes into the inactive state and can be
> > +multiplexed with other events if needed.
>
> Maybe you can add some diagrams or list some key definitions/data
> structures/prototypes to facilitate readers to understand more about
> perf scheduling, since it's the key of the perf subsystem.

[Zhang, Xiong Y] I will try to add some diagrams.

> > +
> > +3. Arch PMU virtualization
> > +==========================
> > +
> > +3.1. Overview
> > +-------------
> > +
> > +Once KVM/QEMU exposes the vCPU's Arch PMU capability to the guest,
> > +the guest PMU driver accesses the Arch PMU MSRs (including the
> > +fixed and GP counters) as the host does. All guest Arch PMU MSR
> > +accesses are intercepted.
> > +
> > +When a guest virtual counter is enabled through a guest MSR write,
> > +the KVM trap creates a kvm perf event through the perf subsystem.
> > +The kvm perf event's attributes are derived from the guest virtual
> > +counter's MSR settings.
> > +
> > +When the guest later changes the virtual counter's settings, the
> > +KVM trap releases the old kvm perf event and creates a new kvm
> > +perf event with the new settings.
> > +
> > +When the guest reads the virtual counter's count, the KVM trap
> > +reads the kvm perf event's counter value and accumulates it onto
> > +the previous counter value.
> > +
> > +When the guest no longer accesses the virtual counter's MSRs
> > +within a scheduling time slice and the virtual counter is
> > +disabled, KVM releases the kvm perf event.
> > +
> > +        ----------------------------
> > +        |          Guest           |
> > +        |      perf subsystem      |
> > +        ----------------------------
> > +              |            ^
> > +         vMSR |            | vPMI
> > +              v            |
> > +        ----------------------------
> > +        |  vPMU    KVM vCPU        |
> > +        ----------------------------
> > +              |            ^
> > +         Call |            | Callbacks
> > +              v            |
> > +        ---------------------------
> > +        |    Host Linux Kernel    |
> > +        |     perf subsystem      |
> > +        ---------------------------
> > +              |            ^
> > +          MSR |            | PMI
> > +              v            |
> > +          --------------------
> > +          |    PMU CPU       |
> > +          --------------------
> > +
> > +Each guest virtual counter has a corresponding kvm perf event,
> > +and the kvm perf event takes part in host perf scheduling and
> > +complies with the host perf scheduler's rules. When the kvm perf
> > +event is scheduled by the host perf scheduler and is active, the
> > +guest virtual counter can supply the correct value.
> > +However, if another host perf event comes in and takes over the
> > +kvm perf event's resource, the kvm perf event becomes inactive,
> > +and the virtual counter then supplies a stale, meaningless value.
>
> IMHO, the data is still valid for a preempted event, as it's saved
> when the event is sched out. But it doesn't match the running task
> under profiling, and this is normal when perf preemption exists.

[Zhang, Xiong Y] The virtual counter supplies a saved value when it is
preempted. When preemption happens, perf_event->running_time stops,
but perf_event->enabling_time keeps increasing, so perf can still
produce an estimated counter value in the end. But host perf cannot
notify the guest virtual counter of this preemption and let guest perf
stop guest_perf_event->running_time, so the guest will get wrong data.

> > +
> > +3.2. Host and Guest perf event contention
> > +-----------------------------------------
> > +
> > +The kvm perf event is a per-process pinned event, so its priority
> > +is second highest. When the kvm perf event is active, it can be
> > +preempted by a host per-cpu pinned perf event, and it can preempt
> > +host flexible perf events. Such preemption can be temporarily
> > +prohibited by disabling host IRQs.
> > +
> > +The following results are expected when host and guest perf
> > +events coexist, according to the perf scheduler's rules:
> > +
> > +1). If host per-cpu pinned events occupy all the HW resources, the
> > +kvm perf event cannot become active because no resource is
> > +available, and the virtual counter value is always zero when the
> > +guest reads it.
> > +2). If a host per-cpu pinned event releases a HW resource while
> > +the kvm perf event is inactive, the kvm perf event can claim the
> > +HW resource and become active. The guest then gets correct values
> > +from the guest virtual counter while the kvm perf event is active,
> > +but the guest's total counter value is incorrect, since counts
> > +were lost while the kvm perf event was inactive.
> > +3). If the kvm perf event is active and a host per-cpu pinned perf
> > +event then becomes active and reclaims the kvm perf event's
> > +resource, the kvm perf event becomes inactive. The virtual counter
> > +value stays unchanged and holds the previously saved value when
> > +the guest reads it, so the guest's total counter value is
> > +incorrect.
> > +4). If host flexible perf events occupy all the HW resources, the
> > +kvm perf event can still become active by preempting a host
> > +flexible perf event's resource, and the guest gets correct values
> > +from the guest virtual counter.
> > +5). If the kvm perf event is active and other host flexible perf
> > +events then request to become active, the kvm perf event still
> > +owns the resource and stays active, so the guest gets correct
> > +values from the guest virtual counter.
> > +
> > +3.3. vPMU Arch Gaps
> > +-------------------
> > +
> > +The coexistence of host and guest perf events has gaps:
> > +
> > +1). When the guest accesses PMU MSRs for the first time, KVM traps
> > +the access and creates a kvm perf event, but this event may be
> > +inactive because of contention with host perf events. The guest
> > +does not notice this, and when it reads the virtual counter, the
> > +returned value is zero.
> > +2). When the kvm perf event is active, a host per-cpu pinned perf
> > +event can reclaim the kvm perf event's resource at any time once
> > +resource contention happens. The guest does not notice this
> > +either, and the guest's subsequent counter accesses get wrong
> > +data.
> > +
> > +So the mailing list had a discussion titled "Reconsider the
> > +current approach of vPMU":
> > +
> > +https://lore.kernel.org/lkml/810c3148-1791-de57-27c0-d1ac5ed35fb8@gmail.com/
> > +
> > +The major suggestion in this discussion is that the host pass some
> > +counters through to the guest, but this suggestion is not
> > +feasible, for the following reasons:
> > +
> > +a. The processor has several counters, but the counters are not
> > +equal; some events must be bound to a specific counter.
> > +b. If a special counter is passed through to the guest, the host
> > +cannot support such events and loses some capability.
> > +c. If a normal counter is passed through to the guest, the guest
> > +can support only general events, so the guest has limited
> > +capability.
> > +
> > +So both host and guest lose capability in pass-through mode.
> > +
> > +4. LBR Virtualization
> > +=====================
> > +
> > +4.1. Overview
> > +-------------
> > +
> > +Once KVM/QEMU exposes the vCPU's LBR capability to the guest, the
> > +guest LBR driver accesses the LBR MSRs (including IA32_DEBUGCTLMSR
> > +and the record MSRs) as the host does. The first guest access to
> > +the LBR-related MSRs is always intercepted. The KVM trap creates a
> > +vLBR perf event which enables callstack mode and to which none of
> > +the hardware counters is assigned. Host perf enables and schedules
> > +this event as usual.
> > +
> > +When the vLBR event is scheduled by the host perf scheduler and is
> > +active, the host LBR MSRs are owned by the guest and passed
> > +through to it, and the guest accesses them without a VM-Exit.
> > +However, if another host LBR event comes in and takes over the LBR
> > +facility, the vLBR event becomes inactive, and the guest's
> > +subsequent accesses to the LBR MSRs are trapped and meaningless.
>
> Is this true only when the host created a pinned LBR event?
> Otherwise, it won't preempt the guest vLBR.

[Zhang, Xiong Y] Yes, the host could create a per-cpu pinned LBR
event, like: perf record -b -a -e Instructions:D. Thanks.

> > +
> > +Like the kvm perf event, the vLBR event is released when the guest
> > +does not access the LBR-related MSRs within a scheduling time
> > +slice and has cleared the LBR enable bit; the pass-through state
> > +of the LBR MSRs is then canceled.
> > +
> > +4.2. Host and Guest LBR contention
> > +----------------------------------
> > +
> > +The vLBR event is a per-process pinned event, so its priority is
> > +second highest.
> > +The vLBR event contends for the LBR resource with other host LBR
> > +events according to the perf scheduler's rules: when the vLBR
> > +event is active, it can be preempted by a host per-cpu pinned LBR
> > +event, and it can preempt host flexible LBR events. Such
> > +preemption can be temporarily prohibited by disabling host IRQs,
> > +as the perf scheduler uses an IPI to change the LBR owner.
> > +
> > +The following results are expected when host and guest LBR events
> > +coexist:
> > +
> > +1). If a host per-cpu pinned LBR event is active when the VM
> > +starts, the guest vLBR event cannot preempt the LBR resource, so
> > +the guest cannot use LBR.
> > +2). If host flexible LBR events are active when the VM starts, the
> > +guest vLBR event can preempt the LBR, so the guest can use LBR.
> > +3). If a host per-cpu pinned LBR event becomes enabled while the
> > +guest vLBR event is active, the guest vLBR event loses the LBR and
> > +the guest cannot use LBR anymore.
> > +4). If a host flexible LBR event becomes enabled while the guest
> > +vLBR event is active, the guest vLBR event keeps the LBR, and the
> > +guest can still use LBR.
> > +5). If a host per-cpu pinned LBR event becomes inactive while the
> > +guest vLBR event is inactive, the guest vLBR event can become
> > +active and own the LBR, so the guest can use LBR.
>
> Anyway, the vLBR problems are still induced by perf scheduling
> priorities. If you can clearly state the current gaps of the vPMU,
> it's also clear for the vLBR issue, so this section could be
> omitted.

> > +
> > +4.3. vLBR Arch Gaps
> > +-------------------
> > +
> > +Like the vPMU Arch gaps, the vLBR event can be preempted by a host
> > +per-cpu pinned event at any time, or the vLBR event may be
> > +inactive at creation. The guest cannot notice this, so the guest
> > +will get meaningless values while the vLBR event is inactive.