RE: [PATCH] Documentation: KVM: Add vPMU implementation and gap document

> On 7/24/2023 6:41 PM, Xiong Zhang wrote:
> > Add a vPMU implementation and gap document to explain vArch PMU and
> > vLBR implementation in kvm, especially the current gap to support host
> > and guest perf event coexist.
> >
> > Signed-off-by: Xiong Zhang <xiong.y.zhang@xxxxxxxxx>
> > ---
> >   Documentation/virt/kvm/x86/index.rst |   1 +
> >   Documentation/virt/kvm/x86/pmu.rst   | 249 +++++++++++++++
> >   2 files changed, 250 insertions(+)
> >   create mode 100644 Documentation/virt/kvm/x86/pmu.rst
> >
> > diff --git a/Documentation/virt/kvm/x86/index.rst
> > b/Documentation/virt/kvm/x86/index.rst
> > index 9ece6b8dc817..02c1c7b01bf3 100644
> > --- a/Documentation/virt/kvm/x86/index.rst
> > +++ b/Documentation/virt/kvm/x86/index.rst
> > @@ -14,5 +14,6 @@ KVM for x86 systems
> >      mmu
> >      msr
> >      nested-vmx
> > +   pmu
> >      running-nested-guests
> >      timekeeping
> > diff --git a/Documentation/virt/kvm/x86/pmu.rst
> > b/Documentation/virt/kvm/x86/pmu.rst
> > new file mode 100644
> > index 000000000000..e95e8c88e0e0
> > --- /dev/null
> > +++ b/Documentation/virt/kvm/x86/pmu.rst
> > @@ -0,0 +1,249 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +
> > +==========================
> > +PMU virtualization for X86
> > +==========================
> > +
> > +:Author: Xiong Zhang <xiong.y.zhang@xxxxxxxxx>
> > +:Copyright: (c) 2023, Intel.  All rights reserved.
> > +
> > +.. Contents
> > +
> > +1. Overview
> > +2. Perf Scheduler
> > +3. Arch PMU virtualization
> > +4. LBR virtualization
> > +
> > +1. Overview
> > +===========
> > +
> > +KVM has supported PMU virtualization on x86 for many years and
> > +provides an MSR-based Arch PMU interface to the guest. The major
> > +features include Arch PMU v2, LBR and PEBS. Users can profile
> > +performance in the guest the same way as on the host.
> > +KVM is an ordinary perf subsystem user, just like other perf
> > +subsystem users. When the guest accesses vPMU MSRs, KVM traps the
> > +access and creates a perf event for it. This perf event takes part
> > +in perf scheduling to request PMU resources and lets the guest use
> > +those resources.
> > +
> > +This document describes the X86 PMU virtualization architecture
> > +design and its open issues. It is organized as follows: the next
> > +section describes the Linux perf scheduler in more detail, as it
> > +plays a key role in the vPMU implementation and allocates PMU
> > +resources for guest usage. Then Arch PMU virtualization and LBR
> > +virtualization are introduced; each feature has sections covering
> > +the implementation overview and the expectations and gaps when host
> > +and guest perf events coexist.
> > +
> > +2. Perf Scheduler
> > +=================
> > +
> > +The perf scheduler's responsibility is to choose which events are
> > +active at a given moment and to bind counters to perf events. As a
> > +processor has a limited number of PMU counters and other resources,
> > +only a limited number of perf events can be active at once; an
> > +inactive perf event may become active at the next moment. The perf
> > +scheduler defines rules to control this.
> > +
> > +Usually the following cases cause a perf event reschedule:
> > +1) A context switch from one task to a different task.
> > +2) An event is manually enabled.
> > +3) A call to perf_event_open() with the disabled field of the
> > +perf_event_attr argument set to 0.
> 
> And when perf scheduler timer expires.
[Zhang, Xiong Y] yes, when perf_mux_hrtimer expires, perf will reschedule perf
events. But I hesitate over whether it should be added or not. perf_mux_hrtimer
is used for flexible events when counter multiplexing happens; it doesn't have
much relationship with kvm pinned events. If perf_mux_hrtimer is added here,
perf multiplexing should be introduced as well. This perf scheduler section
helps readers understand kvm perf events; it isn't a full perf scheduler doc.
Besides perf_mux_hrtimer, more corner cases cause perf event reschedules and
are not listed here.
> 
> > +
> > +When a perf event reschedule is needed on a specific cpu, perf
> > +sends an IPI to the target cpu; the IPI handler activates events
> > +ordered by event type, iterating over all eligible events.
> 
> IIUC, this is only true for the event create case, not for all above reschedule cases.
[Zhang, Xiong Y] yes, perf_event_open() and perf_event_enable() send an IPI,
but task switch and perf_mux_hrtimer won't send an IPI; I will modify this
sentence.
> 
> > +
> > +When a perf event is scheduled out, the counter mapped to this
> > +event is disabled, and the counter's settings and count value are
> > +saved. When a perf event is scheduled in, the perf driver assigns a
> > +counter to this event, and the counter's settings and count value
> > +are restored from the last save.
> > +
> > +Perf defines four types of events; their priority from high to low
> > +is:
> > +a. Per-cpu pinned: the event should be measured on the specified
> > +logical core whenever it is enabled.
> > +b. Per-process pinned: the event should be measured whenever it is
> > +enabled and the process is running on any logical core.
> > +c. Per-cpu flexible: the event is measured on the specified logical
> > +core whenever it is enabled, as long as resources are available.
> > +d. Per-process flexible: the event is measured whenever it is
> > +enabled and the process is running on any logical core, as long as
> > +resources are available.
> > +
> > +If an event cannot be scheduled because no resource is available
> > +for it, a pinned event goes into the error state and is excluded
> > +from the perf scheduler; the only way to recover it is to re-enable
> > +it. A flexible event goes into the inactive state and can be
> > +multiplexed with other events if needed.
> 
> Maybe you can add some diagrams or list some key definitions/data
> structures/prototypes
> 
> to facilitate readers to understand more about perf schedule since it's the key of
> perf subsystem.
[Zhang, Xiong Y] I will try to add some diagrams. 
> 
> > +
> > +3. Arch PMU virtualization
> > +==========================
> > +
> > +3.1. Overview
> > +-------------
> > +
> > +Once KVM/QEMU exposes a vcpu's Arch PMU capability to the guest,
> > +the guest PMU driver accesses the Arch PMU MSRs (including the
> > +Fixed and GP counters) just as the host does. All guest Arch PMU
> > +MSR accesses are intercepted.
> > +
> > +When a guest virtual counter is enabled through a guest MSR write,
> > +the KVM trap creates a kvm perf event through the perf subsystem.
> > +The kvm perf event's attributes are derived from the guest virtual
> > +counter's MSR settings.
> > +
> > +When the guest later changes the virtual counter's settings, the
> > +KVM trap releases the old kvm perf event and creates a new kvm perf
> > +event with the new settings.
> > +
> > +When the guest reads the virtual counter's count, the KVM trap
> > +reads the kvm perf event's counter value and accumulates it into
> > +the previously saved counter value.
> > +
> > +When the guest no longer accesses the virtual counter's MSRs within
> > +a scheduling time slice and the virtual counter is disabled, KVM
> > +releases the kvm perf event.
> > +  ----------------------------
> > +  |  Guest                   |
> > +  |  perf subsystem          |
> > +  ----------------------------
> > +       |            ^
> > +  vMSR |            | vPMI
> > +       v            |
> > +  ----------------------------
> > +  |  vPMU        KVM vCPU    |
> > +  ----------------------------
> > +        |          ^
> > +  Call  |          | Callbacks
> > +        v          |
> > +  ---------------------------
> > +  | Host Linux Kernel       |
> > +  | perf subsystem          |
> > +  ---------------------------
> > +               |       ^
> > +           MSR |       | PMI
> > +               v       |
> > +         --------------------
> > +         | PMU        CPU   |
> > +         --------------------
> > +
> > +Each guest virtual counter has a corresponding kvm perf event, and
> > +the kvm perf event joins the host perf scheduler and complies with
> > +the host perf scheduler's rules. When the kvm perf event is
> > +scheduled by the host perf scheduler and is active, the guest
> > +virtual counter supplies the correct value. However, if another
> > +host perf event comes in and takes over the kvm perf event's
> > +resource, the kvm perf event becomes inactive, and the virtual
> > +counter then supplies a wrong and meaningless value.
> 
> IMHO, the data is still valid for a preempted event, as it's saved when the
> event is sched_out. But it doesn't match the running task under profiling,
> and this is normal when perf preemption exists.
[Zhang, Xiong Y] the virtual counter supplies a saved value when it is
preempted. When preemption happens, perf_event->running_time stops, but
perf_event->enabling_time continues to increase, so perf can eventually
produce an estimated counter value. But host perf can't notify the guest
virtual counter of this preemption and let guest perf stop
guest_perf_event->running_time, so the guest will get wrong data.
> 
> > +
> > +3.2. Host and Guest perf event contention
> > +-----------------------------------------
> > +
> > +The kvm perf event is a per-process pinned event; its priority is
> > +second highest. When the kvm perf event is active, it can be
> > +preempted by a host per-cpu pinned perf event, or it can preempt
> > +host flexible perf events. Such preemption can be temporarily
> > +prohibited by disabling host IRQs.
> > +
> > +The following results are expected when host and guest perf events
> > +coexist, according to the perf scheduler rules:
> > +1) If host per-cpu pinned events occupy all the HW resources, the
> > +kvm perf event cannot become active as no resource is available,
> > +and the virtual counter value is always zero when the guest reads
> > +it.
> > +2) If a host per-cpu pinned event releases a HW resource while the
> > +kvm perf event is inactive, the kvm perf event can claim the HW
> > +resource and switch to active; the guest then gets correct values
> > +from the guest virtual counter while the kvm perf event is active,
> > +but the guest's total counter value is not correct, since counts
> > +were lost while the kvm perf event was inactive.
> > +3) If the kvm perf event is active and then a host per-cpu pinned
> > +perf event becomes active and reclaims the kvm perf event's
> > +resource, the kvm perf event becomes inactive. The virtual counter
> > +value then stays unchanged at the previously saved value when the
> > +guest reads it, so the guest's total counter is not correct.
> > +4) If host flexible perf events occupy all the HW resources, the
> > +kvm perf event can become active by preempting a host flexible perf
> > +event's resource, and the guest gets the correct value from the
> > +guest virtual counter.
> > +5) If the kvm perf event is active and other host flexible perf
> > +events then request to become active, the kvm perf event still owns
> > +the resource and stays active, so the guest gets the correct value
> > +from the guest virtual counter.
> > +
> > +3.3. vPMU Arch Gaps
> > +-------------------
> > +
> > +The coexistence of host and guest perf events has gaps:
> > +1) When the guest accesses PMU MSRs for the first time, KVM traps
> > +the access and creates a kvm perf event, but this event may be
> > +inactive because of contention with host perf events. The guest
> > +does not notice this, and when the guest reads the virtual counter,
> > +the returned value is zero.
> > +2) When the kvm perf event is active, a host per-cpu pinned perf
> > +event can reclaim the kvm perf event's resource at any time once
> > +resource contention happens. The guest does not notice this either,
> > +and the guest's subsequent counter accesses get wrong data.
> > +So the mailing list has had some discussion titled "Reconsider the
> > +current approach of vPMU":
> > +
> > +https://lore.kernel.org/lkml/810c3148-1791-de57-27c0-d1ac5ed35fb8@gmail.com/
> > +
> > +The major suggestion in this discussion is for the host to pass
> > +through some counters to the guest, but this suggestion is not
> > +feasible, for these reasons:
> > +a. The processor has several counters, but the counters are not
> > +equal; some events must be bound to a specific counter.
> > +b. If a special counter is passed through to the guest, the host
> > +cannot support such events and loses some capability.
> > +c. If a normal counter is passed through to the guest, the guest
> > +can support general events only, and the guest has limited
> > +capability.
> > +So both host and guest lose capability in pass-through mode.
> > +
> > +4. LBR Virtualization
> > +=====================
> > +
> > +4.1. Overview
> > +-------------
> > +
> > +Once KVM/QEMU exposes a vcpu's LBR capability to the guest, the
> > +guest LBR driver accesses the LBR MSRs (including IA32_DEBUGCTLMSR
> > +and the record MSRs) as the host does. The first guest access to
> > +LBR-related MSRs is always intercepted. The KVM trap creates a vLBR
> > +perf event which enables callstack mode and to which none of the
> > +hardware counters are assigned. Host perf enables and schedules
> > +this event as usual.
> > +
> > +When the vLBR event is scheduled by the host perf scheduler and is
> > +active, the host LBR MSRs are owned by the guest and are passed
> > +through to the guest, which accesses them without a VM-Exit.
> > +However, if another host LBR event comes in and takes over the LBR
> > +facility, the vLBR event becomes inactive, and the guest's
> > +subsequent accesses to the LBR MSRs are trapped and meaningless.
> 
> Is this true only when the host created a pinned LBR event? Otherwise, it
> won't preempt the guest vLBR.
[Zhang, Xiong Y] yes, the host could create a per-cpu pinned LBR event, like
perf record -b -a -e Instructions:D

thanks
> 
> 
> > +
> > +Like the kvm perf event, the vLBR event is released when the guest
> > +doesn't access LBR-related MSRs within a scheduling time slice and
> > +the guest has cleared the LBR enable bit; the pass-through state of
> > +the LBR MSRs is then canceled.
> > +
> > +4.2. Host and Guest LBR contention
> > +----------------------------------
> > +
> > +The vLBR event is a per-process pinned event; its priority is
> > +second highest. The vLBR event contends for the LBR resource
> > +together with other host LBR events. According to the perf
> > +scheduler rules, when the vLBR event is active, it can be preempted
> > +by a host per-cpu pinned LBR event, or it can preempt a host
> > +flexible LBR event. Such preemption can be temporarily prohibited
> > +by disabling host IRQs, as the perf scheduler uses an IPI to change
> > +the LBR owner.
> > +
> > +The following results are expected when host and guest LBR events
> > +coexist:
> > +1) If a host per-cpu pinned LBR event is active when the vm starts,
> > +the guest vLBR event cannot preempt the LBR resource, so the guest
> > +cannot use LBR.
> > +2) If host flexible LBR events are active when the vm starts, the
> > +guest vLBR event can preempt the LBR, so the guest can use LBR.
> > +3) If a host per-cpu pinned LBR event becomes enabled while the
> > +guest vLBR event is active, the guest vLBR event loses the LBR and
> > +the guest cannot use LBR anymore.
> > +4) If a host flexible LBR event becomes enabled while the guest
> > +vLBR event is active, the guest vLBR event keeps the LBR and the
> > +guest can still use LBR.
> > +5) If a host per-cpu pinned LBR event becomes inactive while the
> > +guest vLBR event is inactive, the guest vLBR event can become
> > +active and own the LBR, so the guest can use LBR.
> 
> Anyway, vLBR problems are still induced by perf scheduling priorities; if
> you can clearly state the current gaps of vPMU, it's also clear for the
> vLBR issue, and then this section could be omitted.
> 
> > +
> > +4.3. vLBR Arch Gaps
> > +-------------------
> > +
> > +Like the vPMU Arch gaps: the vLBR event can be preempted by a host
> > +per-cpu pinned event at any time, or the vLBR event may be inactive
> > +at creation, but the guest cannot notice this, so the guest will
> > +get meaningless values while the vLBR event is inactive.



