[RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM

Xiong Zhang <xiong.y.zhang@xxxxxxxxxxxxxxx> · Fri, 26 Jan 2024 16:54:03 +0800

Background
===
KVM has supported vPMU for years as the emulated vPMU. In particular, KVM
presents a virtual PMU to guest where accesses to PMU get trapped and
converted into perf events. These perf events get scheduled along with
other perf events at the host level, sharing the HW resource. In the
emulated vPMU design, KVM is a client of the perf subsystem and has no
control of the HW PMU resource at host level.

This emulated vPMU has these drawbacks:
1. Poor performance, guest PMU MSR accessing has VM-exit and some has
expensive host perf API call. Once guest PMU is multiplexing its
counters, KVM will waste majority of time re-creating/starting/releasing
KVM perf events, then the guest perf performance is dropped dramatically.
2. Guest perf events's backend may be swapped out or disabled silently.
This is because host perf scheduler treats KVM perf events and other host
perf events equally, they will contend HW resources. KVM perf events will
be inactive when all HW resources have been owned by host perf events.
But KVM can not notify this backend error into guest, this slient error
is a red flag for vPMU as a production.
3. Hard to add new vPMU features. For each vPMU new feature, KVM needs
to emulate new MSRs, this involves perf and kvm two subsystems, mostly
the vendor specific perf API is added and is hard to accept.

The community has discussed these drawbacks for years and reconsidered
current emulated vPMU [1]. In latest discussion [2], both Perf and KVM
x86 community agreed to try a passthrough vPMU. So we co-work with google
engineers to develop this RFC, currently we implement it on Intel CPU
only, and can add other arch's implementation later.
Complete RFC source code can be found in below link:
https://github.com/googleprodkernel/linux-kvm/tree/passthrough-pmu-rfc

Under passthrough vPMU, VM direct access to all HW PMU general purpose
counters and some of the fixed counters, VM has transparency of x86 PMU
HW. All host perf events using x86 PMU are stopped during VM running, and
are restarted at VM-exit. This has the following benefits:
1. Better performance, when guest access x86 PMU MSRs and rdpmc, no VM-exit
and no host perf API call.
2. Guest perf events exclusively own HW resource during guest running. Host
perf events are stopped and give up HW resource at VM-entry, and restart 
runnging after VM-exit.
3. Easier to enable PMU new features. KVM just needs to passthrough new
MSRs and save/restore them at VM-exit and VM-entry, no need to add
perf API.

Note, passthrough vPMU does satisfy the enterprise-level requirement of
secure usage for PMU by intercepting guest access to all event selectors.
But the key problem of passthrough vPMU is that host user loses the
capability to profile guest. If any users want to profile guest from the
host, they should not enable passthrough vPMU mode. Another problem is
the NMI watchdog is not fully functional anymore. Please see design opens
for more details.

Implementation
===
To passthrough host x86 PMU into guest, PMU context switch is mandatory,
this RFC implements this PMU context switch at VM-entry/exit boundary.

At VM-entry:
1. KVM call perf supplied perf_guest_enter() interface, perf stops all
the perf events which use host x86 PMU.
2. KVM call perf supplied perf_guest_switch_to_kvm_pmi_vector() interface,
perf switch PMI vector to a separate kvm_pmi_vector, so that KVM handles
PMI after this point and KVM injects HW PMI into guest.
3. KVM restores guest PMU context.

In order to support KVM PMU filter feature for security, EVENT_SELECT and
FIXED_CTR_CTRL MSRs are intercepted, all other MSRs defined in Architectural
Performance Monitoring spec and rdpmc are passthrough, so guest can access
them without VM exit during guest running, when guest counter overflow
happens, HW PMI is triggered with dedicated kvm_pmi_vector, KVM injects a
virtual PMI into guest through virtual local apic.

At VM-exit:
1. KVM saves and clears guest PMU context.
2. KVM call perf supplied perf_guest_switch_to_host_pmi_vector() interface,
perf switch PMI vector to host NMI, so that host handles PMI after this
point.
3. KVM call perf supplied perf_guest_exit() interface, perf resched all
the perf events, these events stopped at VM-entry will be re-started here.

Design Opens
===
we met some design opens during this POC and seek supporting from
community:

1. host system wide / QEMU events handling during VM running
   At VM-entry, all the host perf events which use host x86 PMU will be
   stopped. These events with attr.exclude_guest = 1 will be stopped here
   and re-started after vm-exit. These events without attr.exclude_guest=1
   will be in error state, and they cannot recovery into active state even
   if the guest stops running. This impacts host perf a lot and request
   host system wide perf events have attr.exclude_guest=1.

   This requests QEMU Process's perf event with attr.exclude_guest=1 also.

   During VM running, perf event creation for system wide and QEMU
   process without attr.exclude_guest=1 fail with -EBUSY. 

2. NMI watchdog
   the perf event for NMI watchdog is a system wide cpu pinned event, it
   will be stopped also during vm running, but it doesn't have
   attr.exclude_guest=1, we add it in this RFC. But this still means NMI
   watchdog loses function during VM running.

   Two candidates exist for replacing perf event of NMI watchdog:
   a. Buddy hardlock detector[3] may be not reliable to replace perf event.
   b. HPET-based hardlock detector [4] isn't in the upstream kernel.

3. Dedicated kvm_pmi_vector
   In emulated vPMU, host PMI handler notify KVM to inject a virtual
   PMI into guest when physical PMI belongs to guest counter. If the
   same mechanism is used in passthrough vPMU and PMI skid exists
   which cause physical PMI belonging to guest happens after VM-exit,
   then the host PMI handler couldn't identify this PMI belongs to
   host or guest.
   So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest
   has this vector only. The PMI belonging to host still has an NMI
   vector.

   Without considering PMI skid especially for AMD, the host NMI vector
   could be used for guest PMI also, this method is simpler and doesn't
   need x86 subsystem to reserve the dedicated kvm_pmi_vector, and we
   didn't meet the skid PMI issue on modern Intel processors.

4. per-VM passthrough mode configuration
   Current RFC uses a KVM module enable_passthrough_pmu RO parameter,
   it decides vPMU is passthrough mode or emulated mode at kvm module
   load time.
   Do we need the capability of per-VM passthrough mode configuration?
   So an admin can launch some non-passthrough VM and profile these
   non-passthrough VMs in host, but admin still cannot profile all
   the VMs once passthrough VM existence. This means passthrough vPMU
   and emulated vPMU mix on one platform, it has challenges to implement.
   As the commit message in commit 0011, the main challenge is 
   passthrough vPMU and emulated vPMU have different vPMU features, this
   ends up with two different values for kvm_cap.supported_perf_cap, which
   is initialized at module load time. To support it, more refactor is
   needed.

Commits construction
===
0000 ~ 0003: Perf extends exclude_guest to stop perf events during
             guest running.
0004 ~ 0009: Perf interface for dedicated kvm_pmi_vector.
0010 ~ 0032: all passthrough vPMU with PMU context switch at
             VM-entry/exit boundary.
0033 ~ 0037: Intercept EVENT_SELECT and FIXED_CTR_CTRL MSRs for
             KVM PMU filter feature.
0038 ~ 0039: Add emulated instructions to guest counter.
0040 ~ 0041: Fixes for passthrough vPMU live migration and Nested VM.

Performance Data
===
Measure method:
First step: guest run workload without perf, and get basic workload score.
Second step: guest run workload with perf commands, and get perf workload
             score.
Third step: perf overhead to workload is gotten from (first-second)/first.
Finally: compare perf overhead between emulated vPMU and passthrough vPMU.

Workload: Specint-2017
HW platform: Sapphire rapids, 1 socket, 56 cores, no-SMT
Perf command:
a. basic-sampling: perf record -F 1000 -e 6-instructions  -a --overwrite
b. multiplex-sampling: perf record -F 1000 -e 10-instructions -a --overwrite

Guest performance overhead:
---------------------------------------------------------------------------
| Test case          | emulated vPMU | all passthrough | passthrough with |
|                    |               |                 | event filters    |
---------------------------------------------------------------------------
| basic-sampling     |   33.62%      |    4.24%        |   6.21%          |
---------------------------------------------------------------------------
| multiplex-sampling |   79.32%      |    7.34%        |   10.45%         |
---------------------------------------------------------------------------
Note: here "passthrough with event filters" means KVM intercepts EVENT_SELECT
and FIXED_CTR_CTRL MSRs to support KVM PMU filter feature for security, this
is current RFC implementation. In order to collect EVENT_SELECT interception
impact, we modified RFC source to passthrough all the MSRs into guest, this
is "all passthrough" in above table.

Conclusion:
1. passthrough vPMU has much better performance than emulated vPMU.
2. Intercept EVENT_SELECT and FIXED_CTR_CTRL MSRs cause 2% overhead.
3. As PMU context switch happens at VM-exit/entry, the more VM-exit,
the more vPMU overhead. This does not only impacts perf, but it also
impacts other benchmarks which have massive VM-exit like fio. We will
optimize this at the second phase of passthrough vPMU.

Remain Works
===
1. To reduce passthrough vPMU overhead, optimize the PMU context switch.
2. Add more PMU features like LBR, PEBS, perf metrics.
3. vPMU live migration.

Reference
===
1. https://lore.kernel.org/lkml/2db2ebbe-e552-b974-fc77-870d958465ba@xxxxxxxxx/
2. https://lkml.kernel.org/kvm/ZRRl6y1GL-7RM63x@xxxxxxxxxx/
3. https://lwn.net/Articles/932497/
4. https://lwn.net/Articles/924927/

Dapeng Mi (4):
  x86: Introduce MSR_CORE_PERF_GLOBAL_STATUS_SET for passthrough PMU
  KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU
  KVM: x86/pmu: Introduce macro PMU_CAP_PERF_METRICS
  KVM: x86/pmu: Clear PERF_METRICS MSR for guest

Kan Liang (2):
  perf: x86/intel: Support PERF_PMU_CAP_VPMU_PASSTHROUGH
  perf: Support guest enter/exit interfaces

Mingwei Zhang (22):
  perf: core/x86: Forbid PMI handler when guest own PMU
  perf: core/x86: Plumb passthrough PMU capability from x86_pmu to
    x86_pmu_cap
  KVM: x86/pmu: Introduce enable_passthrough_pmu module parameter and
    propage to KVM instance
  KVM: x86/pmu: Plumb through passthrough PMU to vcpu for Intel CPUs
  KVM: x86/pmu: Add a helper to check if passthrough PMU is enabled
  KVM: x86/pmu: Allow RDPMC pass through
  KVM: x86/pmu: Create a function prototype to disable MSR interception
  KVM: x86/pmu: Implement pmu function for Intel CPU to disable MSR
    interception
  KVM: x86/pmu: Intercept full-width GP counter MSRs by checking with
    perf capabilities
  KVM: x86/pmu: Whitelist PMU MSRs for passthrough PMU
  KVM: x86/pmu: Introduce PMU operation prototypes for save/restore PMU
    context
  KVM: x86/pmu: Introduce function prototype for Intel CPU to
    save/restore PMU context
  KVM: x86/pmu: Zero out unexposed Counters/Selectors to avoid
    information leakage
  KVM: x86/pmu: Add host_perf_cap field in kvm_caps to record host PMU
    capability
  KVM: x86/pmu: Exclude existing vLBR logic from the passthrough PMU
  KVM: x86/pmu: Make check_pmu_event_filter() an exported function
  KVM: x86/pmu: Allow writing to event selector for GP counters if event
    is allowed
  KVM: x86/pmu: Allow writing to fixed counter selector if counter is
    exposed
  KVM: x86/pmu: Introduce PMU helper to increment counter
  KVM: x86/pmu: Implement emulated counter increment for passthrough PMU
  KVM: x86/pmu: Separate passthrough PMU logic in set/get_msr() from
    non-passthrough vPMU
  KVM: nVMX: Add nested virtualization support for passthrough PMU

Xiong Zhang (13):
  perf: Set exclude_guest onto nmi_watchdog
  perf: core/x86: Add support to register a new vector for PMI handling
  KVM: x86/pmu: Register PMI handler for passthrough PMU
  perf: x86: Add function to switch PMI handler
  perf/x86: Add interface to reflect virtual LVTPC_MASK bit onto HW
  KVM: x86/pmu: Add get virtual LVTPC_MASK bit function
  KVM: x86/pmu: Manage MSR interception for IA32_PERF_GLOBAL_CTRL
  KVM: x86/pmu: Switch IA32_PERF_GLOBAL_CTRL at VM boundary
  KVM: x86/pmu: Switch PMI handler at KVM context switch boundary
  KVM: x86/pmu: Call perf_guest_enter() at PMU context switch
  KVM: x86/pmu: Add support for PMU context switch at VM-exit/enter
  KVM: x86/pmu: Intercept EVENT_SELECT MSR
  KVM: x86/pmu: Intercept FIXED_CTR_CTRL MSR

 arch/x86/events/core.c                   |  38 +++++
 arch/x86/events/intel/core.c             |   8 +
 arch/x86/events/perf_event.h             |   1 +
 arch/x86/include/asm/hardirq.h           |   1 +
 arch/x86/include/asm/idtentry.h          |   1 +
 arch/x86/include/asm/irq.h               |   1 +
 arch/x86/include/asm/irq_vectors.h       |   2 +-
 arch/x86/include/asm/kvm-x86-pmu-ops.h   |   3 +
 arch/x86/include/asm/kvm_host.h          |   8 +
 arch/x86/include/asm/msr-index.h         |   1 +
 arch/x86/include/asm/perf_event.h        |   4 +
 arch/x86/include/asm/vmx.h               |   1 +
 arch/x86/kernel/idt.c                    |   1 +
 arch/x86/kernel/irq.c                    |  29 ++++
 arch/x86/kvm/cpuid.c                     |   4 +
 arch/x86/kvm/lapic.h                     |   5 +
 arch/x86/kvm/pmu.c                       | 102 ++++++++++++-
 arch/x86/kvm/pmu.h                       |  37 ++++-
 arch/x86/kvm/vmx/capabilities.h          |   1 +
 arch/x86/kvm/vmx/nested.c                |  52 +++++++
 arch/x86/kvm/vmx/pmu_intel.c             | 186 +++++++++++++++++++++--
 arch/x86/kvm/vmx/vmx.c                   | 176 +++++++++++++++++----
 arch/x86/kvm/vmx/vmx.h                   |   3 +-
 arch/x86/kvm/x86.c                       |  37 ++++-
 arch/x86/kvm/x86.h                       |   2 +
 include/linux/perf_event.h               |  11 ++
 kernel/events/core.c                     | 179 ++++++++++++++++++++++
 kernel/watchdog_perf.c                   |   1 +
 tools/arch/x86/include/asm/irq_vectors.h |   1 +
 29 files changed, 852 insertions(+), 44 deletions(-)

base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86
-- 
2.34.1