Re: [PATCH] KVM: x86/pmu: Introduce pmc->is_paused to reduce the call time of perf interfaces

Like Xu <like.xu.linux@xxxxxxxxx> · Thu, 29 Jul 2021 21:46:05 +0800

On 29/7/2021 8:58 pm, Peter Zijlstra wrote:
On Wed, Jul 28, 2021 at 08:07:05PM +0800, Like Xu wrote:
From: Like Xu <likexu@xxxxxxxxxxx>

Based on our observations, any vm-exit associated with the vPMU is followed
by at least two perf interface calls for guest counter emulation, such as
perf_event_{pause, read_value, period}(), and each one will {lock, unlock}
the same perf_event_ctx. These calls become even more frequent when the
guest uses counters in a multiplexed manner.

Holding a lock once and completing the KVM request operations in the perf
context would introduce a set of impractical new interfaces. So we can
further optimize the vPMU implementation by avoiding repeated calls to
these interfaces in the KVM context for at least one pattern:

After we call perf_event_pause() once, the event will be disabled and its
internal count will be reset to 0. So there is no need to pause it again
or read its value. Once the event is paused, the event period will not be
updated until the next time it's resumed or reprogrammed. And there is
also no need to call perf_event_period() twice for a non-running counter,
considering the perf_event for a running counter is never paused.
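
The pattern above can be sketched in plain userspace C. This is only an
illustration under stated assumptions, not the patch itself: struct
mock_event, struct mock_pmc and the mock_event_*() helpers are hypothetical
stand-ins for the perf_event, the KVM pmc and the perf interfaces, and
ctx_locks merely counts how often a call would have taken the
perf_event_ctx lock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a perf_event; not the kernel structure. */
struct mock_event {
	uint64_t count;         /* events counted since the last reset */
	unsigned int ctx_locks; /* times a call would have locked the ctx */
};

/* Hypothetical stand-in for the KVM pmc, reduced to what the sketch needs. */
struct mock_pmc {
	uint64_t counter;       /* value accumulated on the KVM side */
	bool is_paused;         /* set once the backing event is paused */
	struct mock_event *event;
};

/* Models a pause with reset: returns the count and resets it to 0. */
static uint64_t mock_event_pause(struct mock_event *e)
{
	uint64_t value;

	e->ctx_locks++;
	value = e->count;
	e->count = 0;
	return value;
}

/* Models a read of the current count; it also takes the ctx lock. */
static uint64_t mock_event_read(struct mock_event *e)
{
	e->ctx_locks++;
	return e->count;
}

/* Pause at most once; a second call returns without touching the event. */
static void pmc_pause(struct mock_pmc *pmc)
{
	if (!pmc->event || pmc->is_paused)
		return;

	pmc->counter += mock_event_pause(pmc->event);
	pmc->is_paused = true;
}

/* A paused event has a count of 0, so skip the read and its locking. */
static uint64_t pmc_read(struct mock_pmc *pmc)
{
	uint64_t counter = pmc->counter;

	if (pmc->event && !pmc->is_paused)
		counter += mock_event_read(pmc->event);
	return counter;
}
```

In this sketch, calling pmc_pause() twice followed by pmc_read() touches
the mock event only once; without the is_paused check it would be touched
three times.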

Based on this implementation, for the following common usage of
sampling 4 events using perf on a 4u8g guest:

   echo 0 > /proc/sys/kernel/watchdog
   echo 25 > /proc/sys/kernel/perf_cpu_time_max_percent
   echo 10000 > /proc/sys/kernel/perf_event_max_sample_rate
   echo 0 > /proc/sys/kernel/perf_cpu_time_max_percent
   for i in `seq 1 1 10`
   do
   taskset -c 0 perf record \
   -e cpu-cycles -e instructions -e branch-instructions -e cache-misses \
   /root/br_instr a
   done

the average latency of the guest NMI handler is reduced from
37646.7 ns to 32929.3 ns (~1.14x speed up) on the Intel ICX server.
Also, in addition to collecting more samples, no loss of sampling
accuracy was observed compared to before the optimization.

Signed-off-by: Like Xu <likexu@xxxxxxxxxxx>

Looks sane I suppose.

Acked-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>

What kinds of VM-exits are the most common?

A typical vm-exit trace is as follows:

 146820 EXTERNAL_INTERRUPT
 126301 MSR_WRITE
  17009 MSR_READ
   9710 RDPMC
   7295 EXCEPTION_NMI
   2493 EPT_VIOLATION
   1357 EPT_MISCONFIG
    567 CPUID
    107 NMI_WINDOW
     59 IO_INSTRUCTION
      2 VMCALL

including the following kvm_msr trace:

  15822 msr_write, MSR_CORE_PERF_GLOBAL_CTRL
  14558 msr_read, MSR_CORE_PERF_GLOBAL_STATUS
   7315 msr_write, IA32_X2APIC_LVT_PMI
   7250 msr_write, MSR_CORE_PERF_GLOBAL_OVF_CTRL
   2922 msr_write, MSR_IA32_PMC0
   2912 msr_write, MSR_CORE_PERF_FIXED_CTR0
   2904 msr_write, MSR_CORE_PERF_FIXED_CTR1
   2390 msr_write, MSR_CORE_PERF_FIXED_CTR_CTRL
   2390 msr_read, MSR_CORE_PERF_FIXED_CTR_CTRL
   1195 msr_write, MSR_P6_EVNTSEL1
   1195 msr_write, MSR_P6_EVNTSEL0
    976 msr_write, MSR_IA32_PMC1
    618 msr_write, IA32_X2APIC_ICR

Due to the large number of MSR accesses, the latency of the guest PMI
handler is still far from that of the host handler.
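
To put a number on that, the share of MSR exits can be worked out from the
vm-exit trace above. A small sketch in C follows; the table simply copies
the counts from the trace, and struct exit_count and the helper names are
made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* One row of the vm-exit trace: exit reason and number of exits. */
struct exit_count {
	const char *reason;
	unsigned long count;
};

/* Counts copied from the vm-exit trace in this mail. */
static const struct exit_count exits[] = {
	{ "EXTERNAL_INTERRUPT", 146820 },
	{ "MSR_WRITE",          126301 },
	{ "MSR_READ",            17009 },
	{ "RDPMC",                9710 },
	{ "EXCEPTION_NMI",        7295 },
	{ "EPT_VIOLATION",        2493 },
	{ "EPT_MISCONFIG",        1357 },
	{ "CPUID",                 567 },
	{ "NMI_WINDOW",            107 },
	{ "IO_INSTRUCTION",         59 },
	{ "VMCALL",                  2 },
};

#define NR_EXITS (sizeof(exits) / sizeof(exits[0]))

/* Sum of all exits in the trace. */
static unsigned long total_exits(void)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < NR_EXITS; i++)
		sum += exits[i].count;
	return sum;
}

/* Sum of the exits whose reason starts with "MSR_". */
static unsigned long msr_exits(void)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < NR_EXITS; i++)
		if (strncmp(exits[i].reason, "MSR_", 4) == 0)
			sum += exits[i].count;
	return sum;
}

/* MSR share of all exits in permille, rounded down. */
static unsigned long msr_share_permille(void)
{
	return msr_exits() * 1000 / total_exits();
}
```

For this trace, MSR_WRITE plus MSR_READ account for 143310 of the 311720
exits, i.e. roughly 46%.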

I have two rough ideas that could drastically reduce the vPMU overhead
in a third round of optimization:

- Add a new paravirtualized PMU guest driver that saves the latest values
of all MSRs to a static physical memory area for each vCPU, which reduces
the number of vm-exits and also facilitates KVM emulation access;

- Bypass the host perf_event PMI callback injection path and inject the
guest PMI directly after the EXCEPTION_N/PMI vm-exit (for TDX guests).

Any negative comments or help with additional details are welcome.