On 2024-04-26 2:41 p.m., Mingwei Zhang wrote:
>>> So this requires vcpu->pmu to carry two pieces of state information:
>>> 1) a flag similar to TIF_NEED_FPU_LOAD; 2) host perf context info
>>> (phase #1: just a boolean; phase #2: a bitmap of occupied counters).
>>>
>>> This is a non-trivial optimization of the PMU context switch. I am
>>> thinking about splitting it into the following phases:
>>>
>>> 1) lazy PMU context switch, i.e., wait until the guest touches a PMU
>>> MSR for the 1st time.
>>> 2) fast PMU context switch on the KVM side, i.e., KVM checks the event
>>> selector values (enabled/disabled) and selectively switches PMU state
>>> (reducing rd/wr MSR accesses).
>>> 3) dynamic PMU context switch boundary, i.e., KVM can dynamically
>>> choose the PMU context switch boundary depending on existing active
>>> host-level events.
>>> 3.1) more accurate dynamic PMU context switch, i.e., KVM checks the
>>> host-level counter positions and further reduces the number of MSR
>>> accesses.
>>> 4) guest PMU context preemption, i.e., any new host-level perf
>>> profiling can immediately preempt the guest PMU in the vcpu loop
>>> (instead of waiting for the next PMU context switch in KVM).

>> I'm not quite sure about 4.
>> The new host-level perf event must be an exclude_guest event. It should
>> not be scheduled while a guest is using the PMU. Why do we want to
>> preempt the guest PMU? The current implementation in perf doesn't
>> schedule any exclude_guest events while a guest is running.

> right. The grey area is the code within the KVM_RUN loop, but
> _outside_ of the guest. This part of the code is on the "host" side.
> However, for efficiency reasons, KVM defers the PMU context switch by
> retaining the guest PMU MSR values within the loop.

I assume you mean the optimization of moving the context switch from the
VM-exit/entry boundary to the vCPU boundary.

> Optimization 4
> allows the host side to immediately profile this part instead of
> waiting for the vcpu to reach the PMU context switch locations. Doing
> so will generate more accurate results.

If so, I think 4 is a must-have. Otherwise, it wouldn't honor the
definition of exclude_guest. Without 4, it leaves some random blind
spots, right?

>
> Do we want to preempt that? I think it depends. For regular cloud
> usage, we don't. But for any other usages where we want to prioritize
> KVM/VMM profiling over guest vPMU, it is useful.
>
> My current opinion is that optimization 4 is something nice to have.
> But we should allow people to turn it off, just like we can choose to
> disable kernel preemption.

exclude_guest means everything but the guest. I don't see a reason why
people would want to turn it off and get some random blind spots.

Thanks,
Kan
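
For reference, a minimal sketch of the per-vCPU state and the phase-1
lazy switch discussed above. This is illustrative only, not taken from
any posted patch; every name below is hypothetical except
TIF_NEED_FPU_LOAD, the existing kernel flag used as an analogy.

/*
 * Hypothetical sketch of the state Mingwei describes: a deferred-load
 * flag analogous to TIF_NEED_FPU_LOAD, plus host perf context info
 * (phase #1: a boolean; phase #2: a bitmap of occupied counters).
 */
#include <linux/types.h>

struct vcpu_pmu_switch_state {
	bool guest_state_loaded;        /* lazy-load flag (cf. TIF_NEED_FPU_LOAD) */
	bool host_ctx_in_use;           /* phase #1: any active host perf events? */
	unsigned long host_counter_map; /* phase #2: counters occupied by the host */
};

/*
 * Phase 1 (lazy switch): defer loading guest PMU MSRs until the guest
 * actually touches a PMU MSR for the first time after entering KVM_RUN.
 */
static void pmu_lazy_load_guest(struct vcpu_pmu_switch_state *s)
{
	if (s->guest_state_loaded)
		return;
	/* saving host counters / restoring guest counters would go here */
	s->guest_state_loaded = true;
}

The occupied-counter bitmap is what would let phases 2 and 3.1 skip the
rd/wr of MSRs for counters neither the host nor the guest is using.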