On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>
>> On 2024-04-25 12:24 a.m., Mingwei Zhang wrote:
>>> On Wed, Apr 24, 2024 at 8:56 PM Mi, Dapeng <dapeng1.mi@xxxxxxxxxxxxxxx> wrote:
>>>>
>>>> On 4/24/2024 11:00 PM, Sean Christopherson wrote:
>>>>> On Wed, Apr 24, 2024, Dapeng Mi wrote:
>>>>>> On 4/24/2024 1:02 AM, Mingwei Zhang wrote:
>>>>>>>>> Maybe (just maybe) it is possible to do the PMU context switch at
>>>>>>>>> the vcpu boundary normally, but do it at the VM Enter/Exit
>>>>>>>>> boundary when the host is profiling the KVM kernel module. So,
>>>>>>>>> dynamically adjusting the PMU context switch location could be an
>>>>>>>>> option.
>>>>>>>> If two VMs both have the PMU enabled but the host PMU is not in
>>>>>>>> use, the PMU context switch should be done in the vcpu thread
>>>>>>>> sched-out path.
>>>>>>>>
>>>>>>>> If the host PMU is also in use, we can choose whether the PMU
>>>>>>>> switch should be done in the VM-exit path or the vcpu thread
>>>>>>>> sched-out path.
>>>>>>>>
>>>>>>> The host PMU is always enabled, i.e., Linux currently does not
>>>>>>> support the KVM PMU running standalone. I guess what you mean is
>>>>>>> that there are no active perf_events on the host side. Allowing the
>>>>>>> PMU context switch to drift from the vm-enter/exit boundary to the
>>>>>>> vcpu loop boundary based on checking host-side events might be a
>>>>>>> good option. We can keep discussing it, but I won't propose it in
>>>>>>> v2.
>>>>>> I doubt this deferring is really doable. It still makes the host
>>>>>> lose most of its ability to profile KVM. Per my understanding, most
>>>>>> of the KVM overhead happens in the vcpu loop, specifically in
>>>>>> VM-exit handling. We have no idea when the host wants to create a
>>>>>> perf event to profile KVM; it could be at any time.
>>>>> No, the idea is that KVM will load host PMU state asap, but only when
>>>>> host PMU state actually needs to be loaded, i.e. only when there are
>>>>> relevant host events.
>>>>>
>>>>> If there are no host perf events, KVM keeps guest PMU state loaded
>>>>> for the entire KVM_RUN loop, i.e. provides optimal behavior for the
>>>>> guest. But if a host perf event exists (or comes along), KVM context
>>>>> switches the PMU at VM-Enter/VM-Exit, i.e. lets the host profile
>>>>> almost all of KVM, at the cost of a degraded experience for the guest
>>>>> while host perf events are active.
>>>>
>>>> I see. So KVM needs to provide a callback which is called from the
>>>> IPI handler. The KVM callback needs to be called to switch the PMU
>>>> state before perf actually enables the host event and touches the PMU
>>>> MSRs. And only perf events with the exclude_guest attribute are
>>>> allowed to be created on the host. Thanks.
>>>
>>> Do we really need a KVM callback? I think that is one option.
>>>
>>> Immediately after VMEXIT, KVM will check whether there are "host perf
>>> events". If so, do the PMU context switch immediately. Otherwise, keep
>>> deferring the context switch to the end of the vCPU loop.
>>>
>>> Detecting whether there are "host perf events" would be interesting.
>>> The "host perf events" refer to the perf_events on the host that are
>>> active, assigned HW counters, and saved when context switching to the
>>> guest PMU. I think getting those events could be done by fetching the
>>> bitmaps in cpuc.
>>
>> The cpuc is an arch-specific structure. I don't think it can be
>> accessed from generic code. You would have to implement arch-specific
>> functions to fetch the bitmaps. It's probably not worth it.
>>
>> You may check the pinned_groups and flexible_groups to understand
>> whether there are host perf events which may be scheduled at VM-exit.
>> But they will not tell you the counter indices, which are only known
>> once the host event is actually scheduled.
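For illustration, a minimal sketch of the kind of arch hook being
discussed, with all names hypothetical (on x86 the real state lives in
the per-CPU struct cpu_hw_events, which generic code cannot see):

#include <linux/percpu.h>

/*
 * Illustrative sketch only, not existing kernel code: let generic code
 * ask the arch PMU driver which hardware counters currently back
 * active host events.
 */
static DEFINE_PER_CPU(unsigned long, host_active_ctr_mask);

/*
 * The arch PMU code would update this whenever host events are
 * scheduled in or out.
 */
void arch_pmu_set_active_ctr_mask(unsigned long mask)
{
	this_cpu_write(host_active_ctr_mask, mask);
}

/* Generic-side query: bit N set means HW counter N backs a host event. */
unsigned long arch_pmu_get_active_ctr_mask(void)
{
	return this_cpu_read(host_active_ctr_mask);
}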
>>
>>> I have to look into the details. But at the time of VMEXIT, KVM
>>> should already have that information, so it can immediately decide
>>> whether to do the PMU context switch or not.
>>>
>>> Oh, but when control is executing within the run loop and host-level
>>> profiling starts, say 'perf record -a ...', it will generate an IPI
>>> to all CPUs. Maybe that's when we need a callback so the KVM guest
>>> PMU context gets preempted for the host-level profiling. Gah..
>>>
>>> Hmm, not a fan of that. That means the host can poke the guest PMU
>>> context at any time and cause higher overhead. But I admit it is much
>>> better than the current approach.
>>>
>>> The only thing is that any command like 'perf record/stat -a' shot in
>>> dark corners of the host can preempt the guest PMUs of _all_ running
>>> VMs. So, to alleviate that, maybe a module parameter that disables
>>> this "preemption" is possible? That should fit scenarios where we
>>> don't want the guest PMU to be preempted outside of the vCPU loop.
>>>
>>
>> It should not happen. In the current implementation, perf rejects all
>> !exclude_guest system-wide event creation if a guest with the vPMU is
>> running.
>> However, it's possible to create an exclude_guest system-wide event at
>> any time. KVM cannot use information from VM-entry to decide whether
>> there will be active perf events at VM-exit.
>
> Hmm, why not? If there is any exclude_guest system-wide event,
> perf_guest_enter() can return something to tell KVM "hey, some active
> host events are swapped out; they were originally in counters #2 and
> #3". If so, at the time when perf_guest_enter() returns, KVM will ack
> that and keep it in its pmu data structure.

I think it's possible that someone creates a !exclude_guest event after
the perf_guest_enter(). The stale information would then be saved in
KVM. Perf will schedule the event at the next perf_guest_exit(), and
KVM will not know about it.

> Now, when context switching back to the host right at VMEXIT, KVM
> will check this data and see whether the host perf context has
> something active (of course, they are all exclude_guest events). If
> not, defer the context switch to the vcpu boundary. Otherwise, do the
> proper PMU context switch, respecting the occupied counter positions
> on the host side, i.e., avoid doubling the work on the KVM side.
>
> Kan, any suggestion on the above approach?

I think we can only know the accurate event list at perf_guest_exit().
You may check the pinned_groups and flexible_groups, which tell you
whether there are candidate events.

> Totally understand that there might be some difficulty, since the
> perf subsystem works in several layers and obviously fetching the
> low-level mapping is arch-specific work. If that is difficult, we can
> split the work into two phases: 1) phase #1, just ask perf to tell
> KVM whether there are active exclude_guest events swapped out;
> 2) phase #2, ask perf to tell us their (low-level) counter indices.

If you want an accurate counter mask, changes in the arch-specific
code are required. Two phases sound good to me.

Besides the perf changes, I think KVM should also track which counters
need to be saved/restored. That information can be obtained from the
EventSel interception.
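A rough sketch of that tracking, with hypothetical names throughout
(only the ARCH_PERFMON_EVENTSEL_ENABLE bit is a real x86 definition):
KVM would mark a counter as in use when an intercepted guest EventSel
write sets the enable bit, and only counters in the resulting bitmap
would need save/restore on a PMU context switch.

#include <linux/bitops.h>
#include <linux/types.h>
#include <asm/perf_event.h>

#define VPMU_MAX_GP_COUNTERS	8	/* illustrative limit */

struct vpmu_sketch {
	u64		eventsel[VPMU_MAX_GP_COUNTERS];
	unsigned long	in_use;	/* bit i set: counter i is programmed */
};

/* Called from the (hypothetical) EventSel MSR write intercept. */
static void vpmu_track_eventsel_write(struct vpmu_sketch *pmu,
				      unsigned int idx, u64 data)
{
	pmu->eventsel[idx] = data;

	/* A counter only needs save/restore if its enable bit is set. */
	if (data & ARCH_PERFMON_EVENTSEL_ENABLE)
		__set_bit(idx, &pmu->in_use);
	else
		__clear_bit(idx, &pmu->in_use);
}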
Thanks,
Kan

>>
>> The perf_guest_exit() will reload the host state. It's impossible to
>> save the guest state after that. We may need a KVM callback, so that
>> perf can tell KVM whether to save the guest state before perf reloads
>> the host state.
>>
>> Thanks,
>> Kan
>>>>
>>>>> My original sketch: https://lore.kernel.org/all/ZR3eNtP5IVAHeFNC@google.com
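A sketch of the callback shape discussed above, with all names
hypothetical (no such perf interface exists today): perf would invoke
the hook from its perf_guest_exit() path before reloading host PMU
state, giving KVM a chance to save the still-loaded guest state first.

/*
 * Hypothetical interface sketch, not an existing perf API: a hook that
 * a perf_guest_exit()-like path would invoke before reloading host PMU
 * state, so KVM can save the guest PMU state while it is still live.
 */
struct perf_guest_switch_cb {
	/* Called while the hardware PMU still holds guest state. */
	void (*save_guest_pmu)(void);
};

static struct perf_guest_switch_cb *guest_switch_cb;

/* KVM would register its callback at module init (illustrative only). */
void perf_register_guest_switch_cb(struct perf_guest_switch_cb *cb)
{
	guest_switch_cb = cb;
}

/* Inside a perf_guest_exit()-like path, before touching host state: */
static void perf_guest_exit_sketch(void)
{
	if (guest_switch_cb && guest_switch_cb->save_guest_pmu)
		guest_switch_cb->save_guest_pmu();

	/* ... perf now reloads the host PMU state ... */
}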