Re: [RFC PATCH 23/41] KVM: x86/pmu: Implement the save/restore of PMU state for Intel CPU

"Mi, Dapeng" <dapeng1.mi@xxxxxxxxxxxxxxx> · Wed, 17 Jul 2024 11:41:41 +0800

On 4/26/2024 11:12 AM, Mingwei Zhang wrote:
> On Thu, Apr 25, 2024 at 6:46 PM Mi, Dapeng <dapeng1.mi@xxxxxxxxxxxxxxx> wrote:
>>
>> On 4/26/2024 5:46 AM, Sean Christopherson wrote:
>>> On Thu, Apr 25, 2024, Kan Liang wrote:
>>>> On 2024-04-25 4:16 p.m., Mingwei Zhang wrote:
>>>>> On Thu, Apr 25, 2024 at 9:13 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>>>>>> It should not happen. For the current implementation, perf rejects all
>>>>>> the !exclude_guest system-wide event creation if a guest with the vPMU
>>>>>> is running.
>>>>>> However, it's possible to create an exclude_guest system-wide event at
>>>>>> any time. KVM cannot use the information from the VM-entry to decide if
>>>>>> there will be active perf events in the VM-exit.
>>>>> Hmm, why not? If there is any exclude_guest system-wide event,
>>>>> perf_guest_enter() can return something to tell KVM "hey, some active
>>>>> host events are swapped out. they are originally in counter #2 and
>>>>> #3". If so, at the time when perf_guest_enter() returns, KVM will ack
>>>>> that and keep it in its pmu data structure.
>>>> I think it's possible that someone creates !exclude_guest event after
>>> I assume you mean an exclude_guest=1 event?  Because perf should be in a state
>>> where it rejects exclude_guest=0 events.
>> Suppose should be exclude_guest=1 event, the perf event without
>> exclude_guest attribute would be blocked to create in the v2 patches
>> which we are working on.
>>
>>
>>>> the perf_guest_enter(). The stale information is saved in the KVM. Perf
>>>> will schedule the event in the next perf_guest_exit(). KVM will not know it.
>>> Ya, the creation of an event on a CPU that currently has guest PMU state loaded
>>> is what I had in mind when I suggested a callback in my sketch:
>>>
>>>   :  D. Add a perf callback that is invoked from IRQ context when perf wants to
>>>   :     configure a new PMU-based events, *before* actually programming the MSRs,
>>>   :     and have KVM's callback put the guest PMU state
>>
>> when host creates a perf event with exclude_guest attribute which is
>> used to profile KVM/VMM user space, the vCPU process could work at three
>> places.
>>
>> 1. in guest state (non-root mode)
>>
>> 2. inside vcpu-loop
>>
>> 3. outside vcpu-loop
>>
>> Since the PMU state has already been switched to host state, we don't
>> need to consider the case 3 and only care about cases 1 and 2.
>>
>> when host creates a perf event with exclude_guest attribute to profile
>> KVM/VMM user space,  an IPI is triggered to enable the perf event
>> eventually like the following code shows.
>>
>> event_function_call(event, __perf_event_enable, NULL);
>>
>> For case 1,  a vm-exit is triggered and KVM starts to process the
>> vm-exit and then run IPI irq handler, exactly speaking
>> __perf_event_enable() to enable the perf event.
>>
>> For case 2, the IPI irq handler would preempt the vcpu-loop and call
>> __perf_event_enable() to enable the perf event.
>>
>> So IMO KVM just needs to provide a callback to switch guest/host PMU
>> state, and __perf_event_enable() calls this callback before really
>> touching PMU MSRs.
> ok, in this case, do we still need KVM to query perf if there are
> active exclude_guest events? yes? Because there is an ordering issue.
> The above suggests that the host-level perf profiling comes when a VM
> is already running, there is an IPI that can invoke the callback and
> trigger preemption. In this case, KVM should switch the context from
> guest to host. What if it is the other way around, ie., host-level
> profiling runs first and then VM runs?
>
> In this case, just before entering the vcpu loop, kvm should check
> whether there is an active host event and save that into a pmu data
> structure. If none, do the context switch early (so that KVM saves a
> huge amount of unnecessary PMU context switches in the future).
> Otherwise, keep the host PMU context until vm-enter. At the time of
> vm-exit, do the check again using the data stored in pmu structure. If
> there is an active event do the context switch to the host PMU,
> otherwise defer that until exiting the vcpu loop. Of course, in the
> meantime, if there is any perf profiling started causing the IPI, the
> irq handler calls the callback, preempting the guest PMU context. If
> that happens, at the time of exiting the vcpu boundary, PMU context
> switch is skipped since it is already done. Of course, note that the
> irq could come at any time, so the PMU context switch in all 4
> locations need to check the state flag (and skip the context switch if
> needed).
>
> So this requires vcpu->pmu has two pieces of state information: 1) the
> flag similar to TIF_NEED_FPU_LOAD; 2) host perf context info (phase #1
> just a boolean; phase #2, bitmap of occupied counters).
>
> This is a non-trivial optimization on the PMU context switch. I am
> thinking about splitting them into the following phases:
>
> 1) lazy PMU context switch, i.e., wait until the guest touches PMU MSR
> for the 1st time.
> 2) fast PMU context switch on KVM side, i.e., KVM checking event
> selector value (enable/disable) and selectively switch PMU state
> (reducing rd/wr msrs)
> 3) dynamic PMU context boundary, ie., KVM can dynamically choose PMU
> context switch boundary depending on existing active host-level
> events.
> 3.1) more accurate dynamic PMU context switch, ie., KVM checking
> host-level counter position and further reduces the number of msr
> accesses.
> 4) guest PMU context preemption, i.e., any new host-level perf
> profiling can immediately preempt the guest PMU in the vcpu loop
> (instead of waiting for the next PMU context switch in KVM).

Sorry to bring this discussion up again. I'd like to share the latest
progress on it.

I have implemented draft code for optimization steps 1 and 2 and done
some basic tests, but found that the optimization could introduce some
security risks.

The initial proposal is simple: a counter bitmap 'pmc_bitmap64' is
introduced to dynamically record which PMU counters the guest is using.
Only the counters the guest uses are saved/cleared at VM-Exit and restored
at VM-Entry. This reduces the overhead of reading/writing MSRs.
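
To make the idea concrete, here is a minimal user-space sketch of the
bitmap-driven save/restore. It is not the draft patch: the struct, the
sim_* helpers, the counter count and the array standing in for the counter
MSRs are all made up for illustration; only the name 'pmc_bitmap64' comes
from the proposal.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_COUNTERS 8	/* assumed counter count, illustration only */

/* Hypothetical model: hw[] stands in for the physical counter MSRs. */
struct sim_pmu {
	uint64_t hw[NUM_COUNTERS];	/* simulated hardware counters */
	uint64_t guest[NUM_COUNTERS];	/* saved guest counter values */
	uint64_t pmc_bitmap64;		/* bit i set: guest uses counter i */
	unsigned int msr_reads;		/* simulated rdmsr count */
	unsigned int msr_writes;	/* simulated wrmsr count */
};

/* Record that the guest started using counter idx. */
void sim_mark_guest_counter(struct sim_pmu *pmu, unsigned int idx)
{
	assert(idx < NUM_COUNTERS);
	pmu->pmc_bitmap64 |= 1ULL << idx;
}

/* VM-Exit: save and clear only the counters the guest uses. */
void sim_save_guest_pmu(struct sim_pmu *pmu)
{
	for (unsigned int i = 0; i < NUM_COUNTERS; i++) {
		if (!(pmu->pmc_bitmap64 & (1ULL << i)))
			continue;
		pmu->guest[i] = pmu->hw[i];	/* models rdmsr */
		pmu->msr_reads++;
		pmu->hw[i] = 0;			/* models wrmsr (clear) */
		pmu->msr_writes++;
	}
}

/* VM-Entry: restore only the counters the guest uses. */
void sim_restore_guest_pmu(struct sim_pmu *pmu)
{
	for (unsigned int i = 0; i < NUM_COUNTERS; i++) {
		if (!(pmu->pmc_bitmap64 & (1ULL << i)))
			continue;
		pmu->hw[i] = pmu->guest[i];	/* models wrmsr */
		pmu->msr_writes++;
	}
}
```

With two of eight counters marked, one exit/entry round trip costs 2 reads
and 4 writes in this model instead of touching all eight counters. Note
that counters outside the bitmap keep whatever value they held before.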

But this introduces potential security risks, since only some of the MSRs
are saved and restored to guest values. A malicious guest could obtain
sensitive host information, or attack the host, by reading or writing
counters that the guest doesn't use but the host does.

To mitigate these security risks, our initial thought is to replace the
static PMU interception configuration done at vCPU creation with a dynamic,
per-counter interception configuration applied before VM-Entry. This
mitigates the risks from the rdmsr/wrmsr instructions, but it still can't
address the rdpmc instruction as long as rdpmc is set to passthrough mode.
A malicious guest can still read sensitive host information with rdpmc.
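
The gap can be shown with a small model. This is only a sketch under my
own assumptions: the struct and the sim_* helpers are hypothetical, and
rdpmc interception is modeled as a single per-vCPU switch (as with the VMX
"RDPMC exiting" control) while MSR interception is decided per counter
from 'pmc_bitmap64'.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_COUNTERS 8	/* assumed counter count, illustration only */

/* Hypothetical per-vCPU interception state. */
struct sim_vcpu_pmu {
	uint64_t pmc_bitmap64;		/* bit i set: guest uses counter i */
	bool rdpmc_passthrough;		/* one switch for all counters */
};

/* Dynamic MSR interception: pass through only guest-used counters. */
bool sim_msr_intercepted(const struct sim_vcpu_pmu *pmu, unsigned int idx)
{
	assert(idx < NUM_COUNTERS);
	return !(pmu->pmc_bitmap64 & (1ULL << idx));
}

/* True if a guest rdmsr on counter idx reaches the hardware value. */
bool sim_guest_rdmsr_sees_hw(const struct sim_vcpu_pmu *pmu, unsigned int idx)
{
	return !sim_msr_intercepted(pmu, idx);
}

/*
 * True if a guest rdpmc on counter idx reaches the hardware value.
 * The bitmap plays no role: passthrough exposes every counter.
 */
bool sim_guest_rdpmc_sees_hw(const struct sim_vcpu_pmu *pmu, unsigned int idx)
{
	assert(idx < NUM_COUNTERS);
	return pmu->rdpmc_passthrough;
}
```

In this model a counter that is not in the bitmap is protected against
rdmsr but stays readable through rdpmc until rdpmc passthrough is turned
off for the whole vCPU.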

Possible solutions could be:

a. disable rdpmc passthrough

    This closes the security hole, but intercepting rdpmc would bring
heavy performance overhead.

b. partially save/clear PMU MSRs at VM-Exit, but fully restore PMU MSRs at
VM-Entry

    This fixes the security issue as well, but I suppose the performance
overhead won't be reduced much, since wrmsr accounts for more of the
overhead.

c. Fall back to optimization 1 only: if the guest doesn't use any PMU
counters, no PMU MSRs are saved/restored and rdpmc is intercepted by
default.

    I doubt this is a realistic scenario; at least on Linux, nmi_watchdog
is enabled by default and it manipulates PMU counters.
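
For option b, a rough user-space model of the cost looks like this. Again
this is only a sketch: the struct, the sim_b_* helpers and the counter
count are hypothetical, an array stands in for the counter MSRs, and
counters the guest never used are assumed to hold 0 in the saved guest
state.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_COUNTERS 8	/* assumed counter count, illustration only */

/* Hypothetical model: hw[] stands in for the physical counter MSRs. */
struct sim_pmu_b {
	uint64_t hw[NUM_COUNTERS];	/* simulated hardware counters */
	uint64_t guest[NUM_COUNTERS];	/* saved guest values, 0 if unused */
	uint64_t pmc_bitmap64;		/* bit i set: guest uses counter i */
	unsigned int msr_reads;		/* simulated rdmsr count */
	unsigned int msr_writes;	/* simulated wrmsr count */
};

/* VM-Entry: full restore, so no stale host value stays in any counter. */
void sim_b_vmentry(struct sim_pmu_b *pmu)
{
	for (unsigned int i = 0; i < NUM_COUNTERS; i++) {
		pmu->hw[i] = pmu->guest[i];	/* models wrmsr */
		pmu->msr_writes++;
	}
}

/* VM-Exit: partial save/clear, only for guest-used counters. */
void sim_b_vmexit(struct sim_pmu_b *pmu)
{
	for (unsigned int i = 0; i < NUM_COUNTERS; i++) {
		if (!(pmu->pmc_bitmap64 & (1ULL << i)))
			continue;
		pmu->guest[i] = pmu->hw[i];	/* models rdmsr */
		pmu->msr_reads++;
		pmu->hw[i] = 0;			/* models wrmsr (clear) */
		pmu->msr_writes++;
	}
}
```

With one guest-used counter out of eight, a round trip in this model still
issues 8 writes at VM-Entry plus 1 read and 1 write at VM-Exit, so the
saving is on the read side only.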

>
> Thanks.
> -Mingwei
>>> It's a similar idea to TIF_NEED_FPU_LOAD, just that instead of a common chunk of
>>> kernel code swapping out the guest state (kernel_fpu_begin()), it's a callback
>>> into KVM.