Re: [RFC PATCH 00/41] KVM: x86/pmu: Introduce passthrough vPM

"Zhang, Xiong Y" <xiong.y.zhang@xxxxxxxxxxxxxxx> · Fri, 12 Apr 2024 10:19:55 +0800



On 4/12/2024 1:03 AM, Sean Christopherson wrote:
> <bikeshed>
> 
> I think we should call this a mediated PMU, not a passthrough PMU.  KVM still
> emulates the control plane (controls and event selectors), while the data is
> fully passed through (counters).
> 
> </bikeshed>
> 
> On Fri, Jan 26, 2024, Xiong Zhang wrote:
> 
>> 1. host system wide / QEMU events handling during VM running
>>    At VM-entry, all the host perf events which use the host x86 PMU will be
>>    stopped. Events with attr.exclude_guest = 1 will be stopped here
>>    and restarted after VM-exit. Events without attr.exclude_guest=1
>>    will be put into an error state, and they cannot recover to the active
>>    state even if the guest stops running. This impacts host perf a lot and
>>    requires host system-wide perf events to have attr.exclude_guest=1.
>>
>>    This also requires the QEMU process's perf events to have
>>    attr.exclude_guest=1.
>>
>>    While a VM is running, creating a system-wide or QEMU-process perf
>>    event without attr.exclude_guest=1 fails with -EBUSY.
>>
>> 2. NMI watchdog
>>    The perf event for the NMI watchdog is a system-wide, CPU-pinned event.
>>    It will also be stopped while a VM is running, but it doesn't have
>>    attr.exclude_guest=1, so we add it in this RFC. This still means the NMI
>>    watchdog loses function while a VM is running.
>>
>>    Two candidates exist for replacing the NMI watchdog's perf event:
>>    a. Buddy hardlock detector [3] may not be reliable enough to replace it.
>>    b. HPET-based hardlock detector [4] isn't in the upstream kernel.
> 
> I think the simplest solution is to allow mediated PMU usage if and only if
> the NMI watchdog is disabled.  Then whether or not the host replaces the NMI
> watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
> problem to solve.
Makes sense. KVM should not affect the host's high-priority work.
The NMI watchdog is a client of perf and is a system-wide perf event. perf can't tell whether a system-wide perf event is the NMI watchdog or something else, so how about we extend this suggestion to all system-wide perf events?
The mediated PMU would be allowed only when all system-wide perf events are disabled or non-existent at VM creation.
But the NMI watchdog is usually enabled, so this will limit mediated PMU usage.
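
The check proposed above could look roughly like the sketch below. This is only an illustration under stated assumptions: struct host_event and mediated_pmu_allowed() are hypothetical names, not existing perf or KVM interfaces, and a real implementation would have to query perf core rather than walk an array.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified view of a host perf event.
 * This is NOT the kernel's struct perf_event.
 */
struct host_event {
	bool system_wide;	/* CPU-bound event, not attached to a task */
	bool enabled;		/* event is currently enabled */
};

/*
 * Return true if a mediated-PMU VM may be created: every system-wide
 * event must be disabled, or there must be no system-wide event at all.
 * An enabled NMI watchdog event is indistinguishable from any other
 * system-wide event here, so it blocks VM creation as well.
 */
bool mediated_pmu_allowed(const struct host_event *events, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (events[i].system_wide && events[i].enabled)
			return false;
	}
	return true;
}
```

With this rule, a host that keeps the NMI watchdog enabled can never create a mediated-PMU VM, which is the usage limitation mentioned above.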
> 
>> 3. Dedicated kvm_pmi_vector
>>    In the emulated vPMU, the host PMI handler notifies KVM to inject a
>>    virtual PMI into the guest when a physical PMI belongs to a guest
>>    counter. If the same mechanism is used in the passthrough vPMU and PMI
>>    skid causes a physical PMI belonging to the guest to arrive after
>>    VM-exit, then the host PMI handler cannot tell whether this PMI belongs
>>    to the host or the guest.
>>    So this RFC uses a dedicated kvm_pmi_vector, PMI belonging to guest
>>    has this vector only. The PMI belonging to host still has an NMI
>>    vector.
>>
>>    Without considering PMI skid especially for AMD, the host NMI vector
>>    could be used for guest PMI also, this method is simpler and doesn't
> 
> I don't see how multiplexing NMIs between guest and host is simpler.  At best,
> the complexity is a wash, just in different locations, and I highly doubt it's
> a wash.  AFAIK, there is no way to precisely know that an NMI came in via the
> LVTPC.
When kvm_intel.pt_mode=PT_MODE_HOST_GUEST, the guest PT's PMI is an NMI multiplexed between guest and host. We could extend the guest PT PMI framework to the mediated PMU, so I think this is simpler.
> 
> E.g. if an IPI NMI arrives before the host's PMU is loaded, confusion may ensue.
> SVM has the luxury of running with GIF=0, but that simply isn't an option on VMX.
> 
>>    need x86 subsystem to reserve the dedicated kvm_pmi_vector, and we
>>    didn't meet the skid PMI issue on modern Intel processors.
>>
>> 4. per-VM passthrough mode configuration
>>    Current RFC uses a KVM module enable_passthrough_pmu RO parameter,
>>    it decides vPMU is passthrough mode or emulated mode at kvm module
>>    load time.
>>    Do we need the capability of per-VM passthrough mode configuration?
>>    Then an admin could launch some non-passthrough VMs and profile these
>>    non-passthrough VMs from the host, but the admin still could not
>>    profile all the VMs once a passthrough VM exists. This means
>>    passthrough vPMU and emulated vPMU would mix on one platform, which is
>>    challenging to implement. As described in the commit message of commit
>>    0011, the main challenge is that passthrough vPMU and emulated vPMU
>>    have different vPMU features, which ends up with two different values
>>    for kvm_cap.supported_perf_cap, which is initialized at module load
>>    time. Supporting it needs more refactoring.
> 
> I have no objection to an all-or-nothing setup.  I'd honestly love to rip out the
> existing vPMU support entirely, but that's probably not realistic, at least not
> in the near future.
> 
>> Remaining Work
>> ===
>> 1. To reduce passthrough vPMU overhead, optimize the PMU context switch.
> 
> Before this gets out of its "RFC" phase, I would at least like line of sight to
> a more optimized switch.  I 100% agree that starting with a conservative
> implementation is the way to go, and the kernel absolutely needs to be able to
> profile KVM itself (and everything KVM calls into), i.e. _always_ keeping the
> guest PMU loaded for the entirety of KVM_RUN isn't a viable option.
> 
> But I also don't want to get into a situation where we can't figure out a clean,
> robust way to do the optimized context switch without needing (another) massive
> rewrite.
> 
The current PMU context switch happens at each VM-entry/exit, which impacts guest performance even if the guest doesn't use the PMU. As our first optimization, we will switch the PMU context only when the guest actually uses the PMU.
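
A minimal sketch of that first optimization follows. The names are hypothetical, not code from this series, and it assumes "the guest uses the PMU" can be approximated by any enable bit being set in the guest's global control value. The actual MSR save/restore is left out and only marked by comments.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU PMU state; not the kernel's struct kvm_pmu. */
struct vcpu_pmu {
	uint64_t guest_global_ctrl;	/* guest's view of the global enable bits */
	bool guest_ctx_loaded;		/* guest PMU state is live in hardware */
	unsigned long nr_switches;	/* number of context switches performed */
};

/* Assumption: any global enable bit set means the guest uses the PMU. */
static bool guest_uses_pmu(const struct vcpu_pmu *pmu)
{
	return pmu->guest_global_ctrl != 0;
}

/* Called before VM-entry: skip the switch for a guest with an idle PMU. */
void pmu_on_vm_entry(struct vcpu_pmu *pmu)
{
	if (!guest_uses_pmu(pmu))
		return;
	/* Placeholder: save host PMU state, load guest PMU state. */
	pmu->guest_ctx_loaded = true;
	pmu->nr_switches++;
}

/* Called after VM-exit: restore the host only if guest state was loaded. */
void pmu_on_vm_exit(struct vcpu_pmu *pmu)
{
	if (!pmu->guest_ctx_loaded)
		return;
	/* Placeholder: save guest PMU state, restore host PMU state. */
	pmu->guest_ctx_loaded = false;
	pmu->nr_switches++;
}
```

A guest that never enables a counter then pays no switch cost at VM-entry/exit, while the host PMU stays loaded across the whole run.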

thanks