<bikeshed> I think we should call this a mediated PMU, not a passthrough PMU. KVM still emulates the control plane (controls and event selectors), while the data is fully passed through (counters). </bikeshed>

On Fri, Jan 26, 2024, Xiong Zhang wrote:
> 1. host system wide / QEMU events handling during VM running
> At VM-entry, all host perf events which use the host x86 PMU will be
> stopped. Events with attr.exclude_guest=1 will be stopped here and
> restarted after VM-exit. Events without attr.exclude_guest=1 will be
> put into an error state, and they cannot recover to the active state
> even after the guest stops running. This impacts host perf a lot and
> requires host system-wide perf events to have attr.exclude_guest=1.
>
> This also requires the QEMU process's perf events to have
> attr.exclude_guest=1.
>
> While a VM is running, creating system-wide or QEMU-process perf
> events without attr.exclude_guest=1 fails with -EBUSY.
>
> 2. NMI watchdog
> The perf event for the NMI watchdog is a system-wide, CPU-pinned
> event; it is also stopped while a VM is running, but it doesn't have
> attr.exclude_guest=1, so we add that in this RFC. But this still means
> the NMI watchdog is non-functional while a VM is running.
>
> Two candidates exist for replacing the NMI watchdog's perf event:
> a. The buddy hardlockup detector [3] may not be reliable enough to
>    replace the perf event.
> b. The HPET-based hardlockup detector [4] isn't in the upstream
>    kernel.

I think the simplest solution is to allow mediated PMU usage if and only if the NMI watchdog is disabled. Then whether or not the host replaces the NMI watchdog with something else becomes an orthogonal discussion, i.e. not KVM's problem to solve.

> 3. Dedicated kvm_pmi_vector
> With the emulated vPMU, the host PMI handler notifies KVM to inject a
> virtual PMI into the guest when a physical PMI belongs to a guest
> counter. If the same mechanism were used for the passthrough vPMU and
> PMI skid caused a guest-owned physical PMI to arrive after VM-exit,
> the host PMI handler couldn't tell whether the PMI belongs to the
> host or the guest.
> So this RFC uses a dedicated kvm_pmi_vector; only PMIs belonging to
> the guest use this vector, while PMIs belonging to the host still use
> the NMI vector.
>
> Setting aside PMI skid, especially on AMD, the host NMI vector could
> also be used for guest PMIs; this method is simpler and doesn't

I don't see how multiplexing NMIs between guest and host is simpler. At best, the complexity is a wash, just in different locations, and I highly doubt it's a wash. AFAIK, there is no way to precisely know that an NMI came in via the LVTPC. E.g. if an IPI NMI arrives before the host's PMU is loaded, confusion may ensue. SVM has the luxury of running with GIF=0, but that simply isn't an option on VMX.

> need the x86 subsystem to reserve the dedicated kvm_pmi_vector, and we
> haven't hit the skid PMI issue on modern Intel processors.
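For what it's worth, the LVTPC plumbing for the dedicated vector is tiny; conceptually it is just toggling the delivery mode around the guest PMU load/put, along the lines of the untested sketch below (KVM_GUEST_PMI_VECTOR is a placeholder name, not necessarily what the RFC actually calls the vector):

/*
 * Untested sketch, not the RFC's actual code.  KVM_GUEST_PMI_VECTOR is
 * a placeholder for whatever vector the x86 core reserves for KVM.
 * While the guest's PMU state is loaded, point the LVTPC at the
 * dedicated vector so a guest-owned PMI never shows up as a host NMI;
 * restore NMI delivery before handing the PMU back to host perf.
 */
static void kvm_lvtpc_to_guest(void)
{
	apic_write(APIC_LVTPC, APIC_DM_FIXED | KVM_GUEST_PMI_VECTOR);
}

static void kvm_lvtpc_to_host(void)
{
	apic_write(APIC_LVTPC, APIC_DM_NMI);
}

The handler behind the dedicated vector only ever sees guest-owned PMIs, which is exactly why the host-side ambiguity described above goes away.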
> 4. per-VM passthrough mode configuration
> The current RFC uses a read-only KVM module parameter,
> enable_passthrough_pmu, which decides at module load time whether the
> vPMU is in passthrough mode or emulated mode.
> Do we need the capability of per-VM passthrough mode configuration?
> That would let an admin launch some non-passthrough VMs and profile
> them from the host, although the admin still couldn't profile all VMs
> once a passthrough VM exists. This means mixing passthrough vPMU and
> emulated vPMU on one platform, which is challenging to implement.
> As noted in the message of commit 0011, the main challenge is that the
> passthrough vPMU and emulated vPMU have different vPMU features, which
> ends up requiring two different values for kvm_cap.supported_perf_cap,
> which is initialized at module load time. Supporting this needs more
> refactoring.

I have no objection to an all-or-nothing setup. I'd honestly love to rip out the existing vPMU support entirely, but that's probably not realistic, at least not in the near future.

> Remaining Work
> ===
> 1. To reduce passthrough vPMU overhead, optimize the PMU context
>    switch.

Before this gets out of its "RFC" phase, I would at least like line of sight to a more optimized switch. I 100% agree that starting with a conservative implementation is the way to go, and the kernel absolutely needs to be able to profile KVM itself (and everything KVM calls into), i.e. _always_ keeping the guest PMU loaded for the entirety of KVM_RUN isn't a viable option. But I also don't want to get into a situation where we can't figure out a clean, robust way to do the optimized context switch without needing (another) massive rewrite.
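To make the "conservative first" part concrete, my mental model is that the swap stays tied to the inner VM-Enter/VM-Exit path, roughly like the sketch below (hypothetical helper names, purely to illustrate the placement, not the RFC's code):

/*
 * Hypothetical helper names, purely to illustrate placement.  The
 * conservative approach swaps PMU ownership around every
 * VM-Enter/VM-Exit, so host perf (including profiling KVM itself)
 * keeps working except for the narrow window where the guest's
 * counters are actually loaded in hardware.
 */
static void kvm_mediated_pmu_load(struct kvm_vcpu *vcpu)
{
	perf_guest_enter();			/* hypothetical: quiesce host events */
	kvm_pmu_restore_guest_context(vcpu);	/* hypothetical: load guest counters */
}

static void kvm_mediated_pmu_put(struct kvm_vcpu *vcpu)
{
	kvm_pmu_save_guest_context(vcpu);	/* hypothetical: stash guest counters */
	perf_guest_exit();			/* hypothetical: re-arm host events */
}

The optimized version presumably needs to make that inner-loop swap cheaper, not hoist it out to the KVM_RUN boundary, since the latter is exactly the "KVM can't profile itself" situation above.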