On Thu, Apr 18, 2024, Mingwei Zhang wrote:
> On Thu, Apr 11, 2024, Sean Christopherson wrote:
> > <bikeshed>
> >
> > I think we should call this a mediated PMU, not a passthrough PMU. KVM still
> > emulates the control plane (controls and event selectors), while the data is
> > fully passed through (counters).
> >
> > </bikeshed>
> Sean,
>
> I feel "mediated PMU" seems to be a little bit off the ..., no? In
> KVM, almost all features are mediated. In our specific case, the
> legacy PMU is mediated by KVM and the perf subsystem on the host. In
> the new design, it is mediated by KVM only.
>
> We intercept the control plane in the current design, but the only
> thing we do is event filtering. There is no fancy code to emulate the
> control registers. So, it is still passthrough logic.
>
> In some (rare) business cases, I think we could fully pass through
> the control plane as well. For instance, a sole-tenant machine, or a
> full-machine VM + full offload. If there is a CPU errata, KVM can
> force vmexits and dynamically intercept the selectors on all vCPUs
> with filters checked. It is not supported in the current RFC, but may
> be doable in later versions.
>
> With the above, I wonder if we can still use "passthrough PMU" for
> simplicity? But no strong opinion if you really want to keep this
> name; I would just have to take some time to convince myself.
>
> One proposal: maybe "direct vPMU"? I think there are many words that
> focus on the "passthrough" side but not on the "interception/mediation"
> side.
>
> Thanks.
> -Mingwei
>
> > On Fri, Jan 26, 2024, Xiong Zhang wrote:
> > > 1. Host system-wide / QEMU events handling during VM running
> > > At VM-entry, all host perf events which use the host x86 PMU will be
> > > stopped. Events with attr.exclude_guest = 1 will be stopped here and
> > > restarted after VM-exit. Events without attr.exclude_guest = 1 will be
> > > put into an error state, and they cannot recover into the active state
> > > even after the guest stops running. This impacts host perf a lot and
> > > requires host system-wide perf events to have attr.exclude_guest = 1.
> > >
> > > This requires the QEMU process's perf events to have
> > > attr.exclude_guest = 1 as well.
> > >
> > > While a VM is running, creation of system-wide or QEMU-process perf
> > > events without attr.exclude_guest = 1 fails with -EBUSY.
> > >
> > > 2. NMI watchdog
> > > The perf event for the NMI watchdog is a system-wide, CPU-pinned
> > > event. It is also stopped while a VM is running, but it doesn't have
> > > attr.exclude_guest = 1, so we add that in this RFC. This still means
> > > the NMI watchdog loses its function while a VM is running.
> > >
> > > Two candidates exist for replacing the NMI watchdog's perf event:
> > > a. The buddy hardlockup detector [3] may not be reliable enough to
> > >    replace the perf event.
> > > b. The HPET-based hardlockup detector [4] isn't in the upstream kernel.
> >
> > I think the simplest solution is to allow mediated PMU usage if and only if
> > the NMI watchdog is disabled. Then whether or not the host replaces the NMI
> > watchdog with something else becomes an orthogonal discussion, i.e. not KVM's
> > problem to solve.
> >
> > > 3. Dedicated kvm_pmi_vector
> > > In the emulated vPMU, the host PMI handler notifies KVM to inject a
> > > virtual PMI into the guest when the physical PMI belongs to a guest
> > > counter. If the same mechanism is used in the passthrough vPMU and PMI
> > > skid exists, which can cause a physical PMI belonging to the guest to
> > > arrive after VM-exit, then the host PMI handler couldn't tell whether
> > > this PMI belongs to the host or the guest.
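[ Illustrative aside: from the host's side, the ambiguity described above
  looks roughly like the sketch below. This is not the RFC's or perf's
  actual handler; the module and handler name are made up, and the
  GLOBAL_STATUS read is Intel-specific. ]

/*
 * Sketch only -- not the RFC's code, and not perf's real PMI handler.
 * With a shared NMI vector, a guest counter overflow that skids past
 * VM-exit reaches the host handler looking like any other NMI.
 */
#include <linux/module.h>
#include <asm/ptrace.h>
#include <asm/nmi.h>
#include <asm/msr.h>

static int sketch_pmi_handler(unsigned int cmd, struct pt_regs *regs)
{
	u64 status;

	/* Intel-specific MSR, read purely for illustration. */
	rdmsrl(MSR_CORE_PERF_GLOBAL_STATUS, status);
	if (!status)
		return NMI_DONE;	/* Not the PMU?  Or a guest PMI whose
					 * counters were already switched out? */

	/*
	 * An overflow bit is set, but it could belong to the host or to a
	 * guest counter that overflowed just before VM-exit -- nothing
	 * here can tell the two apart.
	 */
	return NMI_HANDLED;
}

static int __init sketch_init(void)
{
	return register_nmi_handler(NMI_LOCAL, sketch_pmi_handler, 0,
				    "sketch_pmi");
}

static void __exit sketch_exit(void)
{
	unregister_nmi_handler(NMI_LOCAL, "sketch_pmi");
}

module_init(sketch_init);
module_exit(sketch_exit);
MODULE_LICENSE("GPL");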
> > > So this RFC uses a dedicated kvm_pmi_vector; only PMIs belonging to
> > > the guest use this vector. PMIs belonging to the host still use the
> > > NMI vector.
> > >
> > > Without considering PMI skid (especially on AMD), the host NMI vector
> > > could be used for guest PMIs as well; this method is simpler and doesn't
> >
> > I don't see how multiplexing NMIs between guest and host is simpler. At best,
> > the complexity is a wash, just in different locations, and I highly doubt it's
> > a wash. AFAIK, there is no way to precisely know that an NMI came in via the
> > LVTPC.
> >
> > E.g. if an IPI NMI arrives before the host's PMU is loaded, confusion may ensue.
> > SVM has the luxury of running with GIF=0, but that simply isn't an option on VMX.
> >
> > > need the x86 subsystem to reserve a dedicated kvm_pmi_vector, and we
> > > didn't hit the skid-PMI issue on modern Intel processors.
> > >
> > > 4. Per-VM passthrough mode configuration
> > > The current RFC uses a read-only KVM module parameter,
> > > enable_passthrough_pmu, which decides at kvm module load time whether
> > > the vPMU is in passthrough mode or emulated mode.
> > > Do we need the capability of per-VM passthrough mode configuration, so
> > > that an admin can launch some non-passthrough VMs and profile those
> > > non-passthrough VMs from the host? The admin still could not profile
> > > all VMs once a passthrough VM exists. This means passthrough vPMU and
> > > emulated vPMU would mix on one platform, which is challenging to
> > > implement. As described in the commit message of commit 0011, the main
> > > challenge is that passthrough vPMU and emulated vPMU have different
> > > vPMU features; this ends up with two different values for
> > > kvm_cap.supported_perf_cap, which is initialized at module load time.
> > > Supporting it needs more refactoring.
> >
> > I have no objection to an all-or-nothing setup. I'd honestly love to rip out the
> > existing vPMU support entirely, but that's probably not realistic, at least not
> > in the near future.
> >
> > > Remaining Work
> > > ===
> > > 1. To reduce passthrough vPMU overhead, optimize the PMU context switch.
> >
> > Before this gets out of its "RFC" phase, I would at least like line of sight to
> > a more optimized switch. I 100% agree that starting with a conservative
> > implementation is the way to go, and the kernel absolutely needs to be able to
> > profile KVM itself (and everything KVM calls into), i.e. _always_ keeping the
> > guest PMU loaded for the entirety of KVM_RUN isn't a viable option.
> >
> > But I also don't want to get into a situation where we can't figure out a clean,
> > robust way to do the optimized context switch without needing (another) massive
> > rewrite.
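[ For reference, the attr.exclude_guest requirement from item 1 of the
  cover letter is just a bit in struct perf_event_attr. The sketch below
  is a self-contained userspace example of opening a host counting event
  that way; it is illustrative only, not QEMU's or the perf tool's
  actual code. ]

/*
 * Minimal userspace sketch (not QEMU's or the perf tool's code): open a
 * host counting event with attr.exclude_guest = 1, i.e. an event that is
 * paused across guest execution instead of ending up in an error state.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
			    int cpu, int group_fd, unsigned long flags)
{
	return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	struct perf_event_attr attr;
	long long count;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.disabled = 1;
	attr.exclude_guest = 1;		/* don't count while a guest is running */

	fd = perf_event_open(&attr, 0, -1, -1, 0);	/* this task, any CPU */
	if (fd < 0) {
		perror("perf_event_open");
		return EXIT_FAILURE;
	}

	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
	/* ... workload to measure goes here ... */
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("cycles: %lld\n", count);

	close(fd);
	return 0;
}

[ Per item 1 above, system-wide or QEMU-process events opened without this
  bit would fail with -EBUSY while a passthrough VM is running. ]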