On Thu, Apr 11, 2024, Kan Liang wrote:
> >> +/*
> >> + * When a guest enters, force all active events of the PMU, which supports
> >> + * the VPMU_PASSTHROUGH feature, to be scheduled out. The events of other
> >> + * PMUs, such as uncore PMU, should not be impacted. The guest can
> >> + * temporarily own all counters of the PMU.
> >> + * During the period, all the creation of the new event of the PMU with
> >> + * !exclude_guest are error out.
> >> + */
> >> +void perf_guest_enter(void)
> >> +{
> >> +	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
> >> +
> >> +	lockdep_assert_irqs_disabled();
> >> +
> >> +	if (__this_cpu_read(__perf_force_exclude_guest))
> >
> > This should be a WARN_ON_ONCE, no?
>
> To debug the improper behavior of KVM?

Not so much "debug" as ensure that the platform owner notices that KVM is
buggy.

> >> +static inline int perf_force_exclude_guest_check(struct perf_event *event,
> >> +						 int cpu, struct task_struct *task)
> >> +{
> >> +	bool *force_exclude_guest = NULL;
> >> +
> >> +	if (!has_vpmu_passthrough_cap(event->pmu))
> >> +		return 0;
> >> +
> >> +	if (event->attr.exclude_guest)
> >> +		return 0;
> >> +
> >> +	if (cpu != -1) {
> >> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, cpu);
> >> +	} else if (task && (task->flags & PF_VCPU)) {
> >> +		/*
> >> +		 * Just need to check the running CPU in the event creation. If the
> >> +		 * task is moved to another CPU which supports the force_exclude_guest.
> >> +		 * The event will filtered out and be moved to the error stage. See
> >> +		 * merge_sched_in().
> >> +		 */
> >> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, task_cpu(task));
> >> +	}
> >
> > These checks are extremely racy, I don't see how this can possibly do the
> > right thing.  PF_VCPU isn't a "this is a vCPU task", it's a "this task is about
> > to do VM-Enter, or just took a VM-Exit" (the "I'm a virtual CPU" comment in
> > include/linux/sched.h is wildly misleading, as it's _only_ valid when accounting
> > time slices).
> >
> This is to reject an !exclude_guest event creation for a running
> "passthrough" guest from host perf tool.
> Could you please suggest a way to detect it via the struct task_struct?
>
> > Digging deeper, I think __perf_force_exclude_guest has similar problems, e.g.
> > perf_event_create_kernel_counter() calls perf_event_alloc() before acquiring the
> > per-CPU context mutex.
>
> Do you mean that the perf_guest_enter() check could be happened right
> after the perf_force_exclude_guest_check()?
> It's possible. For this case, the event can still be created. It will be
> treated as an existing event and handled in merge_sched_in(). It will
> never be scheduled when a guest is running.
>
> The perf_force_exclude_guest_check() is to make sure most of the cases
> can be rejected at the creation place. For the corner cases, they will
> be rejected in the schedule stage.

Ah, the "rejected in the schedule stage" is what I'm missing.

But that creates a gross ABI, because IIUC, event creation will "randomly"
succeed based on whether or not a CPU happens to be running in a KVM guest.
I.e. it's not just the kernel code that has races, the entire event creation
is one big race.

What if perf had a global knob to enable/disable mediated PMU support?  Then
when KVM is loaded with enable_mediated_pmu=true, call into perf to (a) check
that there are no existing !exclude_guest events (this part could be
optional), and (b) set the global knob to reject all new !exclude_guest
events (for the core PMU?).
Hmm, or probably better, do it at VM creation.  That has the advantage of
playing nice with CONFIG_KVM=y (perf could reject the enabling without
completely breaking KVM), and not causing problems if KVM is auto-probed but
the user doesn't actually want to run VMs.

E.g. (very roughly)

int x86_perf_get_mediated_pmu(void)
{
	if (refcount_inc_not_zero(...))
		return 0;

	if (<system wide events>)
		return -EBUSY;

	<slow path with locking>
}

void x86_perf_put_mediated_pmu(void)
{
	if (!refcount_dec_and_test(...))
		return;

	<slow path with locking>
}

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1bbf312cbd73..f2994377ef44 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12467,6 +12467,12 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	if (type)
 		return -EINVAL;
 
+	if (enable_mediated_pmu) {
+		ret = x86_perf_get_mediated_pmu();
+		if (ret)
+			return ret;
+	}
+
 	ret = kvm_page_track_init(kvm);
 	if (ret)
 		goto out;
@@ -12518,6 +12524,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm_mmu_uninit_vm(kvm);
 	kvm_page_track_cleanup(kvm);
 out:
+	x86_perf_put_mediated_pmu();
 	return ret;
 }
 
@@ -12659,6 +12666,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_page_track_cleanup(kvm);
 	kvm_xen_destroy_vm(kvm);
 	kvm_hv_destroy_vm(kvm);
+	x86_perf_put_mediated_pmu();
 }
 
 static void memslot_rmap_free(struct kvm_memory_slot *slot)
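
Purely as an illustration of the above, one way the "<system wide events>"
check and the "<slow path with locking>" placeholders could be filled in is
sketched below.  All of the names (perf_mediated_pmu_mutex,
nr_mediated_pmu_vms, nr_include_guest_events) are hypothetical, not existing
perf symbols.

#include <linux/atomic.h>
#include <linux/mutex.h>
#include <linux/refcount.h>

static DEFINE_MUTEX(perf_mediated_pmu_mutex);
static refcount_t nr_mediated_pmu_vms = REFCOUNT_INIT(0);

/* Hypothetical counter, bumped whenever a !exclude_guest event is created. */
static atomic_t nr_include_guest_events;

int x86_perf_get_mediated_pmu(void)
{
	int ret = 0;

	/* Fast path: another VM already flipped the global knob. */
	if (refcount_inc_not_zero(&nr_mediated_pmu_vms))
		return 0;

	mutex_lock(&perf_mediated_pmu_mutex);

	/* Someone else won the race while we waited for the mutex. */
	if (refcount_inc_not_zero(&nr_mediated_pmu_vms))
		goto unlock;

	/* Reject the transition if !exclude_guest events already exist. */
	if (atomic_read(&nr_include_guest_events)) {
		ret = -EBUSY;
		goto unlock;
	}

	refcount_set(&nr_mediated_pmu_vms, 1);
unlock:
	mutex_unlock(&perf_mediated_pmu_mutex);
	return ret;
}

void x86_perf_put_mediated_pmu(void)
{
	/* Fast path: other VMs still hold references. */
	if (refcount_dec_not_one(&nr_mediated_pmu_vms))
		return;

	/* Last reference: drop it under the mutex, pairing with the slow get(). */
	mutex_lock(&perf_mediated_pmu_mutex);
	refcount_set(&nr_mediated_pmu_vms, 0);
	mutex_unlock(&perf_mediated_pmu_mutex);
}

The idea being that only the first get() and the final put() serialize on the
mutex; once any VM holds a reference, additional VM creations just bump the
refcount.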