>>> +static inline int perf_force_exclude_guest_check(struct perf_event *event, >>> + int cpu, struct task_struct *task) >>> +{ >>> + bool *force_exclude_guest = NULL; >>> + >>> + if (!has_vpmu_passthrough_cap(event->pmu)) >>> + return 0; >>> + >>> + if (event->attr.exclude_guest) >>> + return 0; >>> + >>> + if (cpu != -1) { >>> + force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, cpu); >>> + } else if (task && (task->flags & PF_VCPU)) { >>> + /* >>> + * Just need to check the running CPU in the event creation. If the >>> + * task is moved to another CPU which supports the force_exclude_guest. >>> + * The event will filtered out and be moved to the error stage. See >>> + * merge_sched_in(). >>> + */ >>> + force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, task_cpu(task)); >>> + } >> >> These checks are extremely racy, I don't see how this can possibly do the >> right thing. PF_VCPU isn't a "this is a vCPU task", it's a "this task is about >> to do VM-Enter, or just took a VM-Exit" (the "I'm a virtual CPU" comment in >> include/linux/sched.h is wildly misleading, as it's _only_ valid when accounting >> time slices). >> > > This is to reject an !exclude_guest event creation for a running > "passthrough" guest from host perf tool. > Could you please suggest a way to detect it via the struct task_struct? Here PF_VCPU is used to distinguish a perf event profiling userspace VMM process, like perf record -e {} -p $QEMU_PID. A lot of emails have discussed how to handle system wide perf event which has perf_event.attr.task == NULL. But perf event for user space VMM should be handled the same as system wide perf event, perf need a method to identify a process perf event is for user space VMM. PF_VCPU isn't the right one, then an open how to handle this ? thanks > > >> Digging deeper, I think __perf_force_exclude_guest has similar problems, e.g. >> perf_event_create_kernel_counter() calls perf_event_alloc() before acquiring the >> per-CPU context mutex. > > Do you mean that the perf_guest_enter() check could be happened right > after the perf_force_exclude_guest_check()? > It's possible. For this case, the event can still be created. It will be > treated as an existing event and handled in merge_sched_in(). It will > never be scheduled when a guest is running. > > The perf_force_exclude_guest_check() is to make sure most of the cases > can be rejected at the creation place. For the corner cases, they will > be rejected in the schedule stage. > >> >>> + if (force_exclude_guest && *force_exclude_guest) >>> + return -EBUSY; >>> + return 0; >>> +} >>> + >>> /* >>> * Holding the top-level event's child_mutex means that any >>> * descendant process that has inherited this event will block >>> @@ -11973,6 +12142,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu, >>> goto err_ns; >>> } >>> >>> + if (perf_force_exclude_guest_check(event, cpu, task)) { >> >> This should be: >> >> err = perf_force_exclude_guest_check(event, cpu, task); >> if (err) >> goto err_pmu; >> >> i.e. shouldn't effectively ignore/override the return result. >> > > Sure. > > Thanks, > Kan > >>> + err = -EBUSY; >>> + goto err_pmu; >>> + } >>> + >>> /* >>> * Disallow uncore-task events. Similarly, disallow uncore-cgroup >>> * events (they don't make sense as the cgroup will be different >>> -- >>> 2.34.1 >>> >