Re: [RFC PATCH 02/41] perf: Support guest enter/exit interfaces

"Zhang, Xiong Y" <xiong.y.zhang@xxxxxxxxxxxxxxx> · Fri, 26 Apr 2024 12:09:48 +0800



>>> +static inline int perf_force_exclude_guest_check(struct perf_event *event,
>>> +						 int cpu, struct task_struct *task)
>>> +{
>>> +	bool *force_exclude_guest = NULL;
>>> +
>>> +	if (!has_vpmu_passthrough_cap(event->pmu))
>>> +		return 0;
>>> +
>>> +	if (event->attr.exclude_guest)
>>> +		return 0;
>>> +
>>> +	if (cpu != -1) {
>>> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, cpu);
>>> +	} else if (task && (task->flags & PF_VCPU)) {
>>> +		/*
>>> +		 * Just need to check the running CPU in the event creation. If the
>>> +		 * task is moved to another CPU which supports the force_exclude_guest.
>>> +		 * The event will filtered out and be moved to the error stage. See
>>> +		 * merge_sched_in().
>>> +		 */
>>> +		force_exclude_guest = per_cpu_ptr(&__perf_force_exclude_guest, task_cpu(task));
>>> +	}
>>
>> These checks are extremely racy, I don't see how this can possibly do the
>> right thing.  PF_VCPU isn't a "this is a vCPU task", it's a "this task is about
>> to do VM-Enter, or just took a VM-Exit" (the "I'm a virtual CPU" comment in
>> include/linux/sched.h is wildly misleading, as it's _only_ valid when accounting
>> time slices).
>>
> 
> This is to reject an !exclude_guest event creation for a running
> "passthrough" guest from host perf tool.
> Could you please suggest a way to detect it via the struct task_struct?
Here PF_VCPU is used to distinguish a perf event profiling userspace VMM
process, like perf record -e {} -p $QEMU_PID. A lot of emails have
discussed how to handle system wide perf event which has
perf_event.attr.task == NULL. But perf event for user space VMM should be
handled the same as system wide perf event, perf need a method to identify
a process perf event is for user space VMM. PF_VCPU isn't the right one,
then an open how to handle this ?

thanks
> 
> 
>> Digging deeper, I think __perf_force_exclude_guest has similar problems, e.g.
>> perf_event_create_kernel_counter() calls perf_event_alloc() before acquiring the
>> per-CPU context mutex.
> 
> Do you mean that the perf_guest_enter() check could be happened right
> after the perf_force_exclude_guest_check()?
> It's possible. For this case, the event can still be created. It will be
> treated as an existing event and handled in merge_sched_in(). It will
> never be scheduled when a guest is running.
> 
> The perf_force_exclude_guest_check() is to make sure most of the cases
> can be rejected at the creation place. For the corner cases, they will
> be rejected in the schedule stage.
> 
>>
>>> +	if (force_exclude_guest && *force_exclude_guest)
>>> +		return -EBUSY;
>>> +	return 0;
>>> +}
>>> +
>>>  /*
>>>   * Holding the top-level event's child_mutex means that any
>>>   * descendant process that has inherited this event will block
>>> @@ -11973,6 +12142,11 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
>>>  		goto err_ns;
>>>  	}
>>>  
>>> +	if (perf_force_exclude_guest_check(event, cpu, task)) {
>>
>> This should be:
>>
>> 	err = perf_force_exclude_guest_check(event, cpu, task);
>> 	if (err)
>> 		goto err_pmu;
>>
>> i.e. shouldn't effectively ignore/override the return result.
>>
> 
> Sure.
> 
> Thanks,
> Kan
> 
>>> +		err = -EBUSY;
>>> +		goto err_pmu;
>>> +	}
>>> +
>>>  	/*
>>>  	 * Disallow uncore-task events. Similarly, disallow uncore-cgroup
>>>  	 * events (they don't make sense as the cgroup will be different
>>> -- 
>>> 2.34.1
>>>
>