Re: [PATCH v2 07/54] perf: Add generic exclude_guest support

"Liang, Kan" <kan.liang@xxxxxxxxxxxxxxx> · Tue, 11 Jun 2024 09:27:46 -0400

On 2024-06-11 8:06 a.m., Peter Zijlstra wrote:
> On Mon, Jun 10, 2024 at 01:23:04PM -0400, Liang, Kan wrote:
>> On 2024-05-07 4:58 a.m., Peter Zijlstra wrote:
> 
>>> Bah, this is a ton of copy-paste from the normal scheduling code with
>>> random changes. Why ?
>>>
>>> Why can't this use ctx_sched_{in,out}() ? Surely the whole
>>> CAP_PASSTHROUGHT thing is but a flag away.
>>>
>>
>> Not just a flag. The time has to be updated as well, since the ctx->time
>> is shared among PMUs. Perf shouldn't stop it while other PMUs is still
>> running.
> 
> Obviously the original changelog didn't mention any of that.... :/

Yes, the time issue was a newly found bug when we test the uncore PMUs.

> 
>> A timeguest will be introduced to track the start time of a guest.
>> The event->tstamp of an exclude_guest event should always keep
>> ctx->timeguest while a guest is running.
>> When a guest is leaving, update the event->tstamp to now, so the guest
>> time can be deducted.
>>
>> The below patch demonstrate how the timeguest works.
>> (It's an incomplete patch. Just to show the idea.)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index 22d3e56682c9..2134e6886e22 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -953,6 +953,7 @@ struct perf_event_context {
>>  	u64				time;
>>  	u64				timestamp;
>>  	u64				timeoffset;
>> +	u64				timeguest;
>>
>>  	/*
>>  	 * These fields let us detect when two contexts have both
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 14fd881e3e1d..2aed56671a24 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -690,12 +690,31 @@ __perf_update_times(struct perf_event *event, u64
>> now, u64 *enabled, u64 *runnin
>>  		*running += delta;
>>  }
>>
>> +static void perf_event_update_time_guest(struct perf_event *event)
>> +{
>> +	/*
>> +	 * If a guest is running, use the timestamp while entering the guest.
>> +	 * If the guest is leaving, reset the event timestamp.
>> +	 */
>> +	if (!__this_cpu_read(perf_in_guest))
>> +		event->tstamp = event->ctx->time;
>> +	else
>> +		event->tstamp = event->ctx->timeguest;
>> +}
> 
> This conditional seems inverted, without a good reason. Also, in another
> thread you talk about some PMUs stopping time in a guest, while other
> PMUs would keep ticking. I don't think the above captures that.
> 
>>  static void perf_event_update_time(struct perf_event *event)
>>  {
>> -	u64 now = perf_event_time(event);
>> +	u64 now;
>> +
>> +	/* Never count the time of an active guest into an exclude_guest event. */
>> +	if (event->ctx->timeguest &&
>> +	    event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU)
>> +		return perf_event_update_time_guest(event);
> 
> Urgh, weird split. The PMU check is here. Please just inline the above
> here, this seems to be the sole caller anyway.
>

Sure

>>
>> +	now = perf_event_time(event);
>>  	__perf_update_times(event, now, &event->total_time_enabled,
>>  					&event->total_time_running);
>> +
>>  	event->tstamp = now;
>>  }
>>
>> @@ -3398,7 +3417,14 @@ ctx_sched_out(struct perf_event_context *ctx,
>> enum event_type_t event_type)
>>  			cpuctx->task_ctx = NULL;
>>  	}
>>
>> -	is_active ^= ctx->is_active; /* changed bits */
>> +	if (event_type & EVENT_GUEST) {
> 
> Patch doesn't introduce EVENT_GUEST, lost a hunk somewhere?

Sorry, there is a prerequisite patch to factor out the EVENT_CGROUP.
I thought it will be complex and confusion to paste both. Some details
are lost.
I will post both two patches at the end.

> 
>> +		/*
>> +		 * Schedule out all !exclude_guest events of PMU
>> +		 * with PERF_PMU_CAP_PASSTHROUGH_VPMU.
>> +		 */
>> +		is_active = EVENT_ALL;
>> +	} else
>> +		is_active ^= ctx->is_active; /* changed bits */
>>
>>  	list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
>>  		if (perf_skip_pmu_ctx(pmu_ctx, event_type))
> 
>> @@ -5894,14 +5933,18 @@ void perf_guest_enter(u32 guest_lvtpc)
>>  	}
>>
>>  	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>> -	ctx_sched_out(&cpuctx->ctx, EVENT_ALL | EVENT_GUEST);
>> +	ctx_sched_out(&cpuctx->ctx, EVENT_GUEST);
>> +	/* Set the guest start time */
>> +	cpuctx->ctx.timeguest = cpuctx->ctx.time;
>>  	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>>  	if (cpuctx->task_ctx) {
>>  		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>> -		task_ctx_sched_out(cpuctx->task_ctx, EVENT_ALL | EVENT_GUEST);
>> +		task_ctx_sched_out(cpuctx->task_ctx, EVENT_GUEST);
>> +		cpuctx->task_ctx->timeguest = cpuctx->task_ctx->time;
>>  		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>  	}
>>
>>  	__this_cpu_write(perf_in_guest, true);
>> @@ -5925,14 +5968,17 @@ void perf_guest_exit(void)
>>
>>  	__this_cpu_write(perf_in_guest, false);
>>
>>  	perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST);
>> -	ctx_sched_in(&cpuctx->ctx, EVENT_ALL | EVENT_GUEST);
>> +	ctx_sched_in(&cpuctx->ctx, EVENT_GUEST);
>> +	cpuctx->ctx.timeguest = 0;
>>  	perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST);
>>  	if (cpuctx->task_ctx) {
>>  		perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST);
>> -		ctx_sched_in(cpuctx->task_ctx, EVENT_ALL | EVENT_GUEST);
>> +		ctx_sched_in(cpuctx->task_ctx, EVENT_GUEST);
>> +		cpuctx->task_ctx->timeguest = 0;
>>  		perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST);
>>  	}
> 
> I'm thinking EVENT_GUEST should cause the ->timeguest updates, no point
> in having them explicitly duplicated here, hmm?

We have to add a EVENT_GUEST check and update the ->timeguest at the end
of the ctx_sched_out/in functions after the pmu_ctx_sched_out/in().
Because the ->timeguest also be used to check if it's leaving the guest
in the perf_event_update_time().

Since the EVENT_GUEST only be used by perf_guest_enter/exit(), I thought
maybe it's better to move it to where it is used rather than the generic
ctx_sched_out/in(). It will minimize the impact on the
non-virtualization user.

Here are the two complete patches.

The first one is to factor out the EVENT_CGROUP.