On 2024-06-11 8:06 a.m., Peter Zijlstra wrote: > On Mon, Jun 10, 2024 at 01:23:04PM -0400, Liang, Kan wrote: >> On 2024-05-07 4:58 a.m., Peter Zijlstra wrote: > >>> Bah, this is a ton of copy-paste from the normal scheduling code with >>> random changes. Why ? >>> >>> Why can't this use ctx_sched_{in,out}() ? Surely the whole >>> CAP_PASSTHROUGHT thing is but a flag away. >>> >> >> Not just a flag. The time has to be updated as well, since the ctx->time >> is shared among PMUs. Perf shouldn't stop it while other PMUs is still >> running. > > Obviously the original changelog didn't mention any of that.... :/ Yes, the time issue was a newly found bug when we test the uncore PMUs. > >> A timeguest will be introduced to track the start time of a guest. >> The event->tstamp of an exclude_guest event should always keep >> ctx->timeguest while a guest is running. >> When a guest is leaving, update the event->tstamp to now, so the guest >> time can be deducted. >> >> The below patch demonstrate how the timeguest works. >> (It's an incomplete patch. Just to show the idea.) >> >> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h >> index 22d3e56682c9..2134e6886e22 100644 >> --- a/include/linux/perf_event.h >> +++ b/include/linux/perf_event.h >> @@ -953,6 +953,7 @@ struct perf_event_context { >> u64 time; >> u64 timestamp; >> u64 timeoffset; >> + u64 timeguest; >> >> /* >> * These fields let us detect when two contexts have both >> diff --git a/kernel/events/core.c b/kernel/events/core.c >> index 14fd881e3e1d..2aed56671a24 100644 >> --- a/kernel/events/core.c >> +++ b/kernel/events/core.c >> @@ -690,12 +690,31 @@ __perf_update_times(struct perf_event *event, u64 >> now, u64 *enabled, u64 *runnin >> *running += delta; >> } >> >> +static void perf_event_update_time_guest(struct perf_event *event) >> +{ >> + /* >> + * If a guest is running, use the timestamp while entering the guest. >> + * If the guest is leaving, reset the event timestamp. >> + */ >> + if (!__this_cpu_read(perf_in_guest)) >> + event->tstamp = event->ctx->time; >> + else >> + event->tstamp = event->ctx->timeguest; >> +} > > This conditional seems inverted, without a good reason. Also, in another > thread you talk about some PMUs stopping time in a guest, while other > PMUs would keep ticking. I don't think the above captures that. > >> static void perf_event_update_time(struct perf_event *event) >> { >> - u64 now = perf_event_time(event); >> + u64 now; >> + >> + /* Never count the time of an active guest into an exclude_guest event. */ >> + if (event->ctx->timeguest && >> + event->pmu->capabilities & PERF_PMU_CAP_PASSTHROUGH_VPMU) >> + return perf_event_update_time_guest(event); > > Urgh, weird split. The PMU check is here. Please just inline the above > here, this seems to be the sole caller anyway. > Sure >> >> + now = perf_event_time(event); >> __perf_update_times(event, now, &event->total_time_enabled, >> &event->total_time_running); >> + >> event->tstamp = now; >> } >> >> @@ -3398,7 +3417,14 @@ ctx_sched_out(struct perf_event_context *ctx, >> enum event_type_t event_type) >> cpuctx->task_ctx = NULL; >> } >> >> - is_active ^= ctx->is_active; /* changed bits */ >> + if (event_type & EVENT_GUEST) { > > Patch doesn't introduce EVENT_GUEST, lost a hunk somewhere? Sorry, there is a prerequisite patch to factor out the EVENT_CGROUP. I thought it will be complex and confusion to paste both. Some details are lost. I will post both two patches at the end. > >> + /* >> + * Schedule out all !exclude_guest events of PMU >> + * with PERF_PMU_CAP_PASSTHROUGH_VPMU. >> + */ >> + is_active = EVENT_ALL; >> + } else >> + is_active ^= ctx->is_active; /* changed bits */ >> >> list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) { >> if (perf_skip_pmu_ctx(pmu_ctx, event_type)) > >> @@ -5894,14 +5933,18 @@ void perf_guest_enter(u32 guest_lvtpc) >> } >> >> perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST); >> - ctx_sched_out(&cpuctx->ctx, EVENT_ALL | EVENT_GUEST); >> + ctx_sched_out(&cpuctx->ctx, EVENT_GUEST); >> + /* Set the guest start time */ >> + cpuctx->ctx.timeguest = cpuctx->ctx.time; >> perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST); >> if (cpuctx->task_ctx) { >> perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST); >> - task_ctx_sched_out(cpuctx->task_ctx, EVENT_ALL | EVENT_GUEST); >> + task_ctx_sched_out(cpuctx->task_ctx, EVENT_GUEST); >> + cpuctx->task_ctx->timeguest = cpuctx->task_ctx->time; >> perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST); >> } >> >> __this_cpu_write(perf_in_guest, true); >> @@ -5925,14 +5968,17 @@ void perf_guest_exit(void) >> >> __this_cpu_write(perf_in_guest, false); >> >> perf_ctx_disable(&cpuctx->ctx, EVENT_GUEST); >> - ctx_sched_in(&cpuctx->ctx, EVENT_ALL | EVENT_GUEST); >> + ctx_sched_in(&cpuctx->ctx, EVENT_GUEST); >> + cpuctx->ctx.timeguest = 0; >> perf_ctx_enable(&cpuctx->ctx, EVENT_GUEST); >> if (cpuctx->task_ctx) { >> perf_ctx_disable(cpuctx->task_ctx, EVENT_GUEST); >> - ctx_sched_in(cpuctx->task_ctx, EVENT_ALL | EVENT_GUEST); >> + ctx_sched_in(cpuctx->task_ctx, EVENT_GUEST); >> + cpuctx->task_ctx->timeguest = 0; >> perf_ctx_enable(cpuctx->task_ctx, EVENT_GUEST); >> } > > I'm thinking EVENT_GUEST should cause the ->timeguest updates, no point > in having them explicitly duplicated here, hmm? We have to add a EVENT_GUEST check and update the ->timeguest at the end of the ctx_sched_out/in functions after the pmu_ctx_sched_out/in(). Because the ->timeguest also be used to check if it's leaving the guest in the perf_event_update_time(). Since the EVENT_GUEST only be used by perf_guest_enter/exit(), I thought maybe it's better to move it to where it is used rather than the generic ctx_sched_out/in(). It will minimize the impact on the non-virtualization user. Here are the two complete patches. The first one is to factor out the EVENT_CGROUP.