Hi Peter,

On Wed, Oct 4, 2023 at 9:02 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Oct 03, 2023 at 09:08:44PM -0700, Namhyung Kim wrote:
> > But after the change, it ended up iterating all pmus/events in the cpu
> > context if there's a cgroup event somewhere on the cpu context.
> > Unfortunately it includes uncore pmus which have much longer latency to
> > control.
>
> Can you describe the problem in more detail please?

Sure.

>
> We have cgrp as part of the tree key: {cpu, pmu, cgroup, idx},
> so it should be possible to find a specific cgroup for a cpu and/or skip
> to the next cgroup on that cpu in O(log n) time.

That helps within a single (core) pmu when it has a lot of events.
But this problem is different: it's about touching more pmus than
necessary.

Say we have the following events for CPU 0:

  sw:     context-switches
  core:   cycles, cycles-for-cgroup-A
  uncore: whatever

The cpu context has a cgroup event, so it needs to call
perf_cgroup_switch() at every context switch.  But it actually only
needs to resched the 'core' pmu, since that's the only one with a
cgroup event.  The other pmus' events (like context-switches or any
uncore event) should not be bothered by that.

But perf_cgroup_switch() calls the generic functions, which iterate
over all pmus in the (cpu) context:

  cpuctx.ctx.pmu_ctx_list:
  +-> sw -> core -> uncore   (pmu_ctx_entry)

It then disables the pmus, scheds out the current events, switches
the cgroup pointer, scheds in the new events, and re-enables the
pmus.  This adds a lot of overhead when uncore pmus are present,
since accessing MSRs for uncore pmus has much longer latency --
even though uncore pmus cannot have cgroup events in the first
place.

So we need a separate list to keep only the pmus that have active
cgroup events:

  cpuctx.cgrp_ctx_list:
  +-> core   (cgrp_ctx_entry)

And we also need logic to do the same work only for this list.
(A rough sketch of what I mean is at the bottom of this mail.)

Hope this helps.

>
> > To fix the issue, I restored a linked list equivalent to cgrp_cpuctx_list
> > in the perf_cpu_context and linked only the perf_cpu_pmu_contexts that
> > have cgroup events.  Also added new helpers to enable/disable and do the
> > ctx sched in/out for cgroups.
>
> Adding a list and duplicating the whole scheduling infrastructure seems
> 'unfortunate' at best.

Yeah, I know.. but I couldn't come up with a better solution.

Thanks,
Namhyung
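
P.S. Here's a very rough sketch of the shape of the change, in case it
makes the above clearer.  This is not the actual patch: it's untested,
the locking and the cgroup sched in/out details are elided, and only
the cgrp_ctx_list/cgrp_ctx_entry fields are new -- everything else is
just the existing structures and helpers as I remember them.

struct perf_cpu_pmu_context {
	struct perf_event_pmu_context	epc;
	/* ... existing fields ... */
	struct list_head		cgrp_ctx_entry;	/* new: on cpuctx->cgrp_ctx_list */
};

struct perf_cpu_context {
	struct perf_event_context	ctx;
	/* ... existing fields ... */
	struct list_head		cgrp_ctx_list;	/* new: pmus with cgroup events */
};

static void perf_cgroup_switch(struct task_struct *task)
{
	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
	struct perf_cgroup *cgrp = perf_cgroup_from_task(task, NULL);
	struct perf_cpu_pmu_context *cpc;

	/* nothing to do if the task stays in the same cgroup */
	if (READ_ONCE(cpuctx->cgrp) == cgrp)
		return;

	/*
	 * Walk only the pmus that actually have cgroup events.  An
	 * uncore pmu is never on this list, so its MSRs are not
	 * touched at all on a context switch.
	 */
	list_for_each_entry(cpc, &cpuctx->cgrp_ctx_list, cgrp_ctx_entry) {
		perf_pmu_disable(cpc->epc.pmu);
		/* sched out the old cgroup's events on this pmu */
	}

	cpuctx->cgrp = cgrp;

	list_for_each_entry(cpc, &cpuctx->cgrp_ctx_list, cgrp_ctx_entry) {
		/* sched in the new cgroup's events on this pmu */
		perf_pmu_enable(cpc->epc.pmu);
	}
}

The list would be maintained from the cgroup event add/del path, so it
only ever contains pmus with at least one cgroup event on that cpu.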