On Thu, Nov 18, 2021 at 03:28:40PM -0500, Kenny Ho wrote:
> @@ -245,6 +256,21 @@ static int compute_effective_progs(struct cgroup *cgrp,
>          if (!progs)
>                  return -ENOMEM;
>
> +        if (atype == CGROUP_TRACEPOINT) {
> +                /* TODO: only create event for cgroup that can have process */
> +
> +                attr.config = bpf_attach_subtype;
> +                attr.type = PERF_TYPE_TRACEPOINT;
> +                attr.sample_type = PERF_SAMPLE_RAW;
> +                attr.sample_period = 1;
> +                attr.wakeup_events = 1;
> +
> +                rc = perf_event_create_for_all_cpus(&attr, cgrp,
> +                                &cgrp->bpf.per_cg_events);
> +                if (rc)
> +                        goto err;
> +        }
...
> +int perf_event_create_for_all_cpus(struct perf_event_attr *attr,
> +                                   struct cgroup *cgroup,
> +                                   struct list_head *entries)
> +{
> +        struct perf_event **events;
> +        struct perf_cgroup *perf_cgrp;
> +        int cpu, i = 0;
> +
> +        events = kzalloc(sizeof(struct perf_event *) * num_possible_cpus(),
> +                         GFP_KERNEL);
> +
> +        if (!events)
> +                return -ENOMEM;
> +
> +        for_each_possible_cpu(cpu) {
> +                /* allocate first, connect the cgroup later */
> +                events[i] = perf_event_create_kernel_counter(attr, cpu, NULL, NULL, NULL);

This is a very heavy hammer for this task.
There is really no need to create a perf_event.
Did you consider using the raw_tp approach instead?
It doesn't need this heavy stuff.

Also, I suspect that in a follow-up you'd be adding tracepoints to GPU code?
Did you consider just leaving a few __weak global functions in the GPU code
and letting bpf progs attach to them as fentry?

I suspect the true hierarchical nature of the bpf-cgroup framework isn't
necessary. The bpf program itself can filter for a given cgroup.
We have bpf_current_task_under_cgroup() and friends.

I suggest sprinkling __weak empty funcs in the GPU code and seeing what you
can do with fentry and bpf_current_task_under_cgroup().
There is also bpf_get_current_ancestor_cgroup_id().
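
To make the fentry suggestion concrete, here is a rough sketch of what the
bpf side could look like. It is only an illustration under assumptions: the
hook gpu_sched_hook(), its job_id argument, the map name target_cgroup and
the counter hits are hypothetical and do not come from the patch. It also
assumes a libbpf-style build with a vmlinux.h generated by bpftool, and that
user space stores a cgroup v2 directory fd in slot 0 of the map before
attaching.

```c
// SPDX-License-Identifier: GPL-2.0
/*
 * Assumed kernel side (hypothetical, not part of the patch):
 *
 *     __weak noinline void gpu_sched_hook(u64 job_id) { }
 *
 * called from the GPU driver at the point of interest.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Slot 0 holds the cgroup to filter on; user space fills it in. */
struct {
	__uint(type, BPF_MAP_TYPE_CGROUP_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u32);
} target_cgroup SEC(".maps");

/* Number of hook calls made by tasks inside the target cgroup. */
__u64 hits;

SEC("fentry/gpu_sched_hook")
int BPF_PROG(on_gpu_sched_hook, __u64 job_id)
{
	/* Helper returns 1 only when current is under the cgroup in slot 0;
	 * 0 (not under) and negative errors are both skipped here.
	 */
	if (bpf_current_task_under_cgroup(&target_cgroup, 0) != 1)
		return 0;

	__sync_fetch_and_add(&hits, 1);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

With this shape the cgroup filtering lives entirely in the bpf program, so
no per-cgroup, per-cpu perf_event has to be allocated in the kernel.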