Re: Per cgroup accounting of context switches

Yonghong Song <yhs@xxxxxx> · Sat, 7 Sep 2019 06:38:01 +0000

On 9/6/19 9:30 PM, Gautam Kulkarni wrote:
> Hi,
> 
> We are evaluating eBPF as a means to account voluntary and
> non-voluntary context switches against cgroups. Currently, this
> information is only present in the task_struct for an individual
> process and not in the cgroup data structure.
> 
> Given this context, I am looking for recommendations on the following
> possible approaches:
> 
> 1. Use the existing tracepoint (trace_sched_switch) as it exists here
> with BPF_PROG_TYPE_TRACEPOINT:
> https://github.com/torvalds/linux/blob/master/kernel/sched/core.c#L3877
> However, based on the trace format, the kernel does not expose
> prev->nivcsw and prev->nvcsw. Due to this, I feel like this approach
> may not be feasible. Is my understanding correct?

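You can use BPF_RAW_TRACEPOINT_OPEN to attach a raw tracepoint program
(BPF_PROG_TYPE_RAW_TRACEPOINT). A raw tracepoint program receives the
tracepoint's raw arguments, so `prev` will be available to the BPF
program.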
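
To make the accounting concrete, here is a minimal userspace sketch in
plain C of the logic such a raw tracepoint handler could implement. It is
not BPF code. The table, the struct names and on_sched_switch() are
hypothetical. The classification rule (a switch is voluntary when the
task was not preempted and its state is non-zero) is my reading of how
__schedule() picks between nvcsw and nivcsw, so treat it as an
assumption and check it against your kernel version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CGROUPS 64 /* hypothetical table size, stands in for a BPF hash map */

struct cswitch_counts {
    uint64_t cgroup_id;   /* key: id of the cgroup that prev belongs to */
    uint64_t voluntary;   /* switches where prev gave up the CPU */
    uint64_t involuntary; /* switches where prev was preempted */
    bool used;
};

struct cswitch_table {
    struct cswitch_counts slots[MAX_CGROUPS];
};

/* Find the entry for cgroup_id, creating it on first use (linear probing).
 * Returns NULL when the table is full. */
static struct cswitch_counts *lookup_or_init(struct cswitch_table *t,
                                             uint64_t cgroup_id)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_CGROUPS; i++) {
        struct cswitch_counts *s = &t->slots[(cgroup_id + i) % MAX_CGROUPS];

        if (s->used && s->cgroup_id == cgroup_id)
            return s;
        if (!s->used) {
            s->used = true;
            s->cgroup_id = cgroup_id;
            s->voluntary = 0;
            s->involuntary = 0;
            return s;
        }
    }
    return NULL;
}

/* Models one sched_switch event. preempt and prev_state stand for the
 * tracepoint's preempt flag and the state field read from prev. */
static void on_sched_switch(struct cswitch_table *t, bool preempt,
                            long prev_state, uint64_t prev_cgroup_id)
{
    struct cswitch_counts *c = lookup_or_init(t, prev_cgroup_id);

    if (c == NULL)
        return; /* table full: drop the sample */

    if (!preempt && prev_state != 0)
        c->voluntary++;   /* assumed to correspond to prev->nvcsw */
    else
        c->involuntary++; /* assumed to correspond to prev->nivcsw */
}
```

In a real BPF program the table would be a BPF hash map keyed by cgroup
id, and the fields of `prev` would have to be read with bpf_probe_read().
One edge case the sketch ignores: if the kernel resets the task state
before the tracepoint fires (for example when a signal is pending), the
state seen by the handler may not match what the kernel counted.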

> 
> 2. Attach a kprobe to __schedule() and use BPF_PROG_TYPE_KPROBE.
> This will allow us access to the prev pointer. From the prev
> (task_struct), we can access the cgroup and use an eBPF map to
> accumulate per cgroup counts of context switches.
> 
> 3. Implement a kernel module that attaches a kprobe to __schedule()
> and implement the map in the kprobe handler.
> 
> 4. Modify the kernel to have context switch information in task_group.
> Would this be something that would make sense to the community?
> 
> I would highly appreciate any feedback on this.
> 
> Regards,
> Gautam
>