Re: [tip:perf/core] perf: Add cgroup support

Stephane Eranian <eranian@xxxxxxxxxx> · Thu, 17 Feb 2011 12:16:08 +0100

Peter,

On Wed, Feb 16, 2011 at 5:57 PM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> On Wed, 2011-02-16 at 13:46 +0000, tip-bot for Stephane Eranian wrote:
>> +static inline struct perf_cgroup *
>> +perf_cgroup_from_task(struct task_struct *task)
>> +{
>> + Â Â Â return container_of(task_subsys_state(task, perf_subsys_id),
>> + Â Â Â Â Â Â Â Â Â Â Â struct perf_cgroup, css);
>> +}
>
> ===================================================
> [ INFO: suspicious rcu_dereference_check() usage. ]
> ---------------------------------------------------
> include/linux/cgroup.h:547 invoked rcu_dereference_check() without protection!
> other info that might help us debug this:
> rcu_scheduler_active = 1, debug_locks = 1
> 1 lock held by perf/1774:
> Â#0: Â(&ctx->lock){......}, at: [<ffffffff810afb91>] ctx_sched_in+0x2a/0x37b
> stack backtrace:
> Pid: 1774, comm: perf Not tainted 2.6.38-rc5-tip+ #94017
> Call Trace:
> Â[<ffffffff81070932>] ? lockdep_rcu_dereference+0x9d/0xa5
> Â[<ffffffff810afc4e>] ? ctx_sched_in+0xe7/0x37b
> Â[<ffffffff810aff37>] ? perf_event_context_sched_in+0x55/0xa3
> Â[<ffffffff810b0203>] ? __perf_event_task_sched_in+0x20/0x5b
> Â[<ffffffff81035714>] ? finish_task_switch+0x49/0xf4
> Â[<ffffffff81340d60>] ? schedule+0x9cc/0xa85
> Â[<ffffffff8110a84c>] ? vfsmount_lock_global_unlock_online+0x9e/0xb0
> Â[<ffffffff8110b556>] ? mntput_no_expire+0x4e/0xc1
> Â[<ffffffff8110b5ef>] ? mntput+0x26/0x28
> Â[<ffffffff810f2add>] ? fput+0x1a0/0x1af
> Â[<ffffffff81002eb9>] ? int_careful+0xb/0x2c
> Â[<ffffffff813432bf>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> Â[<ffffffff81002ec7>] ? int_careful+0x19/0x2c
>
>
I have lockedp enabled in my kernel and during all my tests
I never saw this warning. How did you trigger this?

> The simple fix seemed to be to add:
>
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index a0a6987..e739e6f 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -204,7 +204,8 @@ __get_cpu_context(struct perf_event_context *ctx)
> Âstatic inline struct perf_cgroup *
> Âperf_cgroup_from_task(struct task_struct *task)
> Â{
> - Â Â Â return container_of(task_subsys_state(task, perf_subsys_id),
> + Â Â Â return container_of(task_subsys_state_check(task, perf_subsys_id,
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â lockdep_is_held(&ctx->lock)),
> Â Â Â Â Â Â Â Â Â Â Â Âstruct perf_cgroup, css);
> Â}
>
> For all callers _should_ hold ctx->lock and ctx->lock is acquired during
> ->attach/->exit so holding that lock will pin the cgroup.
>
I am not sure I follow you here. Are you talking about cgroup_attach()
and cgroup_exit()? perf_cgroup_switch() does eventually grab ctx->lock
when it gets to the actual save and restore functions. But
perf_cgroup_from_task()
is called outside of those sections in perf_cgroup_switch().

> However, not all update_context_time()/update_cgrp_time_from_event()
> callers actually hold ctx->lock, which is a bug because that lock also
> serializes the timestamps.
>
> Most notably, task_clock_event_read(), which leads us to:
>

If the warning comes from invoking perf_cgroup_from_task(), then there is also
perf_cgroup_switch(). that one is not grabbing any ctx->lock either, but maybe
not on all paths.

> @@ -5794,9 +5795,14 @@ static void task_clock_event_read(struct perf_event *event)
> Â Â Â Âu64 time;
>
> Â Â Â Âif (!in_nmi()) {
> - Â Â Â Â Â Â Â update_context_time(event->ctx);
> + Â Â Â Â Â Â Â struct perf_event_context *ctx = event->ctx;
> + Â Â Â Â Â Â Â unsigned long flags;
> +
> + Â Â Â Â Â Â Â spin_lock_irqsave(&ctx->lock, flags);
> + Â Â Â Â Â Â Â update_context_time(ctx);
> Â Â Â Â Â Â Â Âupdate_cgrp_time_from_event(event);
> - Â Â Â Â Â Â Â time = event->ctx->time;
> + Â Â Â Â Â Â Â time = ctx->time;
> + Â Â Â Â Â Â Â spin_unlock_irqrestore(&ctx->lock, flags);
> Â Â Â Â} else {
> Â Â Â Â Â Â Â Âu64 now = perf_clock();
> Â Â Â Â Â Â Â Âu64 delta = now - event->ctx->timestamp;
>
>
> I then realized that the events themselves pin the cgroup, so its all
> cosmetic at best, but then I already had the below patch...
>
I assume by 'pin the group' you mean the cgroup cannot disappear
while there is at least one event pointing to it. That's is indeed true
thanks to refcounting (css_get()).

> Thoughts?
>
> ---
> Âkernel/perf_event.c | Â 30 ++++++++++++++++++------------
> Â1 files changed, 18 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index a0a6987..810ee49 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -202,9 +202,10 @@ __get_cpu_context(struct perf_event_context *ctx)
> Â#ifdef CONFIG_CGROUP_PERF
>
> Âstatic inline struct perf_cgroup *
> -perf_cgroup_from_task(struct task_struct *task)
> +perf_cgroup_from_task(struct task_struct *task, struct perf_event_context *ctx)
> Â{
> - Â Â Â return container_of(task_subsys_state(task, perf_subsys_id),
> + Â Â Â return container_of(task_subsys_state_check(task, perf_subsys_id,
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â lockdep_is_held(&ctx->lock)),
> Â Â Â Â Â Â Â Â Â Â Â Âstruct perf_cgroup, css);
> Â}
>
> @@ -268,7 +269,7 @@ static inline void update_cgrp_time_from_cpuctx(struct perf_cpu_context *cpuctx)
>
> Âstatic inline void update_cgrp_time_from_event(struct perf_event *event)
> Â{
> - Â Â Â struct perf_cgroup *cgrp = perf_cgroup_from_task(current);
> + Â Â Â struct perf_cgroup *cgrp = perf_cgroup_from_task(current, event->ctx);
> Â Â Â Â/*
> Â Â Â Â * do not update time when cgroup is not active
> Â Â Â Â */
> @@ -279,7 +280,7 @@ static inline void update_cgrp_time_from_event(struct perf_event *event)
> Â}
>
> Âstatic inline void
> -perf_cgroup_set_timestamp(struct task_struct *task, u64 now)
> +perf_cgroup_set_timestamp(struct task_struct *task, struct perf_event_context *ctx)
> Â{
> Â Â Â Âstruct perf_cgroup *cgrp;
> Â Â Â Âstruct perf_cgroup_info *info;
> @@ -287,9 +288,9 @@ perf_cgroup_set_timestamp(struct task_struct *task, u64 now)
> Â Â Â Âif (!task)
> Â Â Â Â Â Â Â Âreturn;
>
> - Â Â Â cgrp = perf_cgroup_from_task(task);
> + Â Â Â cgrp = perf_cgroup_from_task(task, ctx);
> Â Â Â Âinfo = this_cpu_ptr(cgrp->info);
> - Â Â Â info->timestamp = now;
> + Â Â Â info->timestamp = ctx->timestamp;
> Â}
>
> Â#define PERF_CGROUP_SWOUT Â Â Â0x1 /* cgroup switch out every event */
> @@ -349,7 +350,7 @@ void perf_cgroup_switch(struct task_struct *task, int mode)
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â * allow event_filter_match() to not
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â * have to pass task around
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â */
> - Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â cpuctx->cgrp = perf_cgroup_from_task(task);
> + Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â cpuctx->cgrp = perf_cgroup_from_task(task, &cpuctx->ctx);
> Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Âcpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
> Â Â Â Â Â Â Â Â Â Â Â Â}
> Â Â Â Â Â Â Â Â}
> @@ -494,7 +495,7 @@ static inline int perf_cgroup_connect(pid_t pid, struct perf_event *event,
> Â}
>
> Âstatic inline void
> -perf_cgroup_set_timestamp(struct task_struct *task, u64 now)
> +perf_cgroup_set_timestamp(struct task_struct *task, struct perf_event_context *ctx)
> Â{
> Â}
>
> @@ -1613,7 +1614,7 @@ static int __perf_event_enable(void *info)
> Â Â Â Â/*
> Â Â Â Â * set current task's cgroup time reference point
> Â Â Â Â */
> - Â Â Â perf_cgroup_set_timestamp(current, perf_clock());
> + Â Â Â perf_cgroup_set_timestamp(current, ctx);
>
> Â Â Â Â__perf_event_mark_enabled(event, ctx);
>
> @@ -2048,7 +2049,7 @@ ctx_sched_in(struct perf_event_context *ctx,
>
> Â Â Â Ânow = perf_clock();
> Â Â Â Âctx->timestamp = now;
> - Â Â Â perf_cgroup_set_timestamp(task, now);
> + Â Â Â perf_cgroup_set_timestamp(task, ctx);
> Â Â Â Â/*
> Â Â Â Â * First go through the list and put on any pinned groups
> Â Â Â Â * in order to give them the best chance of going on.
> @@ -5794,9 +5795,14 @@ static void task_clock_event_read(struct perf_event *event)
> Â Â Â Âu64 time;
>
> Â Â Â Âif (!in_nmi()) {
> - Â Â Â Â Â Â Â update_context_time(event->ctx);
> + Â Â Â Â Â Â Â struct perf_event_context *ctx = event->ctx;
> + Â Â Â Â Â Â Â unsigned long flags;
> +
> + Â Â Â Â Â Â Â spin_lock_irqsave(&ctx->lock, flags);
> + Â Â Â Â Â Â Â update_context_time(ctx);
> Â Â Â Â Â Â Â Âupdate_cgrp_time_from_event(event);
> - Â Â Â Â Â Â Â time = event->ctx->time;
> + Â Â Â Â Â Â Â time = ctx->time;
> + Â Â Â Â Â Â Â spin_unlock_irqrestore(&ctx->lock, flags);
> Â Â Â Â} else {
> Â Â Â Â Â Â Â Âu64 now = perf_clock();
> Â Â Â Â Â Â Â Âu64 delta = now - event->ctx->timestamp;
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-tip-commits" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html