On Thu, Feb 17, 2011 at 12:36 PM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> On Thu, 2011-02-17 at 12:16 +0100, Stephane Eranian wrote:
>> Peter,
>>
>> On Wed, Feb 16, 2011 at 5:57 PM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
>> > On Wed, 2011-02-16 at 13:46 +0000, tip-bot for Stephane Eranian wrote:
>> >> +static inline struct perf_cgroup *
>> >> +perf_cgroup_from_task(struct task_struct *task)
>> >> +{
>> >> +       return container_of(task_subsys_state(task, perf_subsys_id),
>> >> +                       struct perf_cgroup, css);
>> >> +}
>> >
>> > ===================================================
>> > [ INFO: suspicious rcu_dereference_check() usage. ]
>> > ---------------------------------------------------
>> > include/linux/cgroup.h:547 invoked rcu_dereference_check() without protection!
>> > other info that might help us debug this:
>> > rcu_scheduler_active = 1, debug_locks = 1
>> > 1 lock held by perf/1774:
>> >  #0:  (&ctx->lock){......}, at: [<ffffffff810afb91>] ctx_sched_in+0x2a/0x37b
>> > stack backtrace:
>> > Pid: 1774, comm: perf Not tainted 2.6.38-rc5-tip+ #94017
>> > Call Trace:
>> >  [<ffffffff81070932>] ? lockdep_rcu_dereference+0x9d/0xa5
>> >  [<ffffffff810afc4e>] ? ctx_sched_in+0xe7/0x37b
>> >  [<ffffffff810aff37>] ? perf_event_context_sched_in+0x55/0xa3
>> >  [<ffffffff810b0203>] ? __perf_event_task_sched_in+0x20/0x5b
>> >  [<ffffffff81035714>] ? finish_task_switch+0x49/0xf4
>> >  [<ffffffff81340d60>] ? schedule+0x9cc/0xa85
>> >  [<ffffffff8110a84c>] ? vfsmount_lock_global_unlock_online+0x9e/0xb0
>> >  [<ffffffff8110b556>] ? mntput_no_expire+0x4e/0xc1
>> >  [<ffffffff8110b5ef>] ? mntput+0x26/0x28
>> >  [<ffffffff810f2add>] ? fput+0x1a0/0x1af
>> >  [<ffffffff81002eb9>] ? int_careful+0xb/0x2c
>> >  [<ffffffff813432bf>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>> >  [<ffffffff81002ec7>] ? int_careful+0x19/0x2c
>> >
>> >
>> I have lockdep enabled in my kernel and during all my tests
>> I never saw this warning. How did you trigger this?
>
> CONFIG_PROVE_RCU=y, it's a bit of a shiny feature but most of the false
> positives are gone these days I think.
>
I have this one enabled, yet no message.

>> > The simple fix seemed to be to add:
>> >
>> > diff --git a/kernel/perf_event.c b/kernel/perf_event.c
>> > index a0a6987..e739e6f 100644
>> > --- a/kernel/perf_event.c
>> > +++ b/kernel/perf_event.c
>> > @@ -204,7 +204,8 @@ __get_cpu_context(struct perf_event_context *ctx)
>> >  static inline struct perf_cgroup *
>> >  perf_cgroup_from_task(struct task_struct *task)
>> >  {
>> > -       return container_of(task_subsys_state(task, perf_subsys_id),
>> > +       return container_of(task_subsys_state_check(task, perf_subsys_id,
>> > +                               lockdep_is_held(&ctx->lock)),
>> >                         struct perf_cgroup, css);
>> >  }
>> >
>> > For all callers _should_ hold ctx->lock and ctx->lock is acquired during
>> > ->attach/->exit so holding that lock will pin the cgroup.
>> >
>> I am not sure I follow you here. Are you talking about cgroup_attach()
>> and cgroup_exit()? perf_cgroup_switch() does eventually grab ctx->lock
>> when it gets to the actual save and restore functions. But
>> perf_cgroup_from_task() is called outside of those sections in
>> perf_cgroup_switch().
>
> Right, but there we hold rcu_read_lock().
>
> So what we're saying here is that it's OK to dereference the variable
> provided we hold either:
>  - rcu_read_lock
>  - task->alloc_lock
>  - cgroup_lock
>
> or
>
>  - ctx->lock
>
> task->alloc_lock and cgroup_lock both avoid any changes to the current
> task's cgroup due to kernel/cgroup.c locking. ctx->lock avoids this due
> to us taking that lock in perf_cgroup_attach() and perf_cgroup_exit()
> when this task is active.
>
We do not take ctx->lock in those functions (at least not directly).
Both functions end up in perf_cgroup_switch() which does rcu_read_lock()
for all its operations. ctx->lock becomes held once you get into
ctx_sched_out() or ctx_sched_in().
But according to what you're saying above, that should cover it.

>> > However, not all update_context_time()/update_cgrp_time_from_event()
>> > callers actually hold ctx->lock, which is a bug because that lock also
>> > serializes the timestamps.
>> >
>> > Most notably, task_clock_event_read(), which leads us to:
>> >
>>
>> If the warning comes from invoking perf_cgroup_from_task(), then there is also
>> perf_cgroup_switch(). That one is not grabbing any ctx->lock either, but maybe
>> not on all paths.
>>
>> > @@ -5794,9 +5795,14 @@ static void task_clock_event_read(struct perf_event *event)
>> >         u64 time;
>> >
>> >         if (!in_nmi()) {
>> > -               update_context_time(event->ctx);
>> > +               struct perf_event_context *ctx = event->ctx;
>> > +               unsigned long flags;
>> > +
>> > +               spin_lock_irqsave(&ctx->lock, flags);
>> > +               update_context_time(ctx);
>> >                 update_cgrp_time_from_event(event);
>> > -               time = event->ctx->time;
>> > +               time = ctx->time;
>> > +               spin_unlock_irqrestore(&ctx->lock, flags);
>> >         } else {
>> >                 u64 now = perf_clock();
>> >                 u64 delta = now - event->ctx->timestamp;
>
> I just thought we should probably kill the !in_nmi branch, I'm not quite
> sure why that exists..

I don't quite understand what this event is supposed to count in
system-wide mode. This function adds a time delta. It may be using the
wrong time source in cgroup mode.

Having said that, it seems to me like we may not even need the call to
update_cgrp_time_from_event() there. It is not even used to compute the
time delta in that function. Yet, we do get correct timings in cgroup
mode. Thus, I suspect the timing is taken care of by callers already
whenever needed. I looked at the pmu->read() callers, and it seems they
do exactly that. In summary, I believe we may be able to drop this call.
>
>> > I then realized that the events themselves pin the cgroup, so it's all
>> > cosmetic at best, but then I already had the below patch...
>> >
>> I assume by 'pin the group' you mean the cgroup cannot disappear
>> while there is at least one event pointing to it. That is indeed true
>> thanks to refcounting (css_get()).
>
> Right, that's what I was thinking, but now I think that's not
> sufficient, we can have cgroups without events but with tasks in for
> which the races are still valid.
>
But in that case, no perf_event code should be fiddling with cgroups.
I think there are guards for that, either is_cgroup_event() or
ctx->nr_cgroups. But it seems perf_cgroup_from_event() is the one
exception. So maybe we could rewrite it:

static inline void update_cgrp_time_from_event(struct perf_event *event)
{
        struct perf_cgroup *cgrp;

        if (!is_cgroup_event(event))
                return;

        cgrp = perf_cgroup_from_task(current);
        /*
         * do not update time when cgroup is not active
         */
        if (cgrp != event->cgrp)
                return;

        __update_cgrp_time(event->cgrp);
}

> Also:
>
> ---
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index a0a6987..ab28e56 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -7330,12 +7330,10 @@ static struct cgroup_subsys_state *perf_cgroup_create(
>         struct perf_cgroup_info *t;
>         int c;
>
> -       jc = kmalloc(sizeof(*jc), GFP_KERNEL);
> +       jc = kzalloc(sizeof(*jc), GFP_KERNEL);
>         if (!jc)
>                 return ERR_PTR(-ENOMEM);
>
> -       memset(jc, 0, sizeof(*jc));
> -
>         jc->info = alloc_percpu(struct perf_cgroup_info);
>         if (!jc->info) {
>                 kfree(jc);
>
Yep.
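
For anyone who wants to poke at the guard ordering proposed above for
update_cgrp_time_from_event() outside the kernel, here is a rough
userspace sketch. It is only a model under stated assumptions: every
mock_* type and function is hypothetical, the container_of() macro is a
simplified stand-in for the kernel one, and there is no locking or RCU
here at all, so it says nothing about the lockdep warning itself. It
only shows the check order: bail out for non-cgroup events first, and
only then look up the current task's cgroup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical mock types; the real kernel structs carry far more state. */
struct mock_css {
	int refcnt;
};

struct mock_cgroup {
	uint64_t time;        /* accumulated active time */
	uint64_t timestamp;   /* time of the last update */
	struct mock_css css;  /* embedded member, as in struct perf_cgroup */
};

struct mock_task {
	struct mock_css *perf_css;  /* what task_subsys_state() would return */
};

struct mock_event {
	struct mock_cgroup *cgrp;   /* NULL for a non-cgroup event */
};

/* Recover the enclosing cgroup from the task's embedded css pointer. */
static struct mock_cgroup *mock_cgroup_from_task(const struct mock_task *task)
{
	return container_of(task->perf_css, struct mock_cgroup, css);
}

static bool mock_is_cgroup_event(const struct mock_event *event)
{
	return event->cgrp != NULL;
}

/*
 * Model of the proposed rewrite. Returns true only when the cgroup time
 * was actually advanced, i.e. the event is a cgroup event and its cgroup
 * is the one the current task belongs to.
 */
static bool mock_update_cgrp_time_from_event(const struct mock_event *event,
					     const struct mock_task *curr,
					     uint64_t now)
{
	assert(event != NULL && curr != NULL);

	/* Guard first: non-cgroup events never touch cgroup state. */
	if (!mock_is_cgroup_event(event))
		return false;

	/* Do not update time when the event's cgroup is not active. */
	if (mock_cgroup_from_task(curr) != event->cgrp)
		return false;

	event->cgrp->time += now - event->cgrp->timestamp;
	event->cgrp->timestamp = now;
	return true;
}
```

With the guard in front, a plain per-task event never reaches the
cgroup lookup, which is the property the rewrite is after.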