Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> writes:

> On 16.06.2016 20:03, bsegall@xxxxxxxxxx wrote:
>> Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> writes:
>>
>>> A cgroup created inside a throttled group must inherit the current
>>> throttle_count. A broken throttle_count allows throttled entries to be
>>> nominated as the next buddy, which later leads to a NULL pointer
>>> dereference in pick_next_task_fair().
>>>
>>> This patch initializes cfs_rq->throttle_count at the first enqueue:
>>> laziness allows us to skip locking all rqs at group creation. The lazy
>>> approach also allows skipping a full sub-tree scan when throttling a
>>> hierarchy (not in this patch).
>>>
>>> Signed-off-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx>
>>> Cc: Stable <stable@xxxxxxxxxxxxxxx> # v3.2+
>>> ---
>>>  kernel/sched/fair.c  | 19 +++++++++++++++++++
>>>  kernel/sched/sched.h |  2 +-
>>>  2 files changed, 20 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 218f8e83db73..fe809fe169d2 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4185,6 +4185,25 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
>>>  	if (!cfs_bandwidth_used())
>>>  		return;
>>>
>>> +	/* synchronize hierarchical throttle counter */
>>> +	if (unlikely(!cfs_rq->throttle_uptodate)) {
>>> +		struct rq *rq = rq_of(cfs_rq);
>>> +		struct cfs_rq *pcfs_rq;
>>> +		struct task_group *tg;
>>> +
>>> +		cfs_rq->throttle_uptodate = 1;
>>> +		/* get closest uptodate node because leaves goes first */
>>> +		for (tg = cfs_rq->tg->parent; tg; tg = tg->parent) {
>>> +			pcfs_rq = tg->cfs_rq[cpu_of(rq)];
>>> +			if (pcfs_rq->throttle_uptodate)
>>> +				break;
>>> +		}
>>> +		if (tg) {
>>> +			cfs_rq->throttle_count = pcfs_rq->throttle_count;
>>> +			cfs_rq->throttled_clock_task = rq_clock_task(rq);
>>> +		}
>>> +	}
>>> +
>>
>> Checking just in enqueue is not sufficient - throttled_lb_pair() can check
>> against a cfs_rq that has never been enqueued (and possibly other paths).
>
> Looks like this is a minor problem: in the worst case the load balancer
> will migrate a task into a throttled hierarchy. And this can happen only
> once for each newly created group.
>
>> It might also make sense to go ahead and initialize all the cfs_rqs we
>> skipped over to avoid some n^2 pathological behavior. You could also use
>> throttle_count == -1 or < 0. (We had our own version of this patch that
>> I guess we forgot to push?)
>
> n^2 shouldn't be a problem while this happens only once for each group.

Yeah, it shouldn't be a problem; it's more of a "why not".

> throttle_count == -1 could be overwritten when the parent throttles or
> unthrottles before initialization. We could set it to INT_MIN/2 and check
> for < 0, but that would hide possible bugs here. One more int in the same
> cacheline shouldn't add noticeable overhead.

Yeah, you would have to check in tg_throttle_down()/tg_unthrottle_up() as
well to do this.
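Something along these lines, say (an untested, hand-written sketch that
won't apply as a patch, just to illustrate the sentinel variant -- it
assumes a freshly created non-root cfs_rq starts with a negative marker
like INT_MIN/2 and that check_enqueue_throttle() tests throttle_count < 0
instead of a separate throttle_uptodate flag):

 static int tg_throttle_down(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];

+	/* counter never synced on this cfs_rq: leave the marker alone */
+	if (cfs_rq->throttle_count < 0)
+		return 0;
+
 	/* group is entering throttled state, stop time */
 	if (!cfs_rq->throttle_count)
 		cfs_rq->throttled_clock_task = rq_clock_task(rq);
 	cfs_rq->throttle_count++;

 static int tg_unthrottle_up(struct task_group *tg, void *data)
 {
 	struct rq *rq = data;
 	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];

+	/* same here, otherwise the walk corrupts the marker */
+	if (cfs_rq->throttle_count < 0)
+		return 0;
+
 	cfs_rq->throttle_count--;

The separate flag avoids that extra branch in the throttle/unthrottle walk
and, as you say, doesn't hide a counter that goes negative by mistake.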
> I've also added the following to our kernel, to catch such problems
> without crashing. Probably it's worth adding upstream, because a stale
> buddy is a real pain)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5506,8 +5506,11 @@ static void set_last_buddy(struct sched_entity *se)
>  	if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
>  		return;
>
> -	for_each_sched_entity(se)
> +	for_each_sched_entity(se) {
> +		if (WARN_ON_ONCE(!se->on_rq))
> +			return;
>  		cfs_rq_of(se)->last = se;
> +	}
>  }
>
>  static void set_next_buddy(struct sched_entity *se)
> @@ -5515,8 +5518,11 @@ static void set_next_buddy(struct sched_entity *se)
>  	if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
>  		return;
>
> -	for_each_sched_entity(se)
> +	for_each_sched_entity(se) {
> +		if (WARN_ON_ONCE(!se->on_rq))
> +			return;
>  		cfs_rq_of(se)->next = se;
> +	}
>  }
>
>  static void set_skip_buddy(struct sched_entity *se)

>>>  	/* an active group must be handled by the update_curr()->put() path */
>>>  	if (!cfs_rq->runtime_enabled || cfs_rq->curr)
>>>  		return;
>>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>>> index 72f1f3087b04..7cbeb92a1cb9 100644
>>> --- a/kernel/sched/sched.h
>>> +++ b/kernel/sched/sched.h
>>> @@ -437,7 +437,7 @@ struct cfs_rq {
>>>
>>>  	u64 throttled_clock, throttled_clock_task;
>>>  	u64 throttled_clock_task_time;
>>> -	int throttled, throttle_count;
>>> +	int throttled, throttle_count, throttle_uptodate;
>>>  	struct list_head throttled_list;
>>>  #endif /* CONFIG_CFS_BANDWIDTH */
>>>  #endif /* CONFIG_FAIR_GROUP_SCHED */
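For reference, the load-balance path I mentioned reads the counter without
the cfs_rq ever having been enqueued; from memory it looks roughly like
this (the exact shape may differ between versions):

static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
{
	return cfs_bandwidth_used() && cfs_rq->throttle_count;
}

static inline int throttled_lb_pair(struct task_group *tg,
				    int src_cpu, int dest_cpu)
{
	struct cfs_rq *src_cfs_rq, *dest_cfs_rq;

	src_cfs_rq = tg->cfs_rq[src_cpu];
	dest_cfs_rq = tg->cfs_rq[dest_cpu];

	return throttled_hierarchy(src_cfs_rq) ||
	       throttled_hierarchy(dest_cfs_rq);
}

So a stale throttle_count on a group that has never been enqueued on that
CPU is enough to give the load balancer a wrong answer, which is why the
enqueue-only sync leaves this path uncovered.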