Alright, this prototype of "maybe we should just 100% avoid stranding runtime for a full period" appears to fix the fibtest synthetic example, and seems like a theoretically reasonable approach. Things that may want improvement or at least thought (but it's a holiday week in the US and I wanted to get opinions on the basic approach):

- I don't like cfs_rq->slack_list, since logically it's mutually exclusive with throttled_list, but we have to iterate without locks, so I don't know that we can avoid it.
- I previously was using _rcu for the slack list, like throttled, but there is no list_for_each_entry_rcu_safe, so the list_del_init would be invalid and we'd have to use another flag or open-code the equivalent.
- (Actually, this just made me realize that distribute is somewhat wrong if the unthrottled task immediately runs and rethrottles; this would just mean that we effectively restart the loop.)
- We unconditionally start the slack timer, even if nothing is throttled. We could instead have throttle start the timer in this case (setting the timeout to some definition of "appropriately"), but that bookkeeping would be a big hassle.
- We could try to do better than "nr_running == 0" at deciding which cfs_rqs are idle, possibly requiring that to have been true for some N < 5 ms, and restarting the slack timer if we didn't clear everything.
- Taking up to every rq->lock is bad and expensive, and 5ms may be too short a delay for this. I haven't tried microbenchmarks on the cost of this vs min_cfs_rq_runtime = 0 vs baseline.
- The runtime_expires vs expires_seq choice is basically rand(), much like in the existing code.
(I think the most consistent with existing code would be runtime_expires, since cfs_b->lock is held; I think the most consistent choice in general would be to change most of the existing uses to seq as well.)

-- >8 --
Subject: [PATCH] sched: avoid stranding cfs_bandwidth runtime

We avoid contention on the per-tg cfs_b->lock by keeping 1ms of runtime on a cfs_rq even when all tasks in that cfs_rq dequeue. This way tasks doing frequent wake/sleep can't hit this cross-cpu lock more than once per ms.

This however means that up to 1ms of runtime per cpu can be lost if no task wakes up on that cpu, which leads to problems for cgroups with low quota, many available cpus, and a mix of threads that run for very little time and threads that want to run constantly.

This was previously hidden by runtime expiration being broken, which allowed the stranded runtime to be kept indefinitely across period resets. Thus after an initial period or two any long-running tasks could use an appropriate portion of their group's quota. The problem was that the group could also burst for up to 1ms * cpus more than its quota allowed, and in these situations that is a significant increase.

Fix this by having the group's slack timer (which runs at most once per 5ms) remove all runtime from empty cfs_rqs, rather than only redistributing whatever runtime above that 1ms was returned immediately.
Signed-off-by: Ben Segall <bsegall@xxxxxxxxxx>
---
 kernel/sched/fair.c  | 66 +++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h |  2 ++
 2 files changed, 55 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 300f2c54dea5..80b2198d9b29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4745,23 +4745,32 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
 	s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
 
-	if (slack_runtime <= 0)
+	if (cfs_rq->runtime_remaining <= 0)
+		return;
+
+	if (slack_runtime <= 0 && !list_empty(&cfs_rq->slack_list))
 		return;
 
 	raw_spin_lock(&cfs_b->lock);
 	if (cfs_b->quota != RUNTIME_INF &&
-	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
-		cfs_b->runtime += slack_runtime;
+	    cfs_rq->expires_seq == cfs_b->expires_seq) {
+		if (slack_runtime > 0)
+			cfs_b->runtime += slack_runtime;
+		if (list_empty(&cfs_rq->slack_list))
+			list_add(&cfs_rq->slack_list, &cfs_b->slack_cfs_rq);
 
-		/* we are under rq->lock, defer unthrottling using a timer */
-		if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
-		    !list_empty(&cfs_b->throttled_cfs_rq))
-			start_cfs_slack_bandwidth(cfs_b);
+		/*
+		 * After a timeout, gather our remaining runtime so it can't get
+		 * stranded. We need a timer anyways to distribute any of the
+		 * runtime due to locking issues.
+		 */
+		start_cfs_slack_bandwidth(cfs_b);
 	}
 	raw_spin_unlock(&cfs_b->lock);
 
-	/* even if it's not valid for return we don't want to try again */
-	cfs_rq->runtime_remaining -= slack_runtime;
+	if (slack_runtime > 0)
+		/* even if it's not valid for return we don't want to try again */
+		cfs_rq->runtime_remaining -= slack_runtime;
 }
 
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
@@ -4781,12 +4790,41 @@ static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
  */
 static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 {
-	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
+	u64 runtime = 0;
 	unsigned long flags;
 	u64 expires;
+	struct cfs_rq *cfs_rq, *temp;
+	LIST_HEAD(temp_head);
+
+	local_irq_save(flags);
+
+	raw_spin_lock(&cfs_b->lock);
+	cfs_b->slack_started = false;
+	list_splice_init(&cfs_b->slack_cfs_rq, &temp_head);
+	raw_spin_unlock(&cfs_b->lock);
+
+
+	/* Gather all left over runtime from all rqs */
+	list_for_each_entry_safe(cfs_rq, temp, &temp_head, slack_list) {
+		struct rq *rq = rq_of(cfs_rq);
+		struct rq_flags rf;
+
+		rq_lock(rq, &rf);
+
+		raw_spin_lock(&cfs_b->lock);
+		list_del_init(&cfs_rq->slack_list);
+		if (!cfs_rq->nr_running && cfs_rq->runtime_remaining > 0 &&
+		    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
+			cfs_b->runtime += cfs_rq->runtime_remaining;
+			cfs_rq->runtime_remaining = 0;
+		}
+		raw_spin_unlock(&cfs_b->lock);
+
+		rq_unlock(rq, &rf);
+	}
 
 	/* confirm we're still not at a refresh boundary */
-	raw_spin_lock_irqsave(&cfs_b->lock, flags);
+	raw_spin_lock(&cfs_b->lock);
 	cfs_b->slack_started = false;
 	if (cfs_b->distribute_running) {
 		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
@@ -4798,7 +4836,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 		return;
 	}
 
-	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
+	if (cfs_b->quota != RUNTIME_INF)
 		runtime = cfs_b->runtime;
 
 	expires = cfs_b->runtime_expires;
@@ -4946,6 +4984,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
 	cfs_b->period = ns_to_ktime(default_cfs_period());
 
 	INIT_LIST_HEAD(&cfs_b->throttled_cfs_rq);
+	INIT_LIST_HEAD(&cfs_b->slack_cfs_rq);
 	hrtimer_init(&cfs_b->period_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_PINNED);
 	cfs_b->period_timer.function = sched_cfs_period_timer;
 	hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
@@ -4958,6 +4997,7 @@ static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
 {
 	cfs_rq->runtime_enabled = 0;
 	INIT_LIST_HEAD(&cfs_rq->throttled_list);
+	INIT_LIST_HEAD(&cfs_rq->slack_list);
 }
 
 void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b08dee29ef5e..3b272ee894fb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -345,6 +345,7 @@ struct cfs_bandwidth {
 	struct hrtimer		period_timer;
 	struct hrtimer		slack_timer;
 	struct list_head	throttled_cfs_rq;
+	struct list_head	slack_cfs_rq;
 
 	/* Statistics: */
 	int			nr_periods;
@@ -566,6 +567,7 @@ struct cfs_rq {
 	int			throttled;
 	int			throttle_count;
 	struct list_head	throttled_list;
+	struct list_head	slack_list;
 #endif /* CONFIG_CFS_BANDWIDTH */
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 };
-- 
2.22.0.410.gd8fdbe21b5-goog