I'd really appreciate some attention on this. Should I have marked the
subject as sched: instead? I heard through some back-channels that there
may be concern about the ability to use more cpu than allocated in a
given period. To that I say:

#1 The current behavior, where an application hits cpu throttling while
simultaneously accounting for much less cpu time than was allocated, is
a very poor user experience; e.g. being granted 0.5 cpu but only able to
use 0.1 cpu while simultaneously hitting throttling.

#2 This has been broken like this since at least v3.16-rc1, which is why
I ripped out most of the logic instead of trying to patch it again. I
proved this experimentally by adding a counter in expire_cfs_rq_runtime
that increments when runtime is expired. I can share the patches if that
would help. That means that user-interactive applications have been able
to over-use quota in a similar manner since June 2014, and no one has
noticed or complained. Now that the mechanism is "fixed", people are
starting to notice, and they are complaining loudly. See
https://github.com/kubernetes/kubernetes/issues/67577 and the many
tickets linked to that one.

#3 Even though it's true that you can use more cpu than allocated in a
period, that would require that you under-used quota in previous periods
by an amount equal to the overage. In effect you are still enforcing the
quota requirements, albeit over longer time frames than cfs_period_us
(I'm amenable to a documentation update to clarify this nuance).
Additionally, any single cpu run queue can only over-use by as much as
sched_cfs_bandwidth_slice_us, which defaults to 5ms. So other
applications on the same processor will be hindered by at most that
amount.

#4 cpu-bound applications will not be able to over-use in any period, as
the entirety of their quota will be consumed every period.

Your review would be much appreciated.
Thank you,
Dave

On Wed, Apr 10, 2019 at 5:21 PM Dave Chiluk <chiluk+linux@xxxxxxxxxx> wrote:
>
> It has been observed that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota. This use case is typical of user-interactive,
> non-cpu-bound web services, such as those running in kubernetes or
> mesos.
>
> This has been root caused to threads being allocated per-cpu bandwidth
> slices, not fully using those slices within the period, and then having
> that quota expire. This constant expiration of unused quota results in
> applications not being able to utilize the quota for which they are
> allocated.
>
> The expiration of quota was recently fixed by commit 512ac999d275
> ("sched/fair: Fix bandwidth timer clock drift condition"). Prior to
> that, it appears that this had been broken since at least commit
> 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period"), which was introduced in v3.16-rc1 in 2014. That
> commit added the following check, which resulted in runtime never being
> expired:
>
>         if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
>                 /* extend local deadline, drift is bounded above by 2 ticks */
>                 cfs_rq->runtime_expires += TICK_NSEC;
>
> Because this was broken for nearly 5 years, has only recently been
> fixed, and is now being noticed by many users running kubernetes
> (https://github.com/kubernetes/kubernetes/issues/67577), it is my
> opinion that the mechanisms around expiring runtime should be removed
> altogether.
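>
> To make the root cause above concrete, consider this minimal userspace
> simulation of per-cpu slice expiration (an illustrative sketch, not
> kernel code: the 8-cpu machine, the 50ms quota, and the 1ms of work per
> wakeup are assumed numbers; the 5ms slice is the
> sched_cfs_bandwidth_slice_us default):
>
> #include <stdio.h>
>
> #define NR_CPUS  8
> #define SLICE_NS 5000000ULL     /* 5ms bandwidth slice (default) */
> #define QUOTA_NS 50000000ULL    /* 50ms quota per period (assumed) */
> #define WORK_NS  1000000ULL     /* ~1ms of real work per wakeup (assumed) */
>
> int main(void)
> {
>         unsigned long long pool = QUOTA_NS, local[NR_CPUS] = {0};
>         unsigned long long used = 0, stranded = 0;
>         int cpu;
>
>         /* One short-lived thread wakes on each cpu: the cpu pulls a
>          * full slice from the global pool but burns only ~1ms of it. */
>         for (cpu = 0; cpu < NR_CPUS; cpu++) {
>                 pool -= SLICE_NS;
>                 local[cpu] += SLICE_NS;
>                 local[cpu] -= WORK_NS;
>                 used += WORK_NS;
>         }
>
>         /* Period boundary: with expiration in effect, every unused
>          * local slice is thrown away rather than remaining usable. */
>         for (cpu = 0; cpu < NR_CPUS; cpu++) {
>                 stranded += local[cpu];
>                 local[cpu] = 0;
>         }
>
>         printf("used %llums of %llums quota; %llums expired unused; "
>                "%llums left in pool\n",
>                used / 1000000, QUOTA_NS / 1000000,
>                stranded / 1000000, pool / 1000000);
>         return 0;
> }
>
> This prints "used 8ms of 50ms quota; 32ms expired unused; 10ms left in
> pool": with enough wakeups per period the global pool drains into
> slices that expire unused, and the group throttles despite having
> consumed only a small fraction of its quota.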
>
> This allows quota runtime slices allocated to per-cpu runqueues to live
> longer than the period boundary. This allows threads on runqueues that
> do not use much CPU to continue to use their remaining slice over a
> longer period of time than cpu.cfs_period_us. More importantly, this
> prevents the above condition of hitting throttling while also not fully
> utilizing your cpu quota.
>
> This theoretically allows a machine to use slightly more than its
> allotted quota in some periods. This overflow would be equal to the
> amount of quota that was left unused on cfs_rqs in the previous period.
> For cpu-bound tasks this will change nothing, as they should
> theoretically fully utilize all of their quota in each period. For
> user-interactive tasks as described above, this provides a much better
> user/application experience, as their cpu utilization will more closely
> match the amount they requested when they hit throttling.
>
> This greatly improves the performance of high-thread-count, interactive
> applications with low cfs_quota_us allocations on high-core-count
> machines. With an artificial test case (a rough sketch of such a test
> appears after the patch), this has been observed to be almost a 30x
> performance improvement, while still maintaining correct cpu quota
> restrictions, albeit over longer time intervals than cpu.cfs_period_us.
>
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <chiluk+linux@xxxxxxxxxx>
> ---
>  kernel/sched/fair.c  | 71 +++++-----------------------------------------------
>  kernel/sched/sched.h |  4 ---
>  2 files changed, 6 insertions(+), 69 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index fdab7eb..b0c3d76 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4291,8 +4291,6 @@ void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
>
>  	now = sched_clock_cpu(smp_processor_id());
>  	cfs_b->runtime = cfs_b->quota;
> -	cfs_b->runtime_expires = now + ktime_to_ns(cfs_b->period);
> -	cfs_b->expires_seq++;
>  }
>
>  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group *tg)
> @@ -4314,8 +4312,7 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  {
>  	struct task_group *tg = cfs_rq->tg;
>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(tg);
> -	u64 amount = 0, min_amount, expires;
> -	int expires_seq;
> +	u64 amount = 0, min_amount;
>
>  	/* note: this is a positive sum as runtime_remaining <= 0 */
>  	min_amount = sched_cfs_bandwidth_slice() - cfs_rq->runtime_remaining;
> @@ -4332,61 +4329,17 @@ static int assign_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  			cfs_b->idle = 0;
>  		}
>  	}
> -	expires_seq = cfs_b->expires_seq;
> -	expires = cfs_b->runtime_expires;
>  	raw_spin_unlock(&cfs_b->lock);
>
>  	cfs_rq->runtime_remaining += amount;
> -	/*
> -	 * we may have advanced our local expiration to account for allowed
> -	 * spread between our sched_clock and the one on which runtime was
> -	 * issued.
> -	 */
> -	if (cfs_rq->expires_seq != expires_seq) {
> -		cfs_rq->expires_seq = expires_seq;
> -		cfs_rq->runtime_expires = expires;
> -	}
>
>  	return cfs_rq->runtime_remaining > 0;
>  }
>
> -/*
> - * Note: This depends on the synchronization provided by sched_clock and the
> - * fact that rq->clock snapshots this value.
> - */
> -static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> -{
> -	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> -
> -	/* if the deadline is ahead of our clock, nothing to do */
> -	if (likely((s64)(rq_clock(rq_of(cfs_rq)) - cfs_rq->runtime_expires) < 0))
> -		return;
> -
> -	if (cfs_rq->runtime_remaining < 0)
> -		return;
> -
> -	/*
> -	 * If the local deadline has passed we have to consider the
> -	 * possibility that our sched_clock is 'fast' and the global deadline
> -	 * has not truly expired.
> -	 *
> -	 * Fortunately we can check determine whether this the case by checking
> -	 * whether the global deadline(cfs_b->expires_seq) has advanced.
> -	 */
> -	if (cfs_rq->expires_seq == cfs_b->expires_seq) {
> -		/* extend local deadline, drift is bounded above by 2 ticks */
> -		cfs_rq->runtime_expires += TICK_NSEC;
> -	} else {
> -		/* global deadline is ahead, expiration has passed */
> -		cfs_rq->runtime_remaining = 0;
> -	}
> -}
> -
>  static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
>  {
>  	/* dock delta_exec before expiring quota (as it could span periods) */
>  	cfs_rq->runtime_remaining -= delta_exec;
> -	expire_cfs_rq_runtime(cfs_rq);
>
>  	if (likely(cfs_rq->runtime_remaining > 0))
>  		return;
> @@ -4577,8 +4530,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  		resched_curr(rq);
>  }
>
> -static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
> -		u64 remaining, u64 expires)
> +static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b, u64 remaining)
>  {
>  	struct cfs_rq *cfs_rq;
>  	u64 runtime;
> @@ -4600,7 +4552,6 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>  		remaining -= runtime;
>
>  		cfs_rq->runtime_remaining += runtime;
> -		cfs_rq->runtime_expires = expires;
>
>  		/* we check whether we're throttled above */
>  		if (cfs_rq->runtime_remaining > 0)
> @@ -4625,7 +4576,7 @@ static u64 distribute_cfs_runtime(struct cfs_bandwidth *cfs_b,
>   */
>  static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, unsigned long flags)
>  {
> -	u64 runtime, runtime_expires;
> +	u64 runtime;
>  	int throttled;
>
>  	/* no need to continue the timer with no bandwidth constraint */
> @@ -4653,8 +4604,6 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
>  	/* account preceding periods in which throttling occurred */
>  	cfs_b->nr_throttled += overrun;
>
> -	runtime_expires = cfs_b->runtime_expires;
> -
>  	/*
>  	 * This check is repeated as we are holding onto the new bandwidth while
>  	 * we unthrottle. This can potentially race with an unthrottled group
> @@ -4667,8 +4616,7 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
>  		cfs_b->distribute_running = 1;
>  		raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  		/* we can't nest cfs_b->lock while distributing bandwidth */
> -		runtime = distribute_cfs_runtime(cfs_b, runtime,
> -						 runtime_expires);
> +		runtime = distribute_cfs_runtime(cfs_b, runtime);
>  		raw_spin_lock_irqsave(&cfs_b->lock, flags);
>
>  		cfs_b->distribute_running = 0;
> @@ -4745,8 +4693,7 @@ static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  		return;
>
>  	raw_spin_lock(&cfs_b->lock);
> -	if (cfs_b->quota != RUNTIME_INF &&
> -	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> +	if (cfs_b->quota != RUNTIME_INF) {
>  		cfs_b->runtime += slack_runtime;
>
>  		/* we are under rq->lock, defer unthrottling using a timer */
> @@ -4779,7 +4726,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  {
>  	u64 runtime = 0, slice = sched_cfs_bandwidth_slice();
>  	unsigned long flags;
> -	u64 expires;
>
>  	/* confirm we're still not at a refresh boundary */
>  	raw_spin_lock_irqsave(&cfs_b->lock, flags);
> @@ -4796,7 +4742,6 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  	if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
>  		runtime = cfs_b->runtime;
>
> -	expires = cfs_b->runtime_expires;
>  	if (runtime)
>  		cfs_b->distribute_running = 1;
>
> @@ -4805,11 +4750,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  	if (!runtime)
>  		return;
>
> -	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> +	runtime = distribute_cfs_runtime(cfs_b, runtime);
>
>  	raw_spin_lock_irqsave(&cfs_b->lock, flags);
> -	if (expires == cfs_b->runtime_expires)
> -		lsub_positive(&cfs_b->runtime, runtime);
>  	cfs_b->distribute_running = 0;
>  	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  }
> @@ -4940,8 +4883,6 @@ void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>
>  	cfs_b->period_active = 1;
>  	overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> -	cfs_b->runtime_expires += (overrun + 1) * ktime_to_ns(cfs_b->period);
> -	cfs_b->expires_seq++;
>  	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
>  }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index efa686e..69d9bf9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -341,8 +341,6 @@ struct cfs_bandwidth {
>  	u64			quota;
>  	u64			runtime;
>  	s64			hierarchical_quota;
> -	u64			runtime_expires;
> -	int			expires_seq;
>
>  	short			idle;
>  	short			period_active;
> @@ -562,8 +560,6 @@ struct cfs_rq {
>
>  #ifdef CONFIG_CFS_BANDWIDTH
>  	int			runtime_enabled;
> -	int			expires_seq;
> -	u64			runtime_expires;
>  	s64			runtime_remaining;
>
>  	u64			throttled_clock;
> --
> 1.8.3.1
>
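
P.S. Below is a rough sketch of the kind of artificial test case described
in the commit message (illustrative only, not the exact test I used; the
thread count, iteration count, burst/sleep lengths, and the cgroup paths
and values in the comment are all assumptions). It spawns many mostly-idle
worker threads whose aggregate demand (~0.3 cpu) stays below a 0.5 cpu
quota, yet on affected kernels cpu.stat still reports a large nr_throttled:

/*
 * Build: gcc -O2 -pthread throttle_test.c -o throttle_test
 * Run it inside a cpu cgroup, e.g. (assumed paths/values):
 *   mkdir /sys/fs/cgroup/cpu/test
 *   echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_period_us
 *   echo  50000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
 *   echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
 * then compare nr_throttled in cpu.stat against the cpu time printed.
 */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define NR_THREADS 64                   /* assumed: well above cpu count */
#define NR_ITERS   50
#define BURST_NS   1000000L             /* ~1ms of work per wakeup */

static void *worker(void *arg)
{
        struct timespec start, now;
        int i;

        (void)arg;
        for (i = 0; i < NR_ITERS; i++) {
                /* burn ~1ms of cpu time ... */
                clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
                do {
                        clock_gettime(CLOCK_THREAD_CPUTIME_ID, &now);
                } while ((now.tv_sec - start.tv_sec) * 1000000000L +
                         (now.tv_nsec - start.tv_nsec) < BURST_NS);
                /* ... then go idle for two full periods */
                usleep(200 * 1000);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NR_THREADS];
        struct timespec cpu;
        int i;

        for (i = 0; i < NR_THREADS; i++)
                pthread_create(&tid[i], NULL, worker, NULL);
        for (i = 0; i < NR_THREADS; i++)
                pthread_join(tid[i], NULL);

        /* total cpu time the whole process actually consumed */
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu);
        printf("consumed %ld.%03lds of cpu time\n",
               (long)cpu.tv_sec, cpu.tv_nsec / 1000000);
        return 0;
}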