Dave Chiluk <chiluk+linux@xxxxxxxxxx> writes:

> It has been observed, that highly-threaded, user-interactive
> applications running under cpu.cfs_quota_us constraints can hit a high
> percentage of periods throttled while simultaneously not consuming the
> allocated amount of quota. This impacts user-interactive non-cpu bound
> applications, such as those running in kubernetes or mesos when run on
> multiple cores.
>
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. This
> results in min_cfs_rq_runtime remaining on each per-cpu cfs_rq. At the
> end of the period this remaining quota goes unused and expires. This
> expiration of unused time on per-cpu runqueues results in applications
> under-utilizing their quota while simultaneously hitting throttling.
>
> The solution is to return all spare cfs_rq->runtime_remaining when
> cfs_b->runtime nears the sched_cfs_bandwidth_slice. This balances the
> desire to prevent cfs_rq from always pulling quota with the desire to
> allow applications to fully utilize their quota.
>
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <chiluk+linux@xxxxxxxxxx>
> ---
>  kernel/sched/fair.c | 19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..4894eda 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4695,7 +4695,9 @@ static int do_sched_cfs_period_timer(struct cfs_bandwidth *cfs_b, int overrun, u
>  	return 1;
>  }
>  
> -/* a cfs_rq won't donate quota below this amount */
> +/* a cfs_rq won't donate quota below this amount unless cfs_b has very little
> + * remaining runtime.
> + */
>  static const u64 min_cfs_rq_runtime = 1 * NSEC_PER_MSEC;
>  /* minimum remaining period time to redistribute slack quota */
>  static const u64 min_bandwidth_expiration = 2 * NSEC_PER_MSEC;
> @@ -4743,16 +4745,27 @@ static void start_cfs_slack_bandwidth(struct cfs_bandwidth *cfs_b)
>  static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
>  {
>  	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> -	s64 slack_runtime = cfs_rq->runtime_remaining - min_cfs_rq_runtime;
> +	s64 slack_runtime = cfs_rq->runtime_remaining;
>  
> +	/* There is no runtime to return. */
>  	if (slack_runtime <= 0)
>  		return;
>  
>  	raw_spin_lock(&cfs_b->lock);
>  	if (cfs_b->quota != RUNTIME_INF &&
>  	    cfs_rq->runtime_expires == cfs_b->runtime_expires) {
> -		cfs_b->runtime += slack_runtime;
> +		/* As we near 0 quota remaining on cfs_b start returning all
> +		 * remaining runtime. This avoids stranding and then expiring
> +		 * runtime on per-cpu cfs_rq.
> +		 *
> +		 * cfs->b has plenty of runtime leave min_cfs_rq_runtime of
> +		 * runtime on this cfs_rq.
> +		 */
> +		if (cfs_b->runtime >= sched_cfs_bandwidth_slice() * 3 &&
> +		    slack_runtime > min_cfs_rq_runtime)
> +			slack_runtime -= min_cfs_rq_runtime;
>  
> +		cfs_b->runtime += slack_runtime;
>  		/* we are under rq->lock, defer unthrottling using a timer */
>  		if (cfs_b->runtime > sched_cfs_bandwidth_slice() &&
>  		    !list_empty(&cfs_b->throttled_cfs_rq))

This still has a similar cost to reducing min_cfs_rq_runtime to 0 - we
now take a tg-global lock on every group se dequeue. Setting min=0
means that we have to take it on both enqueue and dequeue, while
baseline means we take it once per min_cfs_rq_runtime in the worst
case.
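
To put rough numbers on that lock-frequency point, here is a toy
userspace model - emphatically not kernel code. The 200us burst length,
the cycle count, and the assumption that cfs_b always has at least three
slices left are all made up for illustration; it only models one cfs_rq
that wakes, runs briefly, and sleeps.

/*
 * Toy model: count how often the global cfs_b->lock would be taken for
 * one cfs_rq under the three policies being compared.
 */
#include <stdio.h>

#define SLICE_US   5000         /* sched_cfs_bandwidth_slice() default (5ms) */
#define MIN_RQ_US  1000         /* min_cfs_rq_runtime (1ms)                  */
#define BURST_US    200         /* hypothetical runtime used per wakeup      */
#define CYCLES     1000         /* hypothetical number of wake/sleep cycles  */

enum policy { BASELINE, PATCH, MIN_ZERO };

static long simulate(enum policy p)
{
        long local = 0, locks = 0;

        for (int i = 0; i < CYCLES; i++) {
                /* enqueue/run: pull a fresh slice if local runtime is gone */
                if (local <= 0) {
                        locks++;                /* assign_cfs_rq_runtime()   */
                        local += SLICE_US;
                }
                local -= BURST_US;

                /* dequeue: maybe return slack, as __return_cfs_rq_runtime() */
                switch (p) {
                case BASELINE:
                        /* lock only when above min_cfs_rq_runtime */
                        if (local > MIN_RQ_US) {
                                locks++;
                                local = MIN_RQ_US;
                        }
                        break;
                case PATCH:
                        /*
                         * Lock whenever anything remains; keep at most 1ms,
                         * assuming cfs_b->runtime stays >= 3 slices.
                         */
                        if (local > 0) {
                                locks++;
                                local = local > MIN_RQ_US ? MIN_RQ_US : 0;
                        }
                        break;
                case MIN_ZERO:
                        /* min_cfs_rq_runtime = 0: return everything */
                        if (local > 0) {
                                locks++;
                                local = 0;
                        }
                        break;
                }
        }
        return locks;
}

int main(void)
{
        printf("baseline: %ld global lock acquisitions\n", simulate(BASELINE));
        printf("patch:    %ld global lock acquisitions\n", simulate(PATCH));
        printf("min=0:    %ld global lock acquisitions\n", simulate(MIN_ZERO));
        return 0;
}

With these made-up numbers the baseline hits the global lock roughly
once per min_cfs_rq_runtime's worth of consumption, while the patched
and min=0 variants hit it one to two times per wakeup.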
In addition, how much this helps is very dependent on the exact
sleep/wake pattern - you can still strand all but 15ms of runtime with
a pretty reasonable pattern. If the cost of taking this global lock
across all cpus without any ratelimit were somehow not a problem, I'd
much prefer to just set min_cfs_rq_runtime = 0. (Assuming it is a
problem, I definitely prefer the "lie and sorta have 2x period, 2x
runtime" solution of removing expiration.)
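
For reference, that 15ms lines up with the patch's cutoff of three
bandwidth slices at the default slice size. A quick back-of-the-envelope
sketch, illustrative userspace arithmetic only (the 64-cpu cgroup is a
made-up example, not anything measured):

#include <stdio.h>

int main(void)
{
        const long slice_ms = 5;   /* sysctl_sched_cfs_bandwidth_slice default */
        const long min_rq_ms = 1;  /* min_cfs_rq_runtime */
        const long nr_cpus = 64;   /* hypothetical machine size */

        /*
         * The patch only forces a full return of local runtime once
         * cfs_b->runtime drops below three slices; above that cutoff each
         * cfs_rq that goes idle still keeps min_cfs_rq_runtime, which can
         * then expire unused at the end of the period.
         */
        printf("full-return cutoff: %ld ms\n", 3 * slice_ms);
        printf("possible retention across %ld idle cfs_rqs: %ld ms\n",
               nr_cpus, nr_cpus * min_rq_ms);
        return 0;
}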