Dave Chiluk <chiluk+linux@xxxxxxxxxx> writes: > It has been observed, that highly-threaded, non-cpu-bound applications > running under cpu.cfs_quota_us constraints can hit a high percentage of > periods throttled while simultaneously not consuming the allocated > amount of quota. This use case is typical of user-interactive non-cpu > bound applications, such as those running in kubernetes or mesos when > run on multiple cpu cores. > > This has been root caused to threads being allocated per cpu bandwidth > slices, and then not fully using that slice within the period. At which > point the slice and quota expires. This expiration of unused slice > results in applications not being able to utilize the quota for which > they are allocated. > > The expiration of per-cpu slices was recently fixed by > 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift > condition")'. Prior to that it appears that this has been broken since > at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some > cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That > added the following conditional which resulted in slices never being > expired. > > if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { > /* extend local deadline, drift is bounded above by 2 ticks */ > cfs_rq->runtime_expires += TICK_NSEC; > > Because this was broken for nearly 5 years, and has recently been fixed > and is now being noticed by many users running kubernetes > (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion > that the mechanisms around expiring runtime should be removed > altogether. > > This allows quota already allocated to per-cpu run-queues to live longer > than the period boundary. This allows threads on runqueues that do not > use much CPU to continue to use their remaining slice over a longer > period of time than cpu.cfs_period_us. However, this helps prevents the > above condition of hitting throttling while also not fully utilizing > your cpu quota. > > This theoretically allows a machine to use slightly more than its > allotted quota in some periods. This overflow would be bounded by the > remaining quota left on each per-cpu runqueueu. This is typically no > more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will > change nothing, as they should theoretically fully utilize all of their > quota in each period. For user-interactive tasks as described above this > provides a much better user/application experience as their cpu > utilization will more closely match the amount they requested when they > hit throttling. This means that cpu limits no longer strictly apply per > period for non-cpu bound applications, but that they are still accurate > over longer timeframes. > > This greatly improves performance of high-thread-count, non-cpu bound > applications with low cfs_quota_us allocation on high-core-count > machines. In the case of an artificial testcase (10ms/100ms of quota on > 80 CPU machine), this commit resulted in almost 30x performance > improvement, while still maintaining correct cpu quota restrictions. > That testcase is available at https://github.com/indeedeng/fibtest. > > Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition") > Signed-off-by: Dave Chiluk <chiluk+linux@xxxxxxxxxx> > --- > Documentation/scheduler/sched-bwc.txt | 73 ++++++++++++++++++++++++++++------- > kernel/sched/fair.c | 71 +++------------------------------- > kernel/sched/sched.h | 4 -- > 3 files changed, 65 insertions(+), 83 deletions(-) > > [...] > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index f35930f..a675c69 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b) > if (!runtime) > return; > > - runtime = distribute_cfs_runtime(cfs_b, runtime, expires); > + runtime = distribute_cfs_runtime(cfs_b, runtime); > > raw_spin_lock_irqsave(&cfs_b->lock, flags); > - if (expires == cfs_b->runtime_expires) > - lsub_positive(&cfs_b->runtime, runtime); The lsub_positive is still needed, just get rid of the if. Other than that, Reviewed-by: Ben Segall <bsegall@xxxxxxxxxx>