Re: [PATCH v5 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

bsegall@xxxxxxxxxx · Tue, 16 Jul 2019 12:58:06 -0700

Dave Chiluk <chiluk+linux@xxxxxxxxxx> writes:

> It has been observed, that highly-threaded, non-cpu-bound applications
> running under cpu.cfs_quota_us constraints can hit a high percentage of
> periods throttled while simultaneously not consuming the allocated
> amount of quota. This use case is typical of user-interactive non-cpu
> bound applications, such as those running in kubernetes or mesos when
> run on multiple cpu cores.
>
> This has been root caused to threads being allocated per cpu bandwidth
> slices, and then not fully using that slice within the period. At which
> point the slice and quota expires. This expiration of unused slice
> results in applications not being able to utilize the quota for which
> they are allocated.
>
> The expiration of per-cpu slices was recently fixed by
> 'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
> condition")'. Prior to that it appears that this has been broken since
> at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
> cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
> added the following conditional which resulted in slices never being
> expired.
>
> if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
> 	/* extend local deadline, drift is bounded above by 2 ticks */
> 	cfs_rq->runtime_expires += TICK_NSEC;
>
> Because this was broken for nearly 5 years, and has recently been fixed
> and is now being noticed by many users running kubernetes
> (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
> that the mechanisms around expiring runtime should be removed
> altogether.
>
> This allows quota already allocated to per-cpu run-queues to live longer
> than the period boundary. This allows threads on runqueues that do not
> use much CPU to continue to use their remaining slice over a longer
> period of time than cpu.cfs_period_us. However, this helps prevents the
> above condition of hitting throttling while also not fully utilizing
> your cpu quota.
>
> This theoretically allows a machine to use slightly more than its
> allotted quota in some periods. This overflow would be bounded by the
> remaining quota left on each per-cpu runqueueu. This is typically no
> more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will
> change nothing, as they should theoretically fully utilize all of their
> quota in each period. For user-interactive tasks as described above this
> provides a much better user/application experience as their cpu
> utilization will more closely match the amount they requested when they
> hit throttling. This means that cpu limits no longer strictly apply per
> period for non-cpu bound applications, but that they are still accurate
> over longer timeframes.
>
> This greatly improves performance of high-thread-count, non-cpu bound
> applications with low cfs_quota_us allocation on high-core-count
> machines. In the case of an artificial testcase (10ms/100ms of quota on
> 80 CPU machine), this commit resulted in almost 30x performance
> improvement, while still maintaining correct cpu quota restrictions.
> That testcase is available at https://github.com/indeedeng/fibtest.
>
> Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
> Signed-off-by: Dave Chiluk <chiluk+linux@xxxxxxxxxx>
> ---
>  Documentation/scheduler/sched-bwc.txt | 73 ++++++++++++++++++++++++++++-------
>  kernel/sched/fair.c                   | 71 +++-------------------------------
>  kernel/sched/sched.h                  |  4 --
>  3 files changed, 65 insertions(+), 83 deletions(-)
>
> [...]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index f35930f..a675c69 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4809,11 +4754,9 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>  	if (!runtime)
>  		return;
>  
> -	runtime = distribute_cfs_runtime(cfs_b, runtime, expires);
> +	runtime = distribute_cfs_runtime(cfs_b, runtime);
>  
>  	raw_spin_lock_irqsave(&cfs_b->lock, flags);
> -	if (expires == cfs_b->runtime_expires)
> -		lsub_positive(&cfs_b->runtime, runtime);

The lsub_positive is still needed, just get rid of the if.

Other than that,
Reviewed-by: Ben Segall <bsegall@xxxxxxxxxx>